
Python 3+: read a text file and write it to a new file, excluding ranges of lines

writefile readfile text python

2

probat · Tech Community · 7 years ago

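I'm using Python 3.6 on a Windows machine. I read a text file using with open() and readlines(). After reading in the lines, I want to write some of them to a new text file while excluding certain ranges of lines. I don't know the line numbers of the lines to exclude. The text files are large, and the ranges to exclude vary from file to file. There are known keywords I can search for to find the beginning and end of each range that should be left out of the file I'm writing.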

I've searched everywhere online but can't seem to find an elegant solution that works. Below is an example of what I'm trying to achieve.

a  
b  
BEGIN  
c  
d  
e  
END  
f  
g  
h  
i  
j  
BEGIN  
k  
l  
m  
n  
o  
p  
q  
END  
r  
s  
t  
u  
v  
BEGIN  
w  
x  
y  
END  
z

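In summary, I want to read the above into Python and then write a new file that excludes every line from a BEGIN keyword through the matching END keyword, inclusive.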

The new file should contain the following:

a  
b  
f  
g  
h  
i  
j  
r  
s  
t  
u  
v  
z

3 replies | latest 7 years ago

1

Rob Hansen · 7 years ago

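As you say, if the text files are large, you should avoid readlines(), because it loads the entire file into memory. Instead, read line by line and use a state variable to track whether you're inside a block whose output should be suppressed. Something like: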

import re

begin_re = re.compile("^BEGIN.*$")
end_re = re.compile("^END.*$")
should_write = True

with open("input.txt") as input_fh:
    with open("output.txt", "w", encoding="UTF-8") as output_fh:
        for line in input_fh:
            # Strip only the line ending (print() adds its own newline),
            # so any leading indentation on kept lines is preserved
            line = line.rstrip("\r\n")

            if begin_re.match(line):
                should_write = False
            if should_write:
                print(line, file=output_fh)
            if end_re.match(line):
                should_write = True
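
The same state-variable idea can be packaged as a generator, so it works on any iterable of lines and not just on files. This is only a sketch: the name lines_outside_blocks and the begin/end parameters are made up for illustration, and it assumes a marker line is any line that starts with the keyword once surrounding whitespace is removed (which, like the regexes above, would also treat a line such as "BEGINNING" as a marker).

```python
def lines_outside_blocks(lines, begin="BEGIN", end="END"):
    """Yield the lines that are not inside a begin..end block.

    Hypothetical helper: a marker is any line that starts with the
    keyword after surrounding whitespace is stripped.
    """
    skipping = False
    for line in lines:
        stripped = line.strip()
        if not skipping and stripped.startswith(begin):
            skipping = True   # drop the BEGIN line and what follows
        elif skipping and stripped.startswith(end):
            skipping = False  # the END line itself is dropped too
        elif not skipping:
            yield line


sample = ["a\n", "b\n", "BEGIN\n", "c\n", "END\n", "f\n"]
kept = list(lines_outside_blocks(sample))
```

With real files you would pass the open input handle as lines and hand the generator to output_fh.writelines(), which keeps only one line in memory at a time.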

2

1

Ashish Ranjan · 7 years ago

You can do this with the following regular expression:

regex = r"(\bBEGIN\b([\w\n]*?)\bEND\b\n)"

Live demo here

Match with the regex above, then replace each match with an empty string ('').

Here's a working example in Python.

Code

>>> import re
>>> result = re.sub(regex, '', test_str)  # test_str is your file's content
>>> print(result)
a
b
f
g
h
i
j
r
s
t
u
v
z
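
One caveat: [\w\n] only matches word characters and newlines, so a block that contains spaces or punctuation would not be removed by the pattern above. A variant that tolerates arbitrary block content might look like the following. The pattern and the name strip_blocks are a sketch of my own, and like the regex answer above it assumes the whole file fits in memory as one string.

```python
import re

# From a line starting with BEGIN, lazily across lines, up to a line
# starting with END, plus the rest of that END line and its newline.
BLOCK_RE = re.compile(r"^BEGIN\b.*?^END\b[^\n]*\n?", re.MULTILINE | re.DOTALL)


def strip_blocks(text):
    """Return text with every BEGIN..END block (markers included) removed."""
    return BLOCK_RE.sub("", text)


# Sample with trailing spaces on the markers and non-word characters inside
test_str = "a\nb\nBEGIN  \nc d\ne!\nEND  \nf\n"
result = strip_blocks(test_str)
```

re.MULTILINE lets ^ match at the start of each line, and re.DOTALL lets . run across newlines; the lazy .*? stops at the first END line so separate blocks are not merged into one match.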

3

0

actionjezus6 · 7 years ago

Have you tried something like this:

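with open("<readfile>") as read_file:
    with open("<savefile>", "w") as write_file:
        currently_skipping = False
        for line in read_file:
            # Lines read from a file keep their trailing newline,
            # so compare against the stripped text
            marker = line.strip()
            if marker == "BEGIN":
                currently_skipping = True
            elif marker == "END":
                currently_skipping = False
                continue  # the END line itself must not be written

            if currently_skipping:
                continue

            write_file.write(line)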

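This should basically do what you need. Rather than reading everything into memory with readlines(), it takes a line-by-line approach, which should also be more memory efficient.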
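
For completeness, here is how the line-by-line approach might look as a small file-to-file function. The function name copy_without_blocks, the file names, and the UTF-8 encoding are all assumptions for illustration, and markers are matched exactly after stripping whitespace. The demo writes to a temporary directory so it can be run as is.

```python
import os
import tempfile


def copy_without_blocks(src_path, dst_path, begin="BEGIN", end="END"):
    """Copy src to dst line by line, skipping begin..end blocks.

    Hypothetical helper. Returns the number of lines written; only one
    line is held in memory at a time.
    """
    written = 0
    skipping = False
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            marker = line.strip()
            if marker == begin:
                skipping = True
            elif marker == end and skipping:
                skipping = False  # drop the END line as well
            elif not skipping:
                dst.write(line)
                written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "output.txt")
    with open(src_path, "w", encoding="utf-8") as fh:
        fh.write("a\nBEGIN\nc\nEND\nf\n")
    count = copy_without_blocks(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as fh:
        output = fh.read()
```

An END line that appears outside any block is copied through unchanged here; whether that is the right behavior depends on your files.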