
Text file processing in Python

file list python

Lee Yaan · Tech Community · 6 years ago

I have the following file:

1    real madrid,barcelona,chelsea,arsenal,cska
2    chelsea,arsenal,milan,napoli,juventus
5    bayern,dortmund,celtic,napoli
7    cska,psg,arsenal,atalanta
9    atletic bilbao,las palmas,milan,barcelona

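I want to generate a new file with the output below. Where I had the nodes, I now have each team, and the second column lists the nodes that have that team as an attribute: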

real madrid    1
barcelona    1,9
chelsea    1,2
arsenal    1,2,7
cska    1,7
milan    2,9
etc...

First, I opened the file and saved each column into a list:

file1 = open("myfile.txt","r")
lines1 = file1.readlines()
nodes1 = []
attrs1 = []


for x in lines1:
    x = x.strip()
    x = x.split('\t')
    nodes1.append(x[0])
    attrs1.append(x[1].split(','))

But how do I now go through the attributes and nodes to generate the output file?

5 replies | up to 6 years ago

Petr Blahos 6 years ago

It is best to build a dictionary while reading the file:

line_map = {}
for x in lines1:
    (row_no, teams) = x.strip().split("\t")
    for i in teams.split(","):
        if i not in line_map:
            line_map[i] = set()
        line_map[i].add(row_no)

Now line_map contains a mapping from each team name to the set of rows it appears in. You can easily print it:

for (k, v) in line_map.items():
    print("%s: %s" % (k, ",".join(v)))

If I'm not mistaken...

Edit: it should have been add rather than append.
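
As a minimal sketch of the same idea applied to the two lists from the question: zip can pair each node with its team list, and dict.setdefault handles the "create the set if it is missing" step in one call. The sample values below are an assumption standing in for the real nodes1 and attrs1, and sorting with key=int is an extra so that the row numbers print in a stable order.

```python
# Sample data standing in for the lists built in the question (assumption).
nodes1 = ["1", "2", "5"]
attrs1 = [
    ["real madrid", "barcelona", "chelsea"],
    ["chelsea", "milan", "napoli"],
    ["bayern", "napoli"],
]

line_map = {}
for node, teams in zip(nodes1, attrs1):
    for team in teams:
        # setdefault returns the existing set, or inserts and returns a new one
        line_map.setdefault(team, set()).add(node)

for team, rows in line_map.items():
    # the rows are strings, so sort them by their integer value
    print("%s: %s" % (team, ",".join(sorted(rows, key=int))))
```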

zwer 6 years ago

You can create a dictionary to hold your teams and populate it with nodes as you encounter them:

import collections

teams = collections.defaultdict(set)  # initiate each team with a set for nodes
with open("myfile.txt", "r") as f:  # open the file for reading
    for line in f:  # read the file line by line
        row = line.strip().split("\t")  # assuming a tab separator as in your code
        if len(row) < 2:  # skip blank or malformed lines ("".split("\t") gives [""], which is truthy)
            continue
        for team in row[1].split(","):  # split and iterate over each team
            teams[team].add(row[0].strip())  # add a node to the current team

# and you can now print it out:
for team, nodes in teams.items():
    print("{}\t{}".format(team, ",".join(nodes)))

This will produce:

arsenal    2,1,7
atalanta    7
chelsea 2,1
cska    1,7
psg 7
juventus    2
real madrid 1
barcelona   9,1
dortmund    5
celtic  5
napoli  2,5
milan   9,2
las palmas  9
atletic bilbao  9
bayern  5

for your data. The order is not guaranteed, but you can always apply sorted() to get them in the order you want.
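
To illustrate the sorted() remark, here is a small sketch that orders the teams alphabetically and the node numbers numerically. The helper name format_rows and the sample data (including a node "10") are made up for illustration. key=int matters because the nodes are stored as strings, and a plain string sort would place "10" before "2".

```python
import collections

# Hypothetical sample of the team -> nodes mapping built above.
teams = collections.defaultdict(set)
sample = [("1", ["chelsea", "arsenal"]), ("2", ["chelsea", "arsenal"]),
          ("7", ["arsenal"]), ("10", ["arsenal"])]
for node, names in sample:
    for name in names:
        teams[name].add(node)


def format_rows(mapping):
    """Return 'team<TAB>n1,n2,...' lines: teams A-Z, nodes in numeric order."""
    return ["{}\t{}".format(team, ",".join(sorted(mapping[team], key=int)))
            for team in sorted(mapping)]


for row in format_rows(teams):
    print(row)
```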

UPDATE: To save the result to a file, just use handle.write():

with open("out_file.txt", "w") as f:  # open the file for writing
    for team, nodes in teams.items():  # iterate through the collected team-node pairs
        f.write("{}\t{}\n".format(team, ",".join(nodes)))  # write each as a new line

snakes_on_a_keyboard 6 years ago

Here is one way (?) to do it using regular expressions. Happy coding :)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)[0].strip().split('|')
              for line in io.StringIO(open('f.txt').read())]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))

Explanation (sort of)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    # ('\d*') <-- match and capture leading integers
    # '\s*' <---- match but don't capture intervening space
    # ('.*') <--- match and capture everything else

    # ('\g<2>|\g<1>') <--- swaps the second capture group with the first
    #                      and puts a "|" in between for easy splitting

    # io.StringIO is a great wrapper for a string, makes it easy to process text

    # re.subn is used to perform the regex swapping
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)[0].strip().split('|') for line in io.StringIO(open('f.txt').read())]

    # convert [[place1,place2, 1], [place3,place4, 2] ...] -> [[place1, 1], [place2, 1], [place3, 2], [place4, 2] ...]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    # group together, extract numbers, ...?, profit!
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))
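
The two steps explained above can be tried in isolation. This is only a sketch on made-up sample values: the first part shows what the regex swap does to a single line, and the second shows why the pairs are sorted before itertools.groupby, which only merges adjacent items that share a key.

```python
import itertools
import re

# One sample line in the "<number><TAB><teams>" layout (assumed input format).
line = "1\treal madrid,barcelona\n"

# Swap "<number><whitespace><rest>" into "<rest>|<number>".
# count=1 stops after the first match on the line.
swapped = re.sub(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1).strip()
print(swapped)

# Hypothetical [team, node] pairs in file order.
pairs = [["milan", "9"], ["arsenal", "2"], ["milan", "2"], ["arsenal", "7"]]


def first(pair):
    """Key function: group and sort by the team name."""
    return pair[0]


# Without sorting, each team shows up as several separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(pairs, key=first)]

# After sorting by team, every team forms exactly one group.
grouped = {key: [node for _, node in items]
           for key, items in itertools.groupby(sorted(pairs, key=first), key=first)}
print(unsorted_keys)
print(grouped)
```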

Bonus: the one-liner "piss off your coworkers" edition

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    groups = [[place, lines]
              for a, b in itertools.groupby(sorted([[word, n]
              for line in io.StringIO(open('f.txt').read())
              for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)[0].strip().split('|')]
              for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
              for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

    for place, lines in groups:
        print(place, lines)

"Bonus" #2: write the output directly to a file, the piss-off-your-coworkers-and-have-no-life edition v1.2

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    with open('output.txt', 'w') as f:
        groups = [print(place, lines, file=f)
                  for a, b in itertools.groupby(sorted([[word, n]
                  for line in io.StringIO(open('f.txt').read())
                  for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)[0].strip().split('|')]
                  for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
                  for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

"Bonus" #3: the terminal-tables-because-I-got-fired-for-pissing-off-my-coworkers-so-I-have-free-time edition v75.2

Note: requires terminaltables 3rd party library

#!/usr/bin/env python3.6
import io
import itertools
import re
import terminaltables

if __name__ == '__main__':
    print(terminaltables.AsciiTable(
        [['Places', 'Line No.'], *[[place, lines]
          for a, b in itertools.groupby(sorted([[word, n]
          for line in io.StringIO(open('f.txt').read())
          for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)[0].strip().split('|')]
          for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
          for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]]).table)

Output

+----------------+----------+
| Places         | Line No. |
+----------------+----------+
| arsenal        | 1,2,7    |
| atalanta       | 7        |
| atletic bilbao | 9        |
| barcelona      | 1,9      |
| bayern         | 5        |
| celtic         | 5        |
| chelsea        | 1,2      |
| cska           | 1,7      |
| dortmund       | 5        |
| juventus       | 2        |
| las palmas     | 9        |
| milan          | 2,9      |
| napoli         | 2,5      |
| psg            | 7        |
| real madrid    | 1        |
+----------------+----------+
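
If installing a third-party package is not an option, a similar bordered table can be approximated with the standard library alone. The function below is a hypothetical helper written for illustration, not part of terminaltables. It assumes every row is a list of strings of the same length and treats the first row as the header.

```python
def ascii_table(rows):
    """Render equal-length rows of strings as a bordered text table.

    The first row is treated as the header.
    """
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def fmt(row):
        # Left-align every cell and pad it to the column width.
        cells = (cell.ljust(w) for cell, w in zip(row, widths))
        return "| " + " | ".join(cells) + " |"

    lines = [border, fmt(rows[0]), border]
    lines += [fmt(row) for row in rows[1:]]
    lines.append(border)
    return "\n".join(lines)


table = ascii_table([["Places", "Line No."], ["arsenal", "1,2,7"], ["psg", "7"]])
print(table)
```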

handle 6 years ago

# for this example, instead of reading a file just include the contents as string ..
file1 = """
1\treal madrid,barcelona,chelsea,arsenal,cska
2\tchelsea,arsenal,milan,napoli,juventus
5\tbayern,dortmund,celtic,napoli
7\tcska,psg,arsenal,atalanta
9\tatletic bilbao,las palmas,milan,barcelona
"""

# .. which can be split into a list (same result as with readlines)
lines1 = file1.strip().split('\n')
print(lines1)

# using separate lists requires handling indexes, so I'd use a dictionary instead
output_dict = {}

# iterate as before
for x in lines1:
    # you can chain the methods, and assign both parts of the line 
    # simultaneously (must be two parts exactly, so one TAB, or there
    # will be an error (Exception))
    node, attrs = x.strip().split('\t')

    # separate the list of clubs
    clubs = attrs.split(',')

    # collect each club in the output ..
    for club in clubs:
        # and with it, a list of the node(s)
        if club in output_dict:
            # add entry to the list for the existing club
            output_dict[club].append(node)
        else:
            # insert the club with a new list containing the first entry
            output_dict[club] = [node]

    # that should be it, let's see ..

# iterate the dict(ionary)
for club in output_dict:
    # convert list of node(s) to a string by joining the elements with a comma
    nodestr = ','.join(output_dict[club])

    # create a formatted string with the club and its nodes
    clubstr = "{:20}\t{}".format(club, nodestr)

    # print to stdout (e.g. console)
    print( clubstr )

Prints:

['1\treal madrid,barcelona,chelsea,arsenal,cska', '2\tchelsea,arsenal,milan,napoli,juventus', '5\tbayern,dortmund,celtic,napoli', '7\tcska,psg,arsenal,atalanta', '9\tatletic bilbao,las palmas,milan,barcelona']
real madrid             1
barcelona               1,9
chelsea                 1,2
arsenal                 1,2,7
cska                    1,7
milan                   2,9
napoli                  2,5
juventus                2
bayern                  5
dortmund                5
celtic                  5
psg                     7
atalanta                7
atletic bilbao          9
las palmas              9
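
A comment in the code above notes that the two-name unpacking raises an exception unless a line contains exactly one TAB. As a hedged sketch of one way around that, str.partition always returns three parts, so blank or tab-less lines can be skipped instead of crashing. The helper name parse_line and the sample lines are assumptions for illustration.

```python
def parse_line(line):
    """Split 'node<TAB>club1,club2' into (node, [clubs]).

    Returns None when the line contains no tab at all.
    """
    node, sep, attrs = line.strip().partition("\t")
    if not sep:
        # blank line, or a line without a TAB separator
        return None
    return node, attrs.split(",")


# Hypothetical input including a blank line and a malformed line.
sample = ["1\treal madrid,barcelona", "", "no tab here", "9\tmilan,barcelona"]
parsed = [parse_line(raw) for raw in sample]
print(parsed)
```

Unlike split('\t'), partition only splits at the first TAB, so any further tabs stay inside the club string rather than raising an error.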

Ben.T 6 years ago

Here is a solution with pandas (why not)

import pandas as pd
# raw strings, so that "\t" in the path is not interpreted as a tab character
path_file_input = r'path\to\input_file.txt'
path_file_output = r'path\to\output_file.txt'

# Read the data from a txt file (with a tab separating the columns)
data = pd.read_csv(path_file_input, sep ='\t', header=None, names=[ 'Nodes', 'List Teams'], dtype=str)
# Create a column with all couple team-node
data_split = data['List Teams'].str.split(',', expand=True).stack().reset_index(level=0)\
                .set_index('level_0').rename(columns={0:'Teams'}).join(data.drop(columns='List Teams'), how='left')
# Merge the data per team and join the nodes
data_merged = data_split.groupby('Teams')['Nodes'].apply(','.join).reset_index()

# Save as a txt file
data_merged.to_csv(path_file_output, sep='\t', index=False, header=False, float_format = str)
# or display the data
print (data_merged.to_csv(sep='\t', header=False, index=False))

See "normalizing data by duplication" for the approach behind data_split.
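
As an alternative sketch of the same reshaping, more recent pandas versions (0.25 and later) provide DataFrame.explode, which turns each element of a list-valued column into its own row. This swaps in a different technique from the stack/join chain above. The in-memory sample data and the column name "Teams" are assumptions for illustration.

```python
import io

import pandas as pd

# Small in-memory stand-in for the tab-separated input file (assumption).
raw = "1\treal madrid,barcelona,chelsea\n2\tchelsea,milan\n9\tmilan,barcelona\n"

data = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                   names=["Nodes", "Teams"], dtype=str)

# Turn the comma-separated string into a list, then give each team its own row.
data["Teams"] = data["Teams"].str.split(",")
pairs = data.explode("Teams")

# Collect the nodes of every team into one comma-joined string.
merged = pairs.groupby("Teams")["Nodes"].agg(",".join).reset_index()
print(merged.to_csv(sep="\t", header=False, index=False))
```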