
Text file processing in Python

  •  2
  • Lee Yaan  · 6 years ago

    I have the following file:

    1    real madrid,barcelona,chelsea,arsenal,cska
    2    chelsea,arsenal,milan,napoli,juventus
    5    bayern,dortmund,celtic,napoli
    7    cska,psg,arsenal,atalanta
    9    atletic bilbao,las palmas,milan,barcelona
    

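    I want to generate a new file with the output below (where the first column used to hold the nodes, it now holds each team, and the second column lists the nodes that have that team as an attribute):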

    real madrid    1
    barcelona    1,9
    chelsea    1,2
    arsenal    1,2,7
    cska    1,7
    milan    2,9
    etc...
    

    First, I opened the file and saved each column to a list:

    file1 = open("myfile.txt","r")
    lines1 = file1.readlines()
    nodes1 = []
    attrs1 = []
    
    
    for x in lines1:
        x = x.strip()
        x = x.split('\t')
        nodes1.append(x[0])
        attrs1.append(x[1].split(','))
    

    But how do I now go through the attributes and nodes to generate the output file?

    5 answers  |  up to 6 years ago
        1
  •  4
  •   Petr Blahos    6 years ago

    It is better to build a dictionary while reading the file:

    line_map = {}
    for x in lines1:
        (row_no, teams) = x.strip().split("\t")
        for i in teams.split(","):
            if i not in line_map:
                line_map[i] = set()
            line_map[i].add(row_no)
    

    Now line_map contains a mapping from each team name to the set of rows it appears in. You can print it easily:

    for (k, v) in line_map.items():
        print("%s: %s" % (k, ",".join(v)))
    

    If I understood you correctly...

    Edit: it should have been add, not append.
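
    To make the add/append point concrete, here is a minimal self-contained sketch of the same dictionary-of-sets idea. The sample data is inlined instead of read from a file, and build_line_map is just a name I made up for illustration. A set has add() (a list has append()), and a set also drops duplicates if a team is listed twice on one row.

```python
# Sample data from the question, inlined so the sketch runs without a file.
SAMPLE = (
    "1\treal madrid,barcelona,chelsea,arsenal,cska\n"
    "2\tchelsea,arsenal,milan,napoli,juventus\n"
    "5\tbayern,dortmund,celtic,napoli\n"
    "7\tcska,psg,arsenal,atalanta\n"
    "9\tatletic bilbao,las palmas,milan,barcelona\n"
)


def build_line_map(lines):
    """Map each team name to the set of row numbers it appears in.

    Hypothetical helper name; assumes exactly one tab per non-blank line.
    """
    line_map = {}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        row_no, teams = line.split("\t")
        for team in teams.split(","):
            if team not in line_map:
                line_map[team] = set()
            line_map[team].add(row_no)  # set.add, not list.append
    return line_map


line_map = build_line_map(SAMPLE.splitlines())
for team, rows in line_map.items():
    # sort the row numbers numerically, since sets have no fixed order
    print("%s: %s" % (team, ",".join(sorted(rows, key=int))))
```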

        2
  •  3
  •   zwer    6 years ago

    You can create a dictionary to hold your teams and populate it with nodes as you encounter them:

    import collections
    
    teams = collections.defaultdict(set)  # initiate each team with a set for nodes
    with open("myfile.txt", "r") as f:  # open the file for reading
        for line in f:  # read the file line by line
            row = line.strip().split("\t")  # assuming a tab separator as in your code
            if len(row) < 2:  # skip empty or malformed lines (no second column)
                continue
            for team in row[1].split(","):  # split and iterate over each team
                teams[team].add(row[0].strip())  # add a node to the current team
    
    # and you can now print it out:
    for team, nodes in teams.items():
        print("{}\t{}".format(team, ",".join(nodes)))
    

    This will produce:

    arsenal    2,1,7
    atalanta    7
    chelsea 2,1
    cska    1,7
    psg 7
    juventus    2
    real madrid 1
    barcelona   9,1
    dortmund    5
    celtic  5
    napoli  2,5
    milan   9,2
    las palmas  9
    atletic bilbao  9
    bayern  5

    for your data. The order is not guaranteed, but you can always apply sorted() to get them in the order you want.
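
    As a rough sketch of the sorted() idea: the snippet below sorts the teams alphabetically and the nodes numerically. It assumes every node label is an integer string (hence key=int, so that "10" sorts after "2"). The small teams mapping and the format_rows helper are made up for illustration.

```python
import collections

# Hypothetical in-memory stand-in for the `teams` mapping built from the file.
teams = collections.defaultdict(set)
for node, names in [("1", "chelsea,arsenal"), ("2", "chelsea,arsenal"),
                    ("7", "arsenal"), ("10", "arsenal")]:
    for name in names.split(","):
        teams[name].add(node)


def format_rows(mapping):
    """Return 'team<TAB>nodes' rows, teams A-Z and nodes in numeric order."""
    rows = []
    for team in sorted(mapping):                # team names alphabetically
        nodes = sorted(mapping[team], key=int)  # numeric order, not text order
        rows.append("{}\t{}".format(team, ",".join(nodes)))
    return rows


for row in format_rows(teams):
    print(row)
```

    If your node labels are not all numeric, drop key=int and the nodes will be sorted as plain text instead.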

    UPDATE: To save the result to a file, just use handle.write():

    with open("out_file.txt", "w") as f:  # open the file for writing
        for team, nodes in teams.items():  # iterate through the collected team-node pairs
            f.write("{}\t{}\n".format(team, ",".join(nodes)))  # write each as a new line
    
        3
  •  2
  •   snakes_on_a_keyboard    6 years ago

    Here is an approach (?) using regular expressions. Happy coding :)

    #!/usr/bin/env python3.6
    import re, io, itertools
    
    if __name__ == '__main__':
        groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')
                  for line in io.StringIO(open('f.txt').read())]
        enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
        for a, b in itertools.groupby(enums, lambda x: x[0]):
            print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))
    

    Explanation (sort of)

    #!/usr/bin/env python3.6
    import re, io, itertools
    
    if __name__ == '__main__':
        # ('\d*') <-- match and capture leading integers
        # '\s*' <---- match but don't capture intervening space
        # ('.*') <--- match and capture the everything else
    
        # ('\g<2>|\g<1>') <--- swaps the second capture group with the first
        #                      and puts a "|" in between for easy splitting
    
        # io.StringIO is a great wrapper for a string, makes it easy to process text
    
        # re.subn is used to perform the regex swapping
        groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|') for line in io.StringIO(open('f.txt').read())]
    
        # convert [['place1,place2', 1], ['place3,place4', 2] ...] -> [[place1, 1], [place2, 1], [place3, 2], [place4, 2] ...]
        enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
        # group together, extract numbers, ...?, profit!
        for a, b in itertools.groupby(enums, lambda x: x[0]):
            print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))
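
    To see what the swap does on its own, here is a small sketch that runs the same pattern on a single sample line (the line is inlined, no file needed). The plain str.split variant at the end is my own suggestion for comparison, not part of the approach above.

```python
import re

line = "1\treal madrid,barcelona\n"  # one tab-separated sample line

# Swap the leading number and the rest of the line, joined by "|".
# Raw strings keep the backslashes in the pattern and replacement intact.
swapped, n_subs = re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)
teams, node = swapped.strip().split('|')

# Alternative without a regex: split once on the first run of whitespace.
node2, teams2 = line.strip().split(None, 1)

print(repr(swapped))
print(teams, node)
```

    Both variants assume each line starts with the number followed by whitespace and that no team name contains a "|".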
    

    Bonus: one-liner "piss off your coworkers" edition

    #!/usr/bin/env python3.6
    import io
    import itertools
    import re
    
    if __name__ == '__main__':
        groups = [[place, lines]
                  for a, b in itertools.groupby(sorted([[word, n]
                  for line in io.StringIO(open('f.txt').read())
                  for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
                  for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
                  for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]
    
        for place, lines in groups:
            print(place, lines)
    

    "Bonus" #2: write the output directly to a file, piss-off-your-coworkers-and-have-no-life edition v1.2

    #!/usr/bin/env python3.6
    import io
    import itertools
    import re
    
    if __name__ == '__main__':
        with open('output.txt', 'w') as f:
            groups = [print(place, lines, file=f)
                      for a, b in itertools.groupby(sorted([[word, n]
                      for line in io.StringIO(open('f.txt').read())
                      for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
                      for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
                      for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]
    

    "Bonus" #3: terminal-tables-because-I-got-fired-for-pissing-off-my-coworkers-so-I-have-free-time edition v75.2

    Note: requires terminaltables 3rd party library

    #!/usr/bin/env python3.6
    import io
    import itertools
    import re
    import terminaltables
    
    if __name__ == '__main__':
        print(terminaltables.AsciiTable(
            [['Places', 'Line No.'], *[[place, lines]
              for a, b in itertools.groupby(sorted([[word, n]
              for line in io.StringIO(open('f.txt').read())
              for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
              for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
              for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]]).table)
    
    Output
    +----------------+----------+
    | Places         | Line No. |
    +----------------+----------+
    | arsenal        | 1,2,7    |
    | atalanta       | 7        |
    | atletic bilbao | 9        |
    | barcelona      | 1,9      |
    | bayern         | 5        |
    | celtic         | 5        |
    | chelsea        | 1,2      |
    | cska           | 1,7      |
    | dortmund       | 5        |
    | juventus       | 2        |
    | las palmas     | 9        |
    | milan          | 2,9      |
    | napoli         | 2,5      |
    | psg            | 7        |
    | real madrid    | 1        |
    +----------------+----------+
    
        4
  •  1
  •   handle    6 years ago
    # for this example, instead of reading a file just include the contents as string ..
    file1 = """
    1\treal madrid,barcelona,chelsea,arsenal,cska
    2\tchelsea,arsenal,milan,napoli,juventus
    5\tbayern,dortmund,celtic,napoli
    7\tcska,psg,arsenal,atalanta
    9\tatletic bilbao,las palmas,milan,barcelona
    """
    
    # .. which can be split into a list (same result as with readlines)
    lines1 = file1.strip().split('\n')
    print(lines1)
    
    # using separate lists requires handling indexes, so I'd use a dictionary instead
    output_dict = {}
    
    # iterate as before
    for x in lines1:
        # you can chain the methods, and assign both parts of the line 
        # simultaneously (must be two parts exactly, so one TAB, or there
        # will be an error (Exception))
        node, attrs = x.strip().split('\t')
    
        # separate the list of clubs
        clubs = attrs.split(',')
    
        # collect each club in the output ..
        for club in clubs:
            # and with it, a list of the node(s)
            if club in output_dict:
                # add entry to the list for the existing club
                output_dict[club].append(node)
            else:
                # insert the club with a new list containing the first entry
                output_dict[club] = [node]
    
        # that should be it, let's see ..
    
    # iterate the dict(ionary)
    for club in output_dict:
        # convert list of node(s) to a string by joining the elements with a comma
        nodestr = ','.join(output_dict[club])
    
        # create a formatted string with the club and its nodes
        clubstr = "{:20}\t{}".format(club, nodestr)
    
        # print to stdout (e.g. console)
        print( clubstr )
    

    prints

    ['1\treal madrid,barcelona,chelsea,arsenal,cska', '2\tchelsea,arsenal,milan,napoli,juventus', '5\tbayern,dortmund,celtic,napoli', '7\tcska,psg,arsenal,atalanta', '9\tatletic bilbao,las palmas,milan,barcelona']
    real madrid             1
    barcelona               1,9
    chelsea                 1,2
    arsenal                 1,2,7
    cska                    1,7
    milan                   2,9
    napoli                  2,5
    juventus                2
    bayern                  5
    dortmund                5
    celtic                  5
    psg                     7
    atalanta                7
    atletic bilbao          9
    las palmas              9
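
    A shorter variant of the same idea, as a sketch: dict.setdefault replaces the if/else. On Python 3.7+ a plain dict keeps insertion order, so the clubs come out in first-seen order, which matches the output asked for in the question. The function name group_nodes_by_club is made up for illustration.

```python
# Sample lines from the question, inlined so the sketch runs without a file.
lines = [
    "1\treal madrid,barcelona,chelsea,arsenal,cska",
    "2\tchelsea,arsenal,milan,napoli,juventus",
    "5\tbayern,dortmund,celtic,napoli",
    "7\tcska,psg,arsenal,atalanta",
    "9\tatletic bilbao,las palmas,milan,barcelona",
]


def group_nodes_by_club(lines):
    """Return {club: [node, ...]} with clubs in first-seen order."""
    clubs = {}
    for line in lines:
        node, attrs = line.strip().split("\t")  # exactly one TAB expected
        for club in attrs.split(","):
            # setdefault inserts an empty list the first time a club is seen
            clubs.setdefault(club, []).append(node)
    return clubs


clubs = group_nodes_by_club(lines)
for club, nodes in clubs.items():
    print("{}\t{}".format(club, ",".join(nodes)))
```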
    
        5
  •  0
  •   Ben.T    6 years ago

    Here is a solution with pandas (why not)

    import pandas as pd
    # raw strings, so the '\t' inside the path is not read as a tab character
    path_file_input = r'path\to\input_file.txt'
    path_file_output = r'path\to\output_file.txt'
    
    # Read the data from a txt file (with a tab separating the columns)
    data = pd.read_csv(path_file_input, sep ='\t', header=None, names=[ 'Nodes', 'List Teams'], dtype=str)
    # Create a column with all couple team-node
    data_split = data['List Teams'].str.split(',', expand=True).stack().reset_index(level=0)\
                    .set_index('level_0').rename(columns={0:'Teams'}).join(data.drop('List Teams', axis=1), how='left')
    # Merge the data per team and join the nodes
    data_merged = data_split.groupby('Teams')['Nodes'].apply(','.join).reset_index()
    
    # Save as a txt file
    data_merged.to_csv(path_file_output, sep='\t', index=False, header=False, float_format = str)
    # or display the data
    print (data_merged.to_csv(sep='\t', header=False, index=False))
    

    See normalizing data by duplication for the approach used to build data_split