
In Python parallel processing, how can I identify the function call made by the first process?

  •  0
  • Ratha  · Tech Community  · 6 years ago

    I have the following code snippet, which reads a list of CSV files and merges them into a single CSV.

    import multiprocessing
    from glob import glob
    
    # Assumption: max_threads is not defined in the original post
    max_threads = multiprocessing.cpu_count()
    
    def do():
        pool = multiprocessing.Pool(max_threads)
        outputdir = 'output/'
        list_of_csvs = glob(outputdir + '*.csv')
        pool.map(writeToSingleCSV, list_of_csvs)
        pool.close()
        pool.join()
    
    def writeToSingleCSV(csvFile):
        # Append every line of csvFile (header included) to the combined file
        with open('singleDataFile.csv', 'a') as singleFile:
            with open(csvFile, 'r') as inFile:
                for line in inFile:
                    singleFile.write(line)
    

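    The code above works, but I want to skip the header of every CSV file after the first one (all the CSV files contain the same header). How can I skip the header from the second file onward?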

    3 replies  |  as of 6 years ago
        1
  •  1
  •   Arghya Saha    6 years ago

    Why not write the header separately? Something like this:

    import multiprocessing
    from glob import glob
    
    # Assumption: max_threads is not defined in the original post
    max_threads = multiprocessing.cpu_count()
    
    def do():
        pool = multiprocessing.Pool(max_threads)
        outputdir = 'output/'
        list_of_csvs = glob(outputdir + '*.csv')
        # Write the shared header once, before any worker starts appending
        writeToHEADERCSV(list_of_csvs[0])
        pool.map(writeToSingleCSV, list_of_csvs)
        pool.close()
        pool.join()
    
    def writeToHEADERCSV(csvFile):
        with open('singleDataFile.csv', 'a') as singleFile:
            with open(csvFile, 'r') as inFile:
                # Get the first line and write it to the file
                singleFile.write(inFile.readline())
    
    def writeToSingleCSV(csvFile):
        with open('singleDataFile.csv', 'a') as singleFile:
            with open(csvFile, 'r') as inFile:
                next(inFile, None)  # skip the first line, which is the header
                for line in inFile:
                    singleFile.write(line)
    
        2
  •  2
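  •   Roshan Bagdiya    6 years ago

    Another approach: pandas can help here. Each file's header row is parsed into column names rather than data, so it is not repeated, and ignore_index=True gives the combined frame a fresh row index.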

    import glob
    import pandas as pd
    
    # Read every xlsx file in the folder; each header row becomes the column names
    frames = [pd.read_excel(f) for f in glob.glob("*.xlsx")]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    all_data = pd.concat(frames, ignore_index=True)
    print(all_data.describe())
    all_data.to_excel('SingleFile.xlsx')
    
        3
  •  0
  •   spencer.pinegar    6 years ago

    I would simply write the header before running the map over writeToSingleCSV, and make writeToSingleCSV ignore the header by default.

    import multiprocessing
    from glob import glob
    
    # Assumption: max_threads is not defined in the original post
    max_threads = multiprocessing.cpu_count()
    
    def do():
        pool = multiprocessing.Pool(max_threads)
        outputdir = 'output/'
        list_of_csvs = glob(outputdir + '*.csv')
        # Write one CSV file with its header
        csv_with_header = list_of_csvs.pop()
        writeToSingleCSV(csv_with_header, ignoreHeader=False)
        # Write the remaining CSV files without the header
        pool.map(writeToSingleCSV, list_of_csvs)
        pool.close()
        pool.join()
    
    def writeToSingleCSV(csvFile, ignoreHeader=True):
        with open('singleDataFile.csv', 'a') as singleFile:
            with open(csvFile, 'r') as inFile:
                if ignoreHeader:
                    # Skip the header line
                    next(inFile, None)
                for line in inFile:
                    singleFile.write(line)
    

    This keeps it simple and explicit, and it should be easy to implement.
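
    One thing to watch in the multiprocessing versions: several processes append to singleDataFile.csv at the same time, so the order of the files in the output is not guaranteed and buffered writes can interleave. A minimal sketch of one way around that, assuming it is acceptable to hold each file's lines in memory and that every file ends with a newline: the workers only read, and the parent process does all the writing. The names read_body and merge_csvs are made up for this example.

```python
import glob
import multiprocessing


def read_body(csv_path):
    """Return the data lines of one CSV file, dropping its header line."""
    with open(csv_path, 'r', newline='') as in_file:
        next(in_file, None)  # skip the header; no error if the file is empty
        return in_file.readlines()


def merge_csvs(csv_paths, out_path, processes=2):
    """Merge csv_paths into out_path, keeping only the first file's header."""
    if not csv_paths:
        return  # nothing to merge
    with open(csv_paths[0], 'r', newline='') as first:
        header = first.readline()
    with multiprocessing.Pool(processes) as pool:
        # pool.map returns results in the same order as csv_paths
        bodies = pool.map(read_body, csv_paths)
    # Only the parent writes, so lines from different files cannot interleave
    with open(out_path, 'w', newline='') as out_file:
        out_file.write(header)
        for lines in bodies:
            out_file.writelines(lines)


if __name__ == '__main__':
    merge_csvs(sorted(glob.glob('output/*.csv')), 'singleDataFile.csv')
```

    The trade-off is memory: every file's contents pass back through the parent, so for very large inputs a sequential loop that streams each file may be the simpler choice.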