
Multiprocessing multiple image files into text with OCR

  Anshuman Sinha · 7 months ago

    I have more than 100k image files that need to be processed into text. A conventional for loop takes far too long on my machine, so I decided to run the code on an Azure instance with 8 CPU cores and 32 GB of RAM. I tried to parallelize the code with multiprocessing, but it doesn't seem to speed things up at all (or at least not significantly). Earlier I was running on my local 4-core laptop (but that code was not parallel).

    I want to process multiple files at once, i.e. use the different available cores to get an 8x speedup (I know we won't get a full 8x in practice; I'm just stating it as a reference point). By the way, I currently have an Apple MacBook Pro with 16 cores and 32 GB of RAM as my local machine; if we could manage to use that to speed things up, that would be great too.

    If possible, could someone suggest a machine configuration on Azure or AWS that would actually give me a big time saving? At the current rate I think it would take more than 300 hours.
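For a rough sense of the numbers (assuming the ~100k images and ~300 serial hours mentioned above):

```python
# Back-of-envelope throughput math; the inputs are the question's own estimates.
n_images = 100_000
total_hours = 300                          # current serial estimate

per_image_s = total_hours * 3600 / n_images   # seconds per image, serial
workers = 8
ideal_hours = total_hours / workers           # perfect 8x scaling, never reached in practice

print(f"{per_image_s:.1f} s/image serial, ~{ideal_hours:.1f} h at ideal 8x")
```

So each image currently takes on the order of 10 seconds, which suggests the per-image OCR cost (not file I/O) is the bottleneck worth parallelizing.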

    The code I am running is as follows:

    Code:

    import os
    import easyocr
    import numpy as np
    from PIL import Image
    import pandas as pd
    from concurrent.futures import ProcessPoolExecutor
    from tqdm import tqdm
    import multiprocessing
    
    # Use 'spawn' start method for multiprocessing to avoid CUDA issues
    multiprocessing.set_start_method('spawn', force=True)
    
    # Initialize the EasyOCR reader at module level (note: with 'spawn', each
    # worker process re-imports this module and therefore re-creates the reader)
    reader = easyocr.Reader(['en'], gpu=True)  # Set gpu=True to use GPU
    
    def ocr_and_extract_text(image_path):
        try:
            # Load the image and convert it to grayscale
            # (optional, depending on your OCR model requirements)
            image = Image.open(image_path)
            image = image.convert('L')

            # Perform OCR on the grayscale image (as a numpy array) with EasyOCR;
            # passing image_path here would silently ignore the conversion above
            extracted_text = reader.readtext(np.array(image), detail=0)
            return ' '.join(extracted_text)  # Join list into a single string
        except Exception as e:
            print(f"Error processing {image_path}: {e}")
            return 'ERROR'
    
    def process_images_in_parallel(csv_file, output_text_csv, num_workers=None):
        # Read the CSV file containing image paths and labels
        data = pd.read_csv(csv_file)
        
        # Add a new column to store the extracted text
        data['extracted_text'] = ''
    
        # Create a list of image paths
        image_paths = data['image_path'].tolist()
    
        # Use ProcessPoolExecutor to parallelize the OCR
        with ProcessPoolExecutor(max_workers=num_workers) as executor:
            results = list(tqdm(executor.map(ocr_and_extract_text, image_paths), total=len(image_paths), desc="Processing images"))
    
        # executor.map preserves input order, so assign the results directly
        data['extracted_text'] = results
    
        # Save the results to a new CSV file
        data.to_csv(output_text_csv, index=False)
    
    
    def main():
    
        output_folder = '/parallel_img_out'
        output_text_csv = os.path.join(output_folder, 'extracted_texts.csv')
        input_csv = os.path.join(output_folder, 'output_labels_huoston.csv')
    
        # Check the label counts (not required but useful for debugging)
        data = pd.read_csv(input_csv)
        label_counts = data['label'].value_counts()
        print('label_counts', label_counts)
    
        # Get the number of available CPU cores
        num_cores = os.cpu_count()
        print(f"Number of available cores: {num_cores}")
        
        # Process images using parallelism across multiple cores
        process_images_in_parallel(input_csv, output_text_csv, num_workers=num_cores - 1)
    
    # Example usage
    if __name__ == "__main__":
        main()
    
    0 replies  |  as of 7 months ago