
Multiprocessing multiple image files into text with OCR

  Anshuman Sinha · 7 months ago

    I have more than 100k image files that need to be processed into text. A conventional for loop takes far too long on my machine, so I decided to run the code on an Azure instance with 8 CPU cores and 32 GB of RAM. I tried to parallelize the code with multiprocessing, but it doesn't seem to speed things up at all (or at least not significantly). Earlier I was running on my local 4-core laptop (but that code was not parallel).

    I want to process multiple files at once, i.e. use the different available cores to get an 8x speedup (I know we won't get a full 8x in practice; I'm just stating it as a reference point). By the way, I currently have an Apple MacBook Pro with 16 cores and 32 GB of RAM as my local machine; if we could manage to use that to speed things up, that would be great too.

    If possible, could someone suggest a machine configuration on Azure or AWS that would actually give me a big time saving? At the current rate I think it would take more than 300 hours.
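For a rough sense of the numbers (assuming the ~100k images and ~300 serial hours mentioned above):

```python
# Back-of-envelope throughput math; the inputs are the question's own estimates.
n_images = 100_000
total_hours = 300                          # current serial estimate

per_image_s = total_hours * 3600 / n_images   # seconds per image, serial
workers = 8
ideal_hours = total_hours / workers           # perfect 8x scaling, never reached in practice

print(f"{per_image_s:.1f} s/image serial, ~{ideal_hours:.1f} h at ideal 8x")
```

So each image currently takes on the order of 10 seconds, which suggests the per-image OCR cost (not file I/O) is the bottleneck worth parallelizing.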

    The code I am running is as follows:

    Code:

    import os
    import easyocr
    import numpy as np
    from PIL import Image
    import pandas as pd
    from concurrent.futures import ProcessPoolExecutor
    from tqdm import tqdm
    import multiprocessing
    
    # Use 'spawn' start method for multiprocessing to avoid CUDA issues
    multiprocessing.set_start_method('spawn', force=True)
    
    # Initialize the EasyOCR reader at module level (note: with 'spawn', each
    # worker process re-imports this module and therefore re-creates the reader)
    reader = easyocr.Reader(['en'], gpu=True)  # Set gpu=True to use GPU
    
    def ocr_and_extract_text(image_path):
        try:
            # Load the image and convert it to grayscale
            # (optional, depending on your OCR model requirements)
            image = Image.open(image_path)
            image = image.convert('L')

            # Perform OCR on the grayscale image (as a numpy array) with EasyOCR;
            # passing image_path here would silently ignore the conversion above
            extracted_text = reader.readtext(np.array(image), detail=0)
            return ' '.join(extracted_text)  # Join list into a single string
        except Exception as e:
            print(f"Error processing {image_path}: {e}")
            return 'ERROR'
    
    def process_images_in_parallel(csv_file, output_text_csv, num_workers=None):
        # Read the CSV file containing image paths and labels
        data = pd.read_csv(csv_file)
        
        # Add a new column to store the extracted text
        data['extracted_text'] = ''
    
        # Create a list of image paths
        image_paths = data['image_path'].tolist()
    
        # Use ProcessPoolExecutor to parallelize the OCR
        with ProcessPoolExecutor(max_workers=num_workers) as executor:
            results = list(tqdm(executor.map(ocr_and_extract_text, image_paths), total=len(image_paths), desc="Processing images"))
    
        # executor.map preserves input order, so assign the results directly
        data['extracted_text'] = results
    
        # Save the results to a new CSV file
        data.to_csv(output_text_csv, index=False)
    
    
    def main():
    
        output_folder = '/parallel_img_out'
        output_text_csv = os.path.join(output_folder, 'extracted_texts.csv')
        input_csv = os.path.join(output_folder, 'output_labels_huoston.csv')
    
        # Check the label counts (not required but useful for debugging)
        data = pd.read_csv(input_csv)
        label_counts = data['label'].value_counts()
        print('label_counts', label_counts)
    
        # Get the number of available CPU cores
        num_cores = os.cpu_count()
        print(f"Number of available cores: {num_cores}")
        
        # Process images using parallelism across multiple cores
        process_images_in_parallel(input_csv, output_text_csv, num_workers=num_cores - 1)
    
    # Example usage
    if __name__ == "__main__":
        main()
    
    0 replies  |  as of 7 months ago