I have over 100,000 image files that I need to process into text. Running a conventional for loop on my machine takes a huge amount of time, so I decided to run the code on an Azure instance with 8 CPU cores and 32 GB of RAM. I tried to parallelize the code with multiprocessing, but it doesn't seem to speed the process up at all (or by much). Previously I was running on my local 4-core laptop (but the code wasn't parallel).
I want to process multiple files at once, i.e. use the different available cores to get an 8x speedup (I know we won't get a full 8x in practice; I'm only using it as a reference point). By the way, my current local machine is an Apple MacBook Pro with 16 cores and 32 GB of RAM, so if we can manage to use that for the speedup, that would also be great.
If possible, could someone suggest a machine configuration on Azure or AWS that would actually give me a big win in time? At the current pace I think it will take more than 300 hours (over 300 hours for 100k+ images works out to roughly 10 seconds per image).
The code I'm running is as follows:
Code:
import os
import easyocr
from PIL import Image
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
import multiprocessing

multiprocessing.set_start_method('spawn', force=True)

reader = easyocr.Reader(['en'], gpu=True)

def ocr_and_extract_text(image_path):
    try:
        image = Image.open(image_path)
        image = image.convert('L')
        extracted_text = reader.readtext(image_path, detail=0)
        extracted_text = ' '.join(extracted_text)
        return extracted_text
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return 'ERROR'

def process_images_in_parallel(csv_file, output_text_csv, num_workers=None):
    data = pd.read_csv(csv_file)
    data['extracted_text'] = ''
    image_paths = data['image_path'].tolist()
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        results = list(tqdm(executor.map(ocr_and_extract_text, image_paths),
                            total=len(image_paths), desc="Processing images"))
    for image_path, extracted_text in zip(image_paths, results):
        data.loc[data['image_path'] == image_path, 'extracted_text'] = extracted_text
    data.to_csv(output_text_csv, index=False)

def main():
    output_folder = '/parallel_img_out'
    output_text_csv = os.path.join(output_folder, 'extracted_texts.csv')
    input_csv = os.path.join(output_folder, 'output_labels_huoston.csv')
    data = pd.read_csv(input_csv)
    label_counts = data['label'].value_counts()
    print('label_counts', label_counts)
    num_cores = os.cpu_count()
    print(f"Number of available cores: {num_cores}")
    process_images_in_parallel(input_csv, output_text_csv, num_workers=num_cores - 1)

if __name__ == "__main__":
    main()
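
For what it's worth, here is the direction I'm considering for the worker setup, as an untested sketch (this is NOT what I'm running above). The idea is to create one reader per worker process via the executor's initializer hook instead of at module import; gpu=False and torch.set_num_threads(1) are my assumptions for a CPU-only box, and run_parallel is just a hypothetical helper name:
Code:
import torch
import easyocr
from concurrent.futures import ProcessPoolExecutor

_reader = None  # per-process global, created once by the initializer

def _init_worker():
    global _reader
    # Assumption: limit each worker to one torch thread so 8 workers
    # don't fight over the same 8 cores.
    torch.set_num_threads(1)
    # Assumption: gpu=False because the Azure instance has no GPU.
    _reader = easyocr.Reader(['en'], gpu=False)

def _ocr(image_path):
    try:
        return ' '.join(_reader.readtext(image_path, detail=0))
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return 'ERROR'

def run_parallel(image_paths, num_workers=None):
    with ProcessPoolExecutor(max_workers=num_workers,
                             initializer=_init_worker) as executor:
        # chunksize batches tasks to reduce pickling overhead on 100k+ items.
        return list(executor.map(_ocr, image_paths, chunksize=32))

Is something along these lines the right way to keep the workers from stepping on each other, or is my bottleneck somewhere else entirely?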