
Can I make PyPDF2 fail gracefully when the PDF it is parsing is corrupted?

  • Saqib Ali  · 6 years ago

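    I have a Python application that collects hundreds of PDF files from public websites and parses them with the Python library PyPDF2.

    Of the hundreds of files that parse successfully, one is giving me heartache. It is 18 pages long and is named "bad.pdf". You can see it here.

    Here is my code, which parses the entire document: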

    $ virtualenv my_env
    $ source my_env/bin/activate
    (my_env) $ pip install PyPDF2==1.26.0
    (my_env) $ python
    >>> import PyPDF2
    >>> def parse_pdf_doc():
    ...     pdfFileObj = open('bad.pdf', 'rb')
    ...     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    ...     for curr_page_num in range(pdfReader.numPages):
    ...         print('curr_page_num = {}'.format(curr_page_num))
    ...         pageObj = pdfReader.getPage(curr_page_num)
    ...         print('\tPage Retrieved successfully')
    ...         page_text = pageObj.extractText()
    ...         print('\tText extracted successfully')
    ...
    

    When I run this code, it parses the first 9 pages successfully. On the tenth page, however, it hangs forever:

    >>> parse_pdf_doc()
    curr_page_num = 0
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 1
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 2
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 3
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 4
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 5
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 6
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 7
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 8
        Page Retrieved successfully
        Text extracted successfully
    curr_page_num = 9
        Page Retrieved successfully
    <... hung here forever ...>
    

    What is wrong with page 10? Let's open it in a viewer. Oh wow: even Google Docs cannot render page 10. So something on this page is definitely corrupted.

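    Still, I need PyPDF2 to throw an exception or otherwise fail, rather than get stuck in an infinite loop. It kills my workflow. How can I handle corrupted pages in a PDF file?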

    1 answer  |  as of 6 years ago
  • Hayat  · 6 years ago

    The template below should give you an idea of how to achieve this.

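    from multiprocessing import Process
    from pdfminer.pdfpage import PDFPage  # assumed source of PDFPage (import missing in the original)
    processTimeout = 20  # seconds allowed per page
    with open('bad.pdf', 'rb') as pdfFileObj:
        for page in PDFPage.get_pages(pdfFileObj):
            # Function_to_extract_text is a placeholder for your own extraction function
            extractTextProcess = Process(target=Function_to_extract_text, args=(pdfFileObj, page))
            extractTextProcess.start()
            extractTextProcess.join(processTimeout)  # wait at most processTimeout seconds
            if extractTextProcess.is_alive(): extractTextProcess.terminate()  # page hung: give up on it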
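    The timeout idea in the template above can be fleshed out into something runnable. What follows is only a sketch under stated assumptions, not the answerer's exact method: it targets Python 3 on a POSIX system (it requests the "fork" start method), and the names run_with_timeout, _worker, extract_page_text and parse_pages are made up for illustration. It uses PyPDF2 (as in the question) rather than PDFPage, and reopens the file in the child process for each page, which is wasteful but avoids sharing a file handle between processes.

```python
import multiprocessing as mp


def _worker(conn, func, args):
    """Run func(*args) in the child and send back ('ok', result) or ('error', message)."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report parser errors instead of dying silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=20):
    """Run func(*args) in a child process; raise TimeoutError if no result arrives in time."""
    ctx = mp.get_context("fork")  # assumption: POSIX; "fork" avoids pickling func
    parent_conn, child_conn = ctx.Pipe(duplex=False)  # (receive end, send end)
    proc = ctx.Process(target=_worker, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent keeps only the receive end
    try:
        if not parent_conn.poll(timeout):
            raise TimeoutError("no result within {} s".format(timeout))
        status, payload = parent_conn.recv()
    finally:
        if proc.is_alive():
            proc.terminate()  # kill a child that is still stuck
        proc.join()
        parent_conn.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def extract_page_text(path, page_num):
    """Open the PDF and extract the text of one page (PyPDF2 1.26.0 API, as in the question)."""
    import PyPDF2  # imported lazily so the timeout helper has no PyPDF2 dependency
    with open(path, "rb") as fh:
        reader = PyPDF2.PdfFileReader(fh)
        return reader.getPage(page_num).extractText()


def parse_pages(path, num_pages, timeout=20):
    """Return {page_num: text or None}; None marks a page that hung or raised."""
    results = {}
    for page_num in range(num_pages):
        try:
            results[page_num] = run_with_timeout(
                extract_page_text, (path, page_num), timeout)
        except (TimeoutError, RuntimeError, EOFError):
            results[page_num] = None  # corrupted or unparseable page
    return results
```

    With this in place, something like parse_pages('bad.pdf', 18) would be expected to return text for the pages that parse and None for a page that exceeds the timeout, so the rest of the workflow can carry on.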
    

    Open your file using the with keyword (this guards against leaks, because the file is always closed).
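    As a small illustration of that advice, the sketch below reads the first few bytes of a file inside a with block and returns the handle alongside the data, purely so the caller can confirm that it was closed. The function name first_bytes is hypothetical.

```python
def first_bytes(path, count=8):
    """Return (data, handle); the handle is already closed when this returns."""
    with open(path, "rb") as pdf_file:
        data = pdf_file.read(count)
    # Leaving the with block closed the file, even if read() had raised.
    return data, pdf_file
```

    In real code you would return only the data; the point is that no explicit close() call is needed.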