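I have a Python application that collects hundreds of PDF files from public websites and parses them with the Python library PyPDF2.
Out of hundreds of successfully parsed files, one file is giving me grief. It is 18 pages long and is named "bad.pdf". You can see it here.
Here is my code, which is meant to parse the whole document: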
$ virtualenv my_env
$ source my_env/bin/activate
(my_env) $ pip install PyPDF2==1.26.0
(my_env) $ python
>>> import PyPDF2
>>> def parse_pdf_doc():
...     pdfFileObj = open('bad.pdf', 'rb')
...     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
...     for curr_page_num in range(pdfReader.numPages):
...         print 'curr_page_num = {}'.format(curr_page_num)
...         pageObj = pdfReader.getPage(curr_page_num)
...         print '\tPage Retrieved successfully'
...         page_text = pageObj.extractText()
...         print '\tText extracted successfully'
...
When I run this code, it successfully parses the first nine pages. But on the tenth page it hangs. Forever:
>>> parse_pdf_doc()
curr_page_num = 0
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 1
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 2
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 3
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 4
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 5
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 6
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 7
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 8
	Page Retrieved successfully
	Text extracted successfully
curr_page_num = 9
	Page Retrieved successfully
<... hung here forever ...>
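What is wrong with page 10? Let's open it in a viewer. Oh wow: even Google Docs cannot parse page 10. So something on this page is definitely broken.
Still, I need PyPDF2 to throw an exception or fail in some other way, rather than just going into an infinite loop. It kills my workflow. How can I handle corrupted pages in a PDF file?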
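To make the requirement concrete, here is a minimal sketch of the behaviour I am after: a per-page time limit that turns a hang into an exception. It is not something PyPDF2 provides. `PageTimeout`, `time_limit` and `extract_pages` are hypothetical names of my own, and the approach assumes a Unix system, code running in the main thread, and a hang inside pure-Python code (which is what SIGALRM can interrupt).

```python
import signal
from contextlib import contextmanager


class PageTimeout(Exception):
    """Raised when a block of code runs longer than its time limit."""


@contextmanager
def time_limit(seconds):
    # Hypothetical helper (not part of PyPDF2). Unix-only, main thread only:
    # SIGALRM interrupts pure-Python code once the timer expires.
    def _on_alarm(signum, frame):
        raise PageTimeout("timed out after {} s".format(seconds))

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def extract_pages(reader, seconds=10):
    """Return {page_num: text}; None marks a page that timed out or raised."""
    results = {}
    for page_num in range(reader.numPages):
        try:
            with time_limit(seconds):
                results[page_num] = reader.getPage(page_num).extractText()
        except Exception as exc:  # includes PageTimeout
            results[page_num] = None
            print("page {} skipped: {!r}".format(page_num, exc))
    return results
```

With the reader from my session, `extract_pages(pdfReader, seconds=10)` would ideally record `None` for page index 9 and carry on with the remaining pages instead of blocking. The 10 second limit is an arbitrary choice.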