
Python HTML parsing that actually works

parsing html python

viraptor · Tech Community · 14 years ago

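I'm trying to parse some HTML in Python. There were some methods that actually worked before... but nowadays there's nothing I can use without workarounds.

beautifulsoup has problems after SGMLParser went away
lxml tries to be "too correct" for typical HTML (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page with Facebook Connect can be parsed)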

5 answers | last activity 14 years ago

Tim McNamara 14 years ago

Make sure that you use the html module when you use lxml:

>>> from lxml import html
>>> doc = """<html>
... <head>
...   <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>

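PaulMcG 14 years ago

I've used pyparsing for a number of HTML page scraping projects. It is a middle ground between BeautifulSoup and the full HTML parsers on one end, and the too-low-level approach of regular expressions (that way lies madness) on the other.

With pyparsing you target only the part of the page you care about instead of parsing everything on the page, since some problematic HTML outside your region of interest could throw off a comprehensive HTML parser.

pyparsing's HTML tag expressions:

accept whitespace without littering "\s*" all over the expression
handle unexpected attributes within tags
handle upper/lower case in tags
handle attribute names with namespaces
handle attribute values without quotes
handle empty tags (such as <blah />)
return parsed tag data with object-attribute access to the tag attributes

Here is a simple example that extracts <a href=xxx> tags from a web page: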

from urllib.request import urlopen

from pyparsing import makeHTMLTags, SkipTo

# read HTML from a web page (Python 3: urlopen lives in urllib.request)
page = urlopen("http://www.yahoo.com")
htmlText = page.read().decode("utf-8", errors="replace")
page.close()

# define pyparsing expression to search for within HTML
anchorStart, anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

# scanString yields (tokens, start, end) for every match in the text
for tokens, start, end in anchor.scanString(htmlText):
    print(tokens.body, "->", tokens.href)

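This pulls out the <a> tags.

Pyparsing is not a foolproof solution to this problem, but by exposing the parsing process to you, it gives you better control over which pieces of the HTML you are specifically interested in, so you can process them and skip the rest.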
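
For comparison, the same "grab only the <a> tags and skip the rest" idea can be roughly sketched with the standard library's html.parser. This is a swapped-in technique, not pyparsing, and the names AnchorCollector, extract_links and sample are made up for illustration. It assumes anchors are not nested and that each one has a closing tag.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (body text, href) pairs for <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (body, href) pairs
        self._inside = False  # True while an <a> is open
        self._href = None     # href of the currently open <a>, if any
        self._text = []       # text chunks seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append(("".join(self._text).strip(), self._href))
            self._inside = False


def extract_links(html_text):
    """Return a list of (body, href) tuples found in html_text."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sloppy input: namespaced tag, unquoted attributes,
# upper-case tag names, an unclosed <p>, and no closing </html>.
sample = """<html><body>
<fb:login-button size=large></fb:login-button>
<A HREF=first.html>First</A> <p>unclosed paragraph
<a class="x" href='second.html'>Second <b>bold</b></a>
</body>"""

for body, href in extract_links(sample):
    print(body, "->", href)
```

The parser never builds a full tree, so markup it does not care about (the fb: tag, the unclosed paragraph) is simply passed over rather than validated.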

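Ms2ger 14 years ago

html5lib cannot parse half of the content out there

Tim McNamara 14 years ago

If you are scraping content, a good way to get around the irritating details is the sitescraper package. It uses machine learning to determine which content to retrieve for you.

From the homepage: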

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", 
             ["Learning Python, 3rd Edition", 
             "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", 
             "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2) 
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", 
"Linux Pocket Guide", 
"Linux in a Nutshell (In a Nutshell (O'Reilly))", 
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]

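winwaed 14 years ago

Even a few years ago, I tried to parse HTML for a primitive spider-type application and found the problem too difficult. I suspect writing your own might be on the cards, although we can't be the only people with this problem!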