
Decode HTML entities in a Python string?

html-entities html python

217

jkp · 15 years ago

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

5 answers | latest 8 years ago

429

Adam Nelson 8 years ago

Python 3.4+

HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9. Use html.unescape() instead:

import html
print(html.unescape('&pound;682m'))

See https://docs.python.org/3/library/html.html#html.unescape
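As a small sketch of what html.unescape() covers, it decodes named, decimal, and hexadecimal character references alike. The sample strings below are made up for illustration and are not from the question:

```python
import html

# Illustrative inputs: a named, a decimal, and a hexadecimal reference
# for the pound sign, plus an escaped ampersand.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m", "Tom &amp; Jerry"]

decoded = [html.unescape(s) for s in samples]
for raw, out in zip(samples, decoded):
    print(raw, "->", out)
```

All three spellings of the pound sign should come out as the same character.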

Python 2.6-3.3

You can use the HTML parser from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

See http://docs.python.org/2/library/htmlparser.html

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
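If one code base has to run on both old and new interpreters without six, the two approaches above could be combined roughly as follows. This is only a sketch: the module-level name unescape is a choice made here for illustration, not something defined by the answer.

```python
try:
    # Python 3.4+: the supported function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Fall back to the (now removed) parser method on old interpreters.
    unescape = HTMLParser().unescape

result = unescape("&pound;682m")
print(result)
```

Trying html.unescape first means the deprecated method is only touched on interpreters that lack the newer function.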

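Mark Amery Harley Holcombe 9 years ago

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3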

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

Corvax 8 years ago

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

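LoicUV 11 years ago

Beautiful Soup 4 lets you set a formatter to your output

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples: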

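from bs4 import BeautifulSoup

# Sample input assumed here so the snippet runs on its own.
soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>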

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

-3

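Ashwini Chaudhary 11 years ago

This may not be relevant here, but to eliminate these HTML entities from an entire document, you can do something like this. (Assume document = page. Please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears. I'm new to this.)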

import re
import HTMLParser  # Python 2 module name

regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
h = HTMLParser.HTMLParser()  # create the parser once, outside the loop
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces the HTML entity with its unescaped value
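On Python 3.4+, a shorter route to the same goal might be to hand the whole text to html.unescape() in a single call, with no regex or loop. The sample page string below is made up for illustration. Bear in mind that this also decodes &lt; and &amp;, so it suits extracted text better than markup that will be parsed again.

```python
import html

# `page` stands in for the document text; this sample is hypothetical.
page = "<p>&pound;682m &amp; rising, caf&eacute; &#8364;5</p>"

# One call decodes every named and numeric reference in the string.
decoded_page = html.unescape(page)
print(decoded_page)
```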