Extracting the text inside tags within an HTML comment using BeautifulSoup

text-extraction beautifulsoup web-scraping python-3.x

user8188893 · Tech Community · 7 years ago

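I want to extract the text of the list items inside an HTML comment, without the list tags. The code below prints the whole commented-out markup instead.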

from bs4 import BeautifulSoup, Comment


html = """
<html>
<body>
<!--
  <ul>
     <li>10</li>
     <li>20</li>
     <li>30</li>
     </ul>
 -->

</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

for numbers in soup.findAll(text=lambda text:isinstance(text, Comment)):
    print(numbers.extract())

The result is:

<ul>
<li>10</li>
<li>20</li>
<li>30</li>
</ul>

Expected result:

10
20
30

2 replies | latest 7 years ago

SIM 7 years ago

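Try the following. BeautifulSoup keeps a comment's contents as a plain string, so parse that string with BeautifulSoup a second time and then look for the li tags in the result. This gives you the output you want.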

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<!--
  <ul>
     <li>10</li>
     <li>20</li>
     <li>30</li>
     </ul>
 -->

</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all(text=lambda text:isinstance(text, Comment)):
    data = BeautifulSoup(item,"html.parser")
    for number in data.find_all("li"):
        print(number.text)

Output:

10
20
30
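
The same re-parsing idea can be wrapped in a small helper that returns the values instead of printing them. This is only a sketch: the function name `li_texts_in_comments` and the one-line `sample` string are made up for illustration, and it uses the `string=` keyword, which newer BeautifulSoup releases prefer over `text=`.

```python
from bs4 import BeautifulSoup, Comment


def li_texts_in_comments(html):
    """Return the stripped text of every <li> found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    # Comment is a NavigableString subclass, so it can be matched by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # The comment body is plain text; parse it as its own document.
        inner = BeautifulSoup(comment, "html.parser")
        texts.extend(li.get_text(strip=True) for li in inner.find_all("li"))
    return texts


# Hypothetical sample input, shortened from the HTML in the question.
sample = "<body><!-- <ul><li>10</li><li>20</li><li>30</li></ul> --></body>"
print(li_texts_in_comments(sample))
```

Returning a list makes it easy to convert the values afterwards, for example with `[int(t) for t in li_texts_in_comments(sample)]`.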

gout 7 years ago

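Find all "li" tags and print only their text. Note that this only works on a soup built from the comment's contents (such as data in the answer above); the original soup contains no li tags, because they sit inside a comment.

for tag in data.find_all("li"):
    print(tag.text)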
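
Which soup the loop runs against matters here. The following check is a rough sketch with a shortened, made-up `html` string; it compares searching the original document directly with searching the re-parsed comment.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical input, shortened from the HTML in the question.
html = "<body><!-- <ul><li>10</li><li>20</li></ul> --></body>"
soup = BeautifulSoup(html, "html.parser")

# Markup inside a comment is kept as one Comment string, not as tags,
# so searching the original soup finds no <li> elements.
direct = soup.find_all("li")

# Parsing the comment text separately turns the markup into real tags.
comment = soup.find(string=lambda s: isinstance(s, Comment))
reparsed = BeautifulSoup(comment, "html.parser").find_all("li")

print(len(direct), [li.text for li in reparsed])
```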
