Beautiful Soup cannot find a CSS class if the object has other classes, too

beautifulsoup screen-scraping python

endolith · 15 years ago

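If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, that element will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?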

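4 answers | as of 15 years ago

endolith 15 years ago

Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than as two classes, ['class1','class2']. A workaround is to search for the class with a regular expression instead of a string.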

This works:

import re

soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
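
One caveat with the word-boundary pattern is that \b also matches next to a hyphen, so a class such as class1-wide would be picked up as well. The sketch below uses only the standard re module and a made-up list of class attribute values to show the difference. The stricter whole_token pattern is an assumption of mine, not part of the answer above.

```python
import re

# Made-up class attribute values, as they might appear in the HTML source.
class_values = ["class1", "class1 class2", "class2 class1", "class10", "class1-wide"]

# Pattern from the answer above: word boundaries around the class name.
word_boundary = re.compile(r"\bclass1\b")

# Stricter variant (hypothetical, not from the answer): the name must be
# delimited by whitespace or by the start/end of the attribute value.
whole_token = re.compile(r"(?<!\S)class1(?!\S)")

matched_by_boundary = [v for v in class_values if word_boundary.search(v)]
matched_by_token = [v for v in class_values if whole_token.search(v)]

print(matched_by_boundary)  # includes 'class1-wide', excludes 'class10'
print(matched_by_token)     # excludes both 'class1-wide' and 'class10'
```

Either compiled pattern can be passed as the 'class' value in the findAll call above.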

Kugel 11 years ago

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

In [1]: import bs4

In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')

In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]

Also, you don't have to type findAll anymore.

Inaimathi 11 years ago

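Use lxml. It works with multiple class values separated by spaces ('class1 class2').

Despite its name, lxml is also for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does (which is BeautifulSoup's claim to fame). It also provides a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else that doesn't allow anything that isn't pure Python.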
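
For readers who want to try the lxml route, here is a minimal sketch, assuming the lxml package is installed. The sample markup is invented, and the XPath expression is a common idiom for matching one class token rather than something taken from this answer: it pads the class attribute with spaces and looks for ' class1 ' inside it.

```python
import lxml.html

# Invented sample markup for illustration.
html = """
<html><body>
<div class="class1">a</div>
<p class="class1 class2">b</p>
<p class="class10">c</p>
</body></html>
"""

doc = lxml.html.fromstring(html)

# XPath 1.0 has no built-in class-token test, so pad the normalized
# attribute value with spaces and search for the padded class name.
xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' class1 ')]"

matches = [(el.tag, el.get("class")) for el in doc.xpath(xpath)]
print(matches)  # the div and the first p, but not the class10 paragraph
```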

AbcAeffchen 10 years ago

For example:

soup.find_all("a", class_="class1")
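
To make the bs4 behaviour concrete, here is a small sketch, assuming the bs4 package is installed and using Python's built-in html.parser. The sample markup and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
html = """
<div class="class1">a</div>
<p class="class1 class2">b</p>
<p class="class2">c</p>
<a class="class1 class3" href="#">d</a>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class within a multi-valued class attribute.
any_tag = [tag.name for tag in soup.find_all(class_="class1")]

# Restrict the search to <a> tags, as in the answer above.
links = [tag.name for tag in soup.find_all("a", class_="class1")]

# A CSS selector finds the same elements as the first search.
selected = [tag.name for tag in soup.select(".class1")]

print(any_tag, links, selected)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.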
