Python + BeautifulSoup: find an HTML tag where one of its attributes contains a matching text pattern?

beautifulsoup html python

IAspireToBeGladOS · 6 years ago

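I'm new to both Python and BeautifulSoup. I'd like to know how to match only the <div> elements that have an attribute containing a specific text pattern. For example, every case where 'id' : 'testid', or anywhere that 'class' : 'title'.

This is what I have so far: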

def cleanup(filename):
    fh = open(filename, "r")

    soup = BeautifulSoup(fh, 'html.parser')

    for div_tag in soup.find('div', {'class':'title'}):
        h2_tag = soup.h2_tag("h2")
        div_tag.div.replace_with(h2_tag)
        del div_tag['class']

    f = open("/tmp/filename.modified", "w")
    f.write(soup.prettify(formatter="html5"))
    f.close()

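Once I can match all of these specific elements, I can work out how to manipulate them from there (delete the class, rename the tag itself from <div> to <h1>, and so on). So I realize the actual cleanup part probably doesn't work as written.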

2 answers | latest 6 years ago

IAspireToBeGladOS · 6 years ago

This seems to work, but please let me know if there is a better or more standard way.

for tag in soup.findAll(attrs={'class':'title'}):
    del tag['class']
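
The loop above matches an exact class value. For the "matching text pattern" part of the question, one possible approach is to pass a compiled regular expression or use a CSS attribute selector. The following is only a sketch: the sample markup and the variable names are made up for illustration, and it assumes a bs4 version recent enough to ship CSS selector support through select().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div id="testid-1" class="title main">First</div>
<div id="other" class="subtitle">Second</div>
<div id="testid-2">Third</div>
<p class="title">Not a div</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class match: class_ is compared against each class on the tag,
# so "title main" matches but "subtitle" does not.
title_divs = soup.find_all("div", class_="title")

# Pattern match: a compiled regex is searched against the attribute value.
testid_divs = soup.find_all("div", id=re.compile(r"^testid"))

# CSS selector form of the same idea: [id^="..."] means "starts with".
css_divs = soup.select('div[id^="testid"]')

print([d.get_text() for d in title_divs])
print([d["id"] for d in testid_divs])
print([d["id"] for d in css_divs])
```

Restricting the search to "div" keeps the <p class="title"> element out of the results, which the attrs-only call above would include.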

ewwink · 6 years ago

.find(tagName, attributes) returns a single element

.find_all(tagName, attributes) returns multiple elements (a list)

You can find more in the documentation.
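
To make the difference concrete, here is a small sketch; the markup and variable names are invented for the example. It also hints at why the loop in the question misbehaves: iterating over the result of find() walks the children of one tag rather than the matching tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<div class="title">A</div><div class="title">B</div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", {"class": "title"})      # a single Tag
every = soup.find_all("div", {"class": "title"})  # a list-like ResultSet of Tags
missing = soup.find("div", {"class": "absent"})   # None when nothing matches

print(first.get_text())
print(len(every))
print(missing)

# Loop over find_all() to touch every matching tag.
for tag in every:
    del tag["class"]

print(soup)
```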

To replace an element, create the new one with .new_tag(tagName); to delete an attribute, use del element.attrs[attributeName], as shown below:

from bs4 import BeautifulSoup

html = '''
<div id="title" class="testTitle">
  heading h1
</div>
'''
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

print('html before')
print(soup)

div = soup.find('div', id="title")

# delete the class attribute
del div.attrs['class']

print('html after removing attribute')
print(soup)

# to replace, create an h1 element
h1 = soup.new_tag("h1")
# copy the text from the previous element
h1.string = div.text
# uncomment to keep the ID
# h1['id'] = div['id']
div.replace_with(h1)

print('html after replace')
print(soup)
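
As an alternative to building a new element, a tag can also be renamed in place by assigning to its .name attribute. This is a sketch rather than part of the answer above; the sample markup is hypothetical, and the approach may be preferable when the <div> has nested children that should survive the rename.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a nested child, used only for illustration.
html = '<div id="title" class="testTitle">heading <b>h1</b></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", id="title")
div.name = "h1"    # rename the tag; children and other attributes are kept
del div["class"]   # drop the class attribute

print(soup)
```

Copying div.text into a new tag, by contrast, keeps only the text and drops nested markup such as the <b> element.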
