How do I extract the HTML table that follows a specific heading?

html-parsing beautifulsoup python-3.x python

Alexander Engelhardt · 6 years ago

I am parsing an HTML file with BeautifulSoup. The file looks something like this:

<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key B</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr>
    <td>Key C</td>
  </tr>
  <tr>
    <td>I WANT THIS STRING</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>

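I want to extract the string "I WANT THIS STRING". The ideal solution would be to get the first table that follows the h3 heading called "THE GOOD STUFF". I don't know how to do this with BeautifulSoup: I only know how to extract a table that has a specific class, or a table nested inside some particular tag, but not one that follows a particular tag.

As a fallback, I think I could use the string "Key C", assuming it is unique (it almost certainly is) and appears in only one table, but I would feel better using the specific h3 heading.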

3 answers | as of 6 years ago

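PythonSherpa · 6 years ago

Following the logic of @Zroq's answer to another question, this code gives you the table that follows the heading you specify ("THE GOOD STUFF"). Note that I simply put all of your HTML in a variable called "html".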

import re

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")

# Walk the siblings after each matching <h3> and stop at the next <h3>.
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    next_node = header
    while True:
        next_node = next_node.next_sibling
        if next_node is None:
            break
        # Skip the whitespace strings between tags; only handle real tags.
        if isinstance(next_node, Tag):
            if next_node.name == "h3":
                break
            print(next_node)

Output:

<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>

Cheers!
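
For markup shaped like the sample in the question, a shorter variant of the same idea might be to find the heading and ask BeautifulSoup for its next `<table>` sibling with `find_next_sibling`. This is only a sketch: the `html` string below is a trimmed copy of the question's markup, `html.parser` is used so that no extra parser needs to be installed, and taking the last `<td>` as the value is an assumption based on the two-row layout shown.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: the real file has the same shape).
html = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key B</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key A</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading by its exact text, then the first <table> sibling after it.
heading = soup.find("h3", string="THE GOOD STUFF")
table = heading.find_next_sibling("table")

# In the sample layout the wanted value sits in the last <td> of that table.
wanted = table.find_all("td")[-1].get_text(strip=True)
print(wanted)
```

If the heading might be missing, `soup.find` returns `None`, so a real script would probably want to check for that before calling `find_next_sibling`.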

J_H · 6 years ago

The docs explain that if you don't want to use find_all, you can do this:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
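
Applied to the question's headings rather than to `soup.a`, the `next_siblings` loop could look roughly like the sketch below. The `html` string is a shortened stand-in for the question's file, and stopping at the next `<h3>` is an assumption about where a "section" ends.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the question's markup (an assumption, not the real file).
html = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key B</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
<h3>Unimportant heading</h3>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h3", string="THE GOOD STUFF")

first_table = None
for sibling in heading.next_siblings:
    # Whitespace between tags shows up as text nodes; skip anything that is not a tag.
    if not isinstance(sibling, Tag):
        continue
    if sibling.name == "h3":
        # Reached the next section without finding a table.
        break
    if sibling.name == "table":
        first_table = sibling
        break

# Collect the cell texts of the table, or an empty list if none was found.
cells = []
if first_table is not None:
    cells = [td.get_text(strip=True) for td in first_table.find_all("td")]
print(cells)
```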

Leo_28 · 6 years ago

I'm sure there are many more efficient ways to do this, but here is what I can think of right now:

from bs4 import BeautifulSoup

# Read the saved page; the absolute path makes a separate os.chdir unnecessary.
with open("/Users/Downloads/train.html", 'r') as f:
    html_data = f.read()

soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")

# Once the 'Key C' cell has been seen, the next <td> holds the wanted value.
found_key = False
for td in all_td:
    if found_key:
        print(td.text)
        break
    if td.text == 'Key C':
        found_key = True

Output:

I WANT THIS STRING
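
The same "Key C" fallback can probably be written without a flag by locating the key cell directly and then stepping to the next `<td>` with `find_next`. This is a sketch under the question's own assumption that "Key C" appears exactly once. The inline `html` string stands in for the file on disk, and the check that both cells share a table is an extra safeguard that is not in the original answer.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the saved HTML file (assumption: same structure as the question).
html = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key A</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the cell whose text is exactly "Key C".
key_cell = soup.find("td", string="Key C")

# The value is the next <td> in document order.
value_cell = key_cell.find_next("td")

# Safeguard: make sure the value cell belongs to the same table as the key cell.
same_table = value_cell.find_parent("table") is key_cell.find_parent("table")

value = value_cell.get_text(strip=True) if same_table else None
print(value)
```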
