
Separating words at tag boundaries with BeautifulSoup

beautifulsoup python

Роман Коптев · 6 years ago

I'm trying to parse HTML into text with BeautifulSoup, but I've run into a problem: some words are split by tags with no whitespace between them:

<span>word1</span><span>word2</span>

So when I extract the text, I get:

word1word2

Some sentences also run together into one:

INTODUCTION There are many...

Is there a simple way to force BeautifulSoup to separate words at tag boundaries? Or perhaps a way to restore sentence breaks at certain tags?

I have several complex HTML files. I convert them to text like this:

plain_texts = [BeautifulSoup(html, "html.parser").get_text() for html in htmls]

2 answers

RoadRunner · 6 years ago

You can use find_all():

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html><html lang="en"><head><title>words</title></head><body><span>word1</span><span>word2</span></body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
for span in soup.find_all('span'):
    print(span.text)

This prints the text of each <span> on its own line:

word1
word2
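If a single string is wanted rather than one line per tag, the same find_all() result can be joined with a space. This is only a minimal sketch: the sample HTML and the name `joined` are made up for illustration, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input mirroring the snippet from the question.
html_doc = "<html><body><span>word1</span><span>word2</span></body></html>"

soup = BeautifulSoup(html_doc, "html.parser")

# Take the text of every <span> and join the pieces with a single space.
joined = " ".join(span.get_text() for span in soup.find_all("span"))
print(joined)  # word1 word2
```

Note that this only keeps text that sits inside <span> tags; anything outside them is dropped.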

Andrej Kesely · 6 years ago

You can use the replace_with() method (see the BeautifulSoup docs), but whether this works depends on the structure of the HTML:

from bs4 import BeautifulSoup

data = '''
<html><body><span>word1</span><span>word2</span></body></html>
'''

soup = BeautifulSoup(data, 'lxml')
for span in soup.select('span'):
    span.replace_with(span.text + ' ')

print(soup.text.strip())

This prints:

word1 word2
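Another option that may be simpler for whole documents is the `separator` argument of get_text(), which inserts a string between every pair of adjacent text nodes. The sketch below is an assumption about how it could be wired into the list comprehension from the question; the helper name `html_to_text` and the sample `htmls` list are hypothetical.

```python
from bs4 import BeautifulSoup


def html_to_text(html, separator=" "):
    """Extract text, placing `separator` between adjacent text nodes."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text node and skips whitespace-only ones,
    # so the separator does not pile up next to existing spaces.
    return soup.get_text(separator=separator, strip=True)


# Hypothetical sample documents, not the asker's real files.
htmls = [
    "<span>word1</span><span>word2</span>",
    "<h1>Heading</h1><p>There are <b>many</b> words</p>",
]

plain_texts = [html_to_text(html) for html in htmls]
print(plain_texts)  # ['word1 word2', 'Heading There are many words']
```

The trade-off is that the separator goes in at every tag boundary, so a word that is split by inline markup, such as `wor<b>d</b>`, would come out as "wor d". Passing separator="\n" instead keeps each text node on its own line, which can help when sentence breaks matter.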
