
How do I use Python and BeautifulSoup to return text without tags from HTML?

urllib beautifulsoup python

Justin Hill · Tech Community · 7 years ago

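I have been trying to return text from a website. I am trying to return the ownerId and unitId from the example below. Any help is greatly appreciated.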

<script>
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
</script>

1 reply | last activity 7 years ago

coder · 7 years ago

You can use Beautiful Soup like this:

#!/usr/bin/env python

from bs4 import BeautifulSoup

html = '''
<script>
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
</script>
'''

soup = BeautifulSoup(html, "html.parser")
jsinfo = soup.find("script")

d = {}
for line in jsinfo.text.split('\n'):
    try:
        # 'h1.config.ownerId = 7321;' -> key 'ownerId', value '7321'
        d[line.split('=')[0].strip().replace('h1.config.','')] = line.split('=')[1].lstrip().rstrip(';')
    except IndexError:
        # lines without '=' (e.g. blank lines) are skipped
        pass

print('OwnerId:  {}'.format(d['ownerId']))
print('UnitId:   {}'.format(d['unitId']))

OwnerId:  7321
UnitId:   "164"

Any of the other JavaScript variables can be read from the dictionary the same way, with d['variable'].
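
The line-splitting approach above keeps the surrounding quotes on string values (note the "164" in the output). As a rough sketch, assuming the script text has already been pulled out with jsinfo.text, a regular expression could do the same job and normalize the values along the way. The function name parse_config and its prefix argument are made up for this illustration; they are not part of Beautiful Soup.

```python
import re

# Hypothetical helper, not part of Beautiful Soup. It parses lines such as
#   h1.config.ownerId = 7321;
# into a dict. `script_text` stands in for jsinfo.text from the answer above.
def parse_config(script_text, prefix="h1.config."):
    pattern = re.compile(
        r'^\s*' + re.escape(prefix) + r'(\w+)\s*=\s*(.*?)\s*;?\s*$',
        re.MULTILINE,
    )
    result = {}
    for key, raw in pattern.findall(script_text):
        if raw.isdigit():
            # Unquoted digits are treated as integers.
            value = int(raw)
        else:
            # Everything else stays a string, minus surrounding quotes.
            value = raw.strip('"\'')
        result[key] = value
    return result


script_text = '''
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
'''

config = parse_config(script_text)
print(config["ownerId"])  # 7321
print(config["unitId"])   # 164
```

Only plain non-negative integers are converted here; floats, negative numbers, and anything quoted are left as strings, which seemed the safer default for values whose meaning is not known.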

Update

Now, if you have to deal with multiple <script> tags, you can collect all of them and iterate over the result like this:

jsinfo = soup.find_all("script")

jsinfo is of type <class 'bs4.element.ResultSet'>, which you can iterate over like a normal list.

For example, to get the lat and lon values:

#!/usr/bin/env python

from bs4 import BeautifulSoup
import requests

url = 'https://www.your_url'
# the user-agent you specified in the comments
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}

html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
jsinfo = soup.find_all("script")

list_of_interest = ['hl.config.lat', 'hl.config.lon']

d = {}
for line in jsinfo[9].text.split('\n'):
    if any(word in line for word in list_of_interest):
        k,v = line.strip().replace('hl.config.','').split(' = ')
        d[k] = v.strip(';')

print('Lat => {}'.format(d['lat']))
print('Lon => {}'.format(d['lon']))

Lat => "28.06794"
Lon => "-81.754349"

You can extract more variables by adding them to list_of_interest.
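
The index in jsinfo[9] depends on the layout of that particular page and may shift if the page changes. One possible alternative, sketched here under the assumption that the script contents have been collected as plain strings (for instance with [s.text for s in jsinfo]), is to pick the block by what it contains. The helper names find_config_block and extract_values are hypothetical, and the sample strings are invented stand-ins for real page content.

```python
# Hypothetical helpers for illustration; not part of Beautiful Soup or requests.

def find_config_block(script_texts, marker="hl.config."):
    """Return the first script text containing `marker`, or None."""
    for text in script_texts:
        if marker in text:
            return text
    return None


def extract_values(script_text, names, prefix="hl.config."):
    """Collect `prefix + name = value;` assignments for the given names."""
    wanted = {prefix + name for name in names}
    found = {}
    for line in script_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() in wanted:
            # Trim whitespace, the trailing semicolon, and surrounding quotes.
            cleaned = value.strip().rstrip(";").strip().strip('"')
            found[key.strip()[len(prefix):]] = cleaned
    return found


# Invented sample data standing in for [s.text for s in jsinfo].
script_texts = [
    'var analytics = {};',
    '\n    hl.config.lat = "28.06794";\n    hl.config.lon = "-81.754349";\n',
]

block = find_config_block(script_texts)
coords = extract_values(block, ["lat", "lon"])
print(coords)  # {'lat': '28.06794', 'lon': '-81.754349'}
```

Matching on the exact variable name, rather than on a substring of the line, avoids picking up a longer name that merely starts with the same text, such as a variable whose name begins with hl.config.lon.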
