Getting content out of an HTML page with Python

web-crawler beautifulsoup python-2.7 python

Han Zhengzu · Tech Community · 6 years ago

The Chinese website here mainly presents information about a company. Since many pages share the same layout, I decided to learn web scraping in Python.

Basic code

import requests
from bs4 import BeautifulSoup
page = requests.get('http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356')

soup = BeautifulSoup(page.text, 'html.parser')
source_content = soup.find(class_='rightSide').find(class_='content register').find(class_='formestyle')

The information I want to collect

This screenshot was captured from Chrome's Inspect Element view.

Since Chinese text may be hard to follow here, I made a clearer illustration.

<th> the variable name </th> => For example, "company name", "company location"
<td> the target data I want to save </td>

My question

With my basic code, source_content does not contain any of this information. The output looks like this:

Comparing figures 1 and 2, we can see that the longitude and latitude information has disappeared.

How can I get this data with Python? Any advice would be appreciated.

1 answer | 6 years ago

Martin Evans 6 years ago

The data is returned if you supply a Referer header in the request, as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356'
page = requests.get(url, headers={'Referer' : url})
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find(class_='formestyle')

for tr in table.find_all('tr'):
    row = [v.text for v in tr.find_all(['th', 'td'])]
    print(row)

This displays data of the following kind:

['地理坐标:', '经度:104.2153 \xa0\xa0纬度:31.3631']
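
Depending on how the response is decoded, the Chinese labels in a row like this can come out garbled (for example 'ç»åº¦' where '经度' was meant). That pattern usually means UTF-8 bytes were decoded as ISO-8859-1, which is what requests falls back to for text responses that declare no charset. Whether this server omits the charset is an assumption here. The direct fix is to set page.encoding = 'utf-8' before reading page.text. For strings that are already garbled, the following Python 3 sketch reverses the damage; fix_mojibake is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
def fix_mojibake(text):
    """Undo a UTF-8 string that was mistakenly decoded as ISO-8859-1.

    Returns the input unchanged when it does not fit that pattern.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (so it was
        # probably decoded correctly) or the bytes are not valid UTF-8.
        return text


# Simulate the wrong decoding, then repair it.
garbled = '经度:104.2153'.encode('utf-8').decode('latin-1')
print(fix_mojibake(garbled))
```

Text that was decoded correctly in the first place passes through untouched, so the helper is safe to apply to every cell.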

As you can see, the information is now present.
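
To save the values rather than print them, the th/td pairs can be folded into a dictionary keyed by the label. The following Python 3 sketch runs against a small hypothetical HTML snippet instead of the live page, so the English labels, SAMPLE_HTML, table_to_dict and parse_coordinates are all illustrative names and data. It also assumes that each th is followed by its td in the same row and that longitude comes before latitude, as in the output above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page so the example runs offline.
SAMPLE_HTML = """
<table class="formestyle">
  <tr><th>Company name:</th><td>Example Co.</td></tr>
  <tr><th>Coordinates:</th><td>Longitude:104.2153 &nbsp;&nbsp;Latitude:31.3631</td></tr>
</table>
"""


def table_to_dict(table):
    """Map each <th> label to the text of the <td> that follows it."""
    info = {}
    for th in table.find_all('th'):
        td = th.find_next_sibling('td')
        if td is None:
            continue  # a header cell with no data cell beside it
        # Drop a trailing ASCII or full-width colon from the label.
        label = th.get_text(strip=True).rstrip(':：')
        # Collapse whitespace runs, including non-breaking spaces.
        info[label] = ' '.join(td.get_text().split())
    return info


def parse_coordinates(value):
    """Return (longitude, latitude) from the first two numbers, or None."""
    numbers = re.findall(r'-?\d+(?:\.\d+)?', value)
    if len(numbers) < 2:
        return None
    return float(numbers[0]), float(numbers[1])


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
info = table_to_dict(soup.find(class_='formestyle'))
print(info)
print(parse_coordinates(info['Coordinates']))
```

On the real page the same function would take the table found with soup.find(class_='formestyle') after the request with the Referer header.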
