代码之家 › 专栏 › 技术社区 › chitown88

BeautifulSoup-仅返回第一个表

beautifulsoup web-scraping python

chitown88 · 技术社区 · 7 年前

我最近一直在和BeautifulSoup合作。我正试图从 https://www.pro-football-reference.com/teams/mia/2000_roster.htm 地点特别是我想要的是球员的名字和“gs”(游戏开始)。

然而,在执行此操作时,它只返回第一个(“启动器”)表数据。实际上,我对排名第一的那张桌子一点也不感兴趣,我想要第二张桌子,标题是“花名册”。

这是我正在做的代码。正如我所说,除了球员姓名和开始的比赛之外,我真的不想/需要任何东西,但我只是在练习和学习BeautifulSoup。

import pandas as pd
import requests
import bs4

alpha  = requests.get('https://www.pro-football-
reference.com/teams/mia/2000_roster.htm')

beta = bs4.BeautifulSoup(alpha.text,'lxml')


gama = beta.findAll('th',{'data-stat':'pos'})
position = [th.text for th in gama]
position = position[1:]
position = list(filter(None, position))

gama = beta.findAll('td',{'data-stat':'player'})
player = [td.text for td in gama]
player = player[1:]
while 'Defensive Starters' in player: player.remove('Defensive Starters')
while 'Special Teams Starters' in player: player.remove('Special Teams 
Starters')

gama = beta.findAll('td',{'data-stat':'age'})
age = [td.text for td in gama]
age = list(filter(None, age))

gama = beta.findAll('td',{'data-stat':'gs'})
gs = [td.text for td in gama]
gs = list(filter(None, gs))

target = pd.DataFrame(

{
'player_name':player,
'position':position,
'gs':gs,
'age':age
})

有人知道我哪里出错了吗?或者是另一种方式?

1 回复 | 直到 7 年前

SIM 7 年前

要从该表中获取内容,您需要使用任何浏览器模拟器,因为该部分的响应是动态生成的。不过,不需要任何浏览器模拟器,就可以轻松访问第一个表中的数据。我在这种情况下尝试了硒:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text
    gs = items.select("[data-stat='gs']")[0].text
    print(player,gs)

driver.quit()

PlayerÂ  GS
Trace Armstrong* 0
John Bock 1
Tim Bowens 15
Lorenzo Bromell 0
Autry Denson 0
Mark Dixon 15
Kevin Donnalley 16

出于某种原因,如果您遇到此类错误,那么这次也不会有此类错误选项:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text if items.select("[data-stat='player']") else ""
    gs = items.select("[data-stat='gs']")[0].text if items.select("[data-stat='gs']") else ""
    print(player,gs)

driver.quit()

推荐文章

yash agarwal · Python Selenium-如何基于span标记内的文本提取元素?

2 年前

Amar · 漂亮汤错误:“NoneType”对象没有属性“find\u all”

2 年前

ihonestlydontKnow · Python(BeautifulSoup)仅1个结果

2 年前

ARH · 如何使用Selenium识别网站中使用的所有标签

2 年前

Kevin Rodgers Jr. · Python BeautifulSoup:在in select语句中排除其他标记

2 年前

Jensen Holm · 在非常大的字符串中查找链接时遇到问题

2 年前

koshiboto · 使用python(bs4)从段落中获取第一个不位于括号之间的常规链接

2 年前

LaddieMawery · Beautifulsoup获取嵌套跨元素时遇到问题

2 年前

Ventorro · Python和Web抓取的新手。抓取一个HTML表格——但是它并没有显示所有的列

2 年前

aphexlog · 正在尝试使用BeautifulSoup将新行附加到表体中的第一行

2 年前