
Unable to find a specific class component using BeautifulSoup

beautifulsoup web-scraping python

limitless · 6 years ago

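I want to build a scraper for a movie website to collect a list of movie names. I tried using BeautifulSoup to parse the HTML, and I can see that each movie sits in an element with the class "movie-row", but calling select on this class does not retrieve the corresponding data from the site. The closest component I can get from the HTML is the parent class .quickbook-section .

This is the code I wrote.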

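from bs4 import BeautifulSoup
import util  # my own helper module that provides simple_get

def get_movies_names():
    url = "https://www.yesplanet.co.il/#/buy-tickets-by-cinema?in-cinema=1025&at=2018-11-09&view-mode=list"
    raw_html = util.simple_get(url)
    bs = BeautifulSoup(raw_html, 'html.parser')
    # Expected to return the movie rows, but no matching elements come back
    return bs.select(".movie-row")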

(simple_get is just a function that returns the HTML response content.)

2 Answers

Jamie Scott · 6 years ago

It looks like the site renders its movie data with JavaScript.

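BeautifulSoup is not a browser, so it has no live DOM and cannot run JavaScript code. All it does is take the page content it is given and parse it. If you view the source of the page in question (in most browsers, right-click and choose "View Source") and search for movie-row , you will find no matches.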
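The same check can be scripted instead of done by hand. The following is a minimal sketch that uses only the standard library; the `STATIC_HTML` snippet and the `count_class` helper are made up for illustration and merely stand in for the raw response of a JavaScript-rendered page, where the container is present but the rows are created later by a script.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical static HTML (an assumption, not the real page source):
# the parent container exists, the movie rows do not.
STATIC_HTML = """
<div class="quickbook-section"></div>
<script src="app.js"></script>
"""

print(count_class(STATIC_HTML, "movie-row"))
print(count_class(STATIC_HTML, "quickbook-section"))
```

If the count for a class is zero in the raw response but the elements are visible in the browser's inspector, the content is most likely being added by a script after the page loads.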

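In that case you have to find another way to scrape the data: try to work out what the page's JavaScript code does and go from there. Alternatively, you may want to look into using Selenium and PhantomJS.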

Cohan · 6 years ago

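As others have pointed out, the data is loaded via JavaScript, and BS4 is not really suited to that. When you see data being loaded via JavaScript, you can be fairly sure there is an API call somewhere. Rather than trying to scrape the rendered page, check whether the page is fetching a JSON object and whether you can access that JSON object without an API key.

If you need something different, you may need to adjust the URL pattern a little.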
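One way to make the "is it JSON, and is it open?" check concrete is sketched below. The helper name `looks_like_open_json` and the sample values are assumptions for illustration; the function only inspects a status code, a Content-Type header and a body, so it can be fed from whatever HTTP client you use.

```python
import json


def looks_like_open_json(status_code, content_type, body):
    """Return True if a response is a successful, parseable JSON document.

    A 401 or 403 status usually means the endpoint wants credentials
    or an API key, so anything other than 200 is treated as not open.
    """
    if status_code != 200:
        return False
    if "json" not in content_type.lower():
        return False
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Made-up sample responses
print(looks_like_open_json(200, "application/json;charset=UTF-8", '{"body": {}}'))
print(looks_like_open_json(403, "application/json", '{"error": "forbidden"}'))
print(looks_like_open_json(200, "text/html", "<html></html>"))
```

With requests, the three arguments would come from `response.status_code`, `response.headers.get("Content-Type", "")` and `response.text`.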

import requests
import urllib3

# Suppress the warning that verify=False would otherwise trigger
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://www.yesplanet.co.il/il/data-api-service/v1/poster/10100/by-showing-type/SHOWING?lang=he_IL&ordering=desc"

# Fetch the data; verify=False skips TLS certificate verification
response = requests.get(url, verify=False)

# Decode the JSON body
j = response.json()

# Process whatever you need
for poster in j['body']['posters']:
    print(poster['url'], poster['featureTitle'])

The script's output looks like this:

/films/bohemian-rhapsody רפסודיה בוהמית
/films/the-other-story סיפור אחר
/films/the-girl-in-the-spiders-web הנערה ברשת העכביש
/films/the-nutcracker-and-the-four-realms מפצח האגוזים וארבע הממלכות
/films/911 11 בספטמבר
/films/virgins אין בתולות בקריות

Each poster object has the keys attributes , code , dateStarted , featureTitle , mediaList , posterSrc , url and weight .
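To get back to the original goal of a list of movie names, the titles can be collected into a list rather than printed. This is a sketch only: the `SAMPLE` payload is made up to mimic the body/posters shape used above, and `movie_titles` is a hypothetical helper name.

```python
import json

# Hypothetical payload mimicking the shape of the API response
# (body -> posters); the values are invented for illustration.
SAMPLE = """
{"body": {"posters": [
  {"code": "a1", "featureTitle": "Bohemian Rhapsody",
   "url": "/films/bohemian-rhapsody", "weight": 2},
  {"code": "b2", "featureTitle": "The Other Story",
   "url": "/films/the-other-story", "weight": 1},
  {"code": "c3", "url": "/films/untitled", "weight": 0}
]}}
"""


def movie_titles(payload):
    """Return the featureTitle of every poster, skipping posters without one."""
    posters = payload.get("body", {}).get("posters", [])
    return [p["featureTitle"] for p in posters if p.get("featureTitle")]


titles = movie_titles(json.loads(SAMPLE))
print(titles)
```

Using `.get` with defaults keeps the helper from raising a KeyError if the response shape changes or a poster lacks a title.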

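If you are wondering how I found this URL: I opened the Chrome developer console and reloaded the page. Filter the network requests on XHR (XMLHttpRequest) and you will see the URL that returns the data.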
