
Problem parsing HTML with BeautifulSoup or Golang colly

beautifulsoup web-scraping go python

rofls · 6 years ago

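FTR, I have successfully written plenty of scrapers in both of these frameworks, but this one has me stumped. Here is a screenshot of the data I am trying to scrape (you can also open the actual link used in the GET request):

I am trying to target div.section_content with the following: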

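import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
soup.find_all("div", {"class": "section_content"})  # find_all is the current name for findAll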

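Printing the result of the last line shows some other divs, but not the div that holds the pitching data.

However, I can see it in the text, so it is not a JavaScript-triggered loading issue (the word "Pitching" only appears in that table):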

>>> "Pitching" in soup.text
True

Here is an abbreviated version of one of my Golang attempts:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.baseball-reference.com"),
    )
    // Print the text of every div.section_content found inside a table wrapper.
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("div.section_content"))
    })
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}

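1 answer | 6 years ago

damd · 6 years ago

It looks to me like that HTML is actually commented out, which is why BeautifulSoup cannot find it. Either remove the comment markers from the HTML string before parsing it, or use BeautifulSoup to extract the comments and parse the returned values.

For example: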

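from bs4 import BeautifulSoup, Comment

# soup is the BeautifulSoup object built from the page in the question
for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment = element.extract()  # detach the comment node from the tree
    comment_soup = BeautifulSoup(comment, "html.parser")  # parse the comment body as HTML
    # work with comment_soup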
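
To try both options from the answer without hitting the network, here is a minimal sketch that runs against a small inline document. SAMPLE_HTML and the two helper functions are hypothetical names made up for illustration, and the markup is only an assumption about what the real page roughly looks like (a div.section_content wrapped in an HTML comment), not a copy of it.

```python
import re

from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one visible div, one inside a comment.
SAMPLE_HTML = """
<div class="table_wrapper">
  <div class="section_content" id="visible">Batting</div>
  <!--
  <div class="section_content" id="hidden"><table><tr><td>Pitching</td></tr></table></div>
  -->
</div>
"""


def find_in_comments(html, css_class):
    """Option 1: parse each HTML comment as its own document and search inside it."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment_soup = BeautifulSoup(str(comment), "html.parser")
        matches.extend(comment_soup.find_all("div", class_=css_class))
    return matches


def find_after_uncommenting(html, css_class):
    """Option 2: strip the comment markers from the raw string, then parse once."""
    uncommented = re.sub(r"<!--|-->", "", html)
    soup = BeautifulSoup(uncommented, "html.parser")
    return soup.find_all("div", class_=css_class)


hidden_ids = [div.get("id") for div in find_in_comments(SAMPLE_HTML, "section_content")]
all_ids = [div.get("id") for div in find_after_uncommenting(SAMPLE_HTML, "section_content")]
print(hidden_ids)
print(all_ids)
```

Option 1 returns only the divs that were hidden inside comments, while option 2 returns visible and previously commented divs together. Stripping markers with a regular expression is crude: it also uncomments anything the page author really meant to hide, so option 1 is likely the safer default.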
