代码之家 › 专栏 › 技术社区 › Minial

BeautifulSoup循环没有遍历其他节点

beautifulsoup python

Minial · 技术社区 · 6 年前

关于这一点,有很多类似的情况;但我一直在与其他人进行比较。 Getting from Clustered Nodes 但不知怎么的,我不确定为什么 for loop 不是从其他元素迭代和获取文本,而是只从节点的第一个元素。

from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
     try:
        for container in html_soup.find_all('div', {'class': 'section-trending-search-list'}):
            topic = container.select_one(
                'div._1waRmo')
            if topic:
                print(1)
                d = {
                    'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
        return d
    except:
        d = None

findDiv()
print(l)

2 回复 | 直到 6 年前

Bitto 6 年前

from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
     try:
        for container in html_soup.find_all('div', {'class': '_25qBG5'}):
            topic = container.select_one('div._1waRmo')
            if topic:
                d = {'Titles': topic.text.replace("\n", "")}
                l.append(d)
        return d
     except:
        d = None

findDiv()
print(l)

输出:

[{'Titles': 'school backpack'}, {'Titles': 'oppo case'}, {'Titles': 'baby chair'}, {'Titles': 'car holder'}, {'Titles': 'sling beg'}]

我再次建议你使用 selenium . 如果您再次运行此命令,您将看到列表中有一组不同的5个字典。每次您提出请求时,他们都会提供5个随机趋势项。但是他们有一个“改变”按钮。如果你使用硒元素,你可能只需点击它,就可以一直丢弃所有的趋势项目。

user5292841 6 年前

试试这个: TopLevel找到了选项的根,然后我们找到了它下面的所有div。我希望这是你想要的。

from requests import get
from bs4 import BeautifulSoup

url = 'https://shopee.com.my/'
l = []

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')


def findDiv():
    try:
        toplevel = html_soup.find('._25qBG5')
        for container in toplevel.find_all('div'):
            topic = container.select_one('._1waRmo')
            if topic:
                print(1)
                d = {'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
                return d
    except:
        d = None

findDiv()
print(l)

这可以用本地文件枚举。当我尝试使用给定的URL时,网站没有返回您显示的HTML。

from requests import get
from bs4 import BeautifulSoup

url = 'path_in_here\\test.html'
l = []

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

example = open(url,"r")
text = example.read()

#response = get(url, headers=headers)
#html_soup = BeautifulSoup(response.text, 'html.parser')
html_soup = BeautifulSoup(text, 'html.parser')

print (text)

def findDiv():
    #try:
        print("finding toplevel")
        toplevel = html_soup.find("div", { "class":  "_25qBG5"} )
        print ("found toplevel")
        divs = toplevel.findChildren("div", recursive=True)
        print("found divs")

        for container in divs:
            print ("loop")
            topic = container.select_one('.1waRmo')
            if topic:
                print(1)
                d = {'Titles': topic.text.replace("\n", "")}
                print(2)
                l.append(d)
                return d
    #except:
    #    d = None
    #    print ("error")

findDiv()
print(l)