
BeautifulSoup loop is not iterating over the other nodes

  • Minial  · 6 years ago

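    There are many similar questions about this, and I have been comparing my case with others, such as "Getting from Clustered Nodes". Still, I'm not sure why my for loop only gets the text from the first element of the node instead of iterating over the other elements as well.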

    from requests import get
    from bs4 import BeautifulSoup
    
    url = 'https://shopee.com.my/'
    l = []
    
    headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
    
    response = get(url, headers=headers)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    
    def findDiv():
        try:
            for container in html_soup.find_all('div', {'class': 'section-trending-search-list'}):
                topic = container.select_one(
                    'div._1waRmo')
                if topic:
                    print(1)
                    d = {
                        'Titles': topic.text.replace("\n", "")}
                    print(2)
                    l.append(d)
            return d
        except:
            d = None
    
    findDiv()
    print(l)
    

    The HTML elements I'm trying to access.

    2 Answers

  •   Bitto    6 years ago
    from requests import get
    from bs4 import BeautifulSoup
    
    url = 'https://shopee.com.my/'
    l = []
    
    headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
    
    response = get(url, headers=headers)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    
    def findDiv():
        # Loop over every div with class '_25qBG5' instead of the single outer list container.
        for container in html_soup.find_all('div', {'class': '_25qBG5'}):
            topic = container.select_one('div._1waRmo')
            if topic:
                l.append({'Titles': topic.text.replace("\n", "")})
        return l
    
    findDiv()
    print(l)
    

    Output:

    [{'Titles': 'school backpack'}, {'Titles': 'oppo case'}, {'Titles': 'baby chair'}, {'Titles': 'car holder'}, {'Titles': 'sling beg'}]
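
    The difference from the question's code is the loop target. As a minimal sketch on made-up markup (the HTML below is hypothetical and only borrows the class names from this thread; the real page may be structured differently), this shows why looping over one outer container yields a single title: `select_one()` returns only the first match, while `select()` returns all of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only the class names come from the thread.
SAMPLE_HTML = """
<div class="section-trending-search-list">
  <div class="_25qBG5"><div class="_1waRmo">school backpack</div></div>
  <div class="_25qBG5"><div class="_1waRmo">oppo case</div></div>
  <div class="_25qBG5"><div class="_1waRmo">baby chair</div></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("div", {"class": "section-trending-search-list"})

# select_one() stops at the first match, so one outer container gives one title.
first_only = outer.select_one("div._1waRmo").text

# select() returns every match, so all titles are collected.
all_titles = [div.text for div in outer.select("div._1waRmo")]

print(first_only)
print(all_titles)
```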
    

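    Again, I suggest you use selenium. If you run this again, you will see a different set of 5 dictionaries in the list: every time you make a request, the site serves 5 random trending items. They do have a "change" button, though. With selenium you might be able to just click it and keep scraping until you have all the trending items.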
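
    If a browser driver is not an option, one rough alternative (not the selenium approach suggested above) is to repeat the request several times and merge the batches, dropping repeats. The sketch below shows only the merging step; `merge_batches` is a hypothetical helper and the two batches are made-up sample data, not real responses.

```python
def merge_batches(batches):
    """Merge lists of {'Titles': ...} dicts, keeping first-seen order and dropping repeats."""
    seen = set()
    merged = []
    for batch in batches:
        for item in batch:
            title = item['Titles']
            if title not in seen:
                seen.add(title)
                merged.append(item)
    return merged


# Made-up batches standing in for the results of two separate requests.
batch_1 = [{'Titles': 'school backpack'}, {'Titles': 'oppo case'}]
batch_2 = [{'Titles': 'oppo case'}, {'Titles': 'car holder'}]

merged = merge_batches([batch_1, batch_2])
print(merged)
```

    Because each request returns a random sample, this gives no guarantee of ever seeing every item; it only grows the set over time.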

  •   user5292841    6 years ago

    Try this: toplevel finds the root of the options, and then we find all the divs underneath it. I hope this is what you want.

    from requests import get
    from bs4 import BeautifulSoup
    
    url = 'https://shopee.com.my/'
    l = []
    
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    
    response = get(url, headers=headers)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    
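    def findDiv():
        # select_one() accepts a CSS selector; find() would treat '._25qBG5' as a tag name.
        toplevel = html_soup.select_one('._25qBG5')
        if toplevel is None:
            return  # the page did not contain the expected markup
        for container in toplevel.find_all('div'):
            topic = container.select_one('._1waRmo')
            if topic:
                # Append every match; returning here would stop after the first one.
                l.append({'Titles': topic.text.replace("\n", "")})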
    
    findDiv()
    print(l)
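
    A note on the lookup calls used in this thread: `find()` takes a tag name plus attribute filters, whereas `select_one()` takes a CSS selector, so a selector string such as `'._25qBG5'` only works with the latter. A small sketch on hypothetical markup (the HTML is made up; only the class names come from the thread):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div class="_25qBG5"><div class="_1waRmo">car holder</div></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() looks for a tag literally named '._25qBG5', which does not exist.
by_find_selector = soup.find('._25qBG5')

# Either pass the class as an attribute filter to find() ...
by_find_class = soup.find('div', class_='_25qBG5')

# ... or use select_one(), which does accept CSS selectors.
by_select_one = soup.select_one('div._25qBG5')

print(by_find_selector)
print(by_find_class.text, by_select_one.text)
```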
    

    This works when enumerating a local file. When I tried the given URL, the website did not return the HTML you showed.

    from requests import get
    from bs4 import BeautifulSoup
    
    url = 'path_in_here\\test.html'
    l = []
    
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    
    # Read the saved page from disk; the file is closed automatically.
    with open(url, "r") as example:
        text = example.read()
    
    #response = get(url, headers=headers)
    #html_soup = BeautifulSoup(response.text, 'html.parser')
    html_soup = BeautifulSoup(text, 'html.parser')
    
    print (text)
    
    def findDiv():
        #try:
            print("finding toplevel")
            toplevel = html_soup.find("div", { "class":  "_25qBG5"} )
            print ("found toplevel")
            divs = toplevel.findChildren("div", recursive=True)
            print("found divs")
    
            for container in divs:
                print ("loop")
                topic = container.select_one('._1waRmo')
                if topic:
                    print(1)
                    d = {'Titles': topic.text.replace("\n", "")}
                    print(2)
                    l.append(d)
        #except:
        #    d = None
        #    print ("error")
    
    findDiv()
    print(l)
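
    Since the live URL did not return the markup from the question, it can help to check for the expected class before parsing further. The sketch below is only an illustration: `contains_class` is a hypothetical helper, and both HTML strings are made up (one possible cause, assumed here, is a page whose content is filled in by a script after loading).

```python
from bs4 import BeautifulSoup


def contains_class(html, class_name):
    """Return True if any element in the HTML carries the given CSS class."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find(class_=class_name) is not None


# Two made-up responses: one with the markup present, one with only a script shell.
rendered = '<div class="_25qBG5"><div class="_1waRmo">sling beg</div></div>'
script_shell = '<div id="main"></div><script src="bundle.js"></script>'

rendered_ok = contains_class(rendered, '_25qBG5')
shell_ok = contains_class(script_shell, '_25qBG5')
print(rendered_ok, shell_ok)
```

    Running the same check on `response.text` from the real request would show whether the parser ever had the target elements to work with.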