
Scraping <span> text with BeautifulSoup

  •  1
  • Kashif Jilani  · Tech Community  · 4 years ago

    I am using BeautifulSoup to scrape data from a website, but I can't find a way to print the text inside the span elements. The structure looks like this:

    <span class="greyText smallText">
                    avg rating 4.02 —
                    132,623 ratings  —
                    published 2014
                  </span>
    <span class="greyText smallText">
                    avg rating 4.03 —
                    82,319 ratings  —
                    published 2015
                  </span>
    

    I need to extract the average rating and the number of ratings as separate values.

    import requests
    from bs4 import BeautifulSoup as bs
    
    url = "https://someurl"
    page = requests.get(url)
    soup = bs(page.content, 'html.parser')
    print(soup)
    # find_all returns the matching <span> tags, not their text
    ratings = soup.find_all('span', attrs={'class': 'greyText smallText'})
    
    Replies  |  up to 4 years ago
        1
  •  1
  •   Andrej Kesely    4 years ago

    Alternative solution: you can use the re module to extract the average rating:

    import re
    from bs4 import BeautifulSoup
    
    txt = '''<span class="greyText smallText">
                    avg rating 4.02 —
                    132,623 ratings  —
                    published 2014
                  </span>
    <span class="greyText smallText">
                    avg rating 4.03 —
                    82,319 ratings  —
                    published 2015
                  </span>'''
    
    soup = BeautifulSoup(txt, 'html.parser')
    
    for span in soup.select('span.greyText.smallText'):
        avg_rating = re.search(r'avg rating ([\d.]+)', span.text)
        if avg_rating:
            print(avg_rating[1])
    

    Prints:

    4.02
    4.03
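
    The question also asks for the number of ratings. As a rough sketch, the same regex approach could be extended to capture both values in one pass. The names PATTERN and parse_ratings are hypothetical, and the pattern assumes the text always follows the "avg rating X — N ratings" layout shown in the question:

```python
import re
from bs4 import BeautifulSoup

html = '''<span class="greyText smallText">
                avg rating 4.02 —
                132,623 ratings  —
                published 2014
              </span>
<span class="greyText smallText">
                avg rating 4.03 —
                82,319 ratings  —
                published 2015
              </span>'''

# Hypothetical pattern: group 1 is the average, group 2 the ratings count.
# It assumes the two values are separated by an em dash, as in the sample.
PATTERN = re.compile(r'avg rating\s+(\d+(?:\.\d+)?)\s+—\s+([\d,]+)\s+ratings')


def parse_ratings(markup):
    """Return a list of (average, count) tuples, one per matching span."""
    soup = BeautifulSoup(markup, 'html.parser')
    results = []
    for span in soup.select('span.greyText.smallText'):
        match = PATTERN.search(span.get_text())
        if match:
            average = float(match.group(1))
            # Drop the thousands separators before converting to int.
            count = int(match.group(2).replace(',', ''))
            results.append((average, count))
    return results


print(parse_ratings(html))
```

    Spans that do not match the pattern are skipped rather than raising an error, which may or may not be what you want on a real page.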
    
        2
  •  0
  •   bigbounty    4 years ago
    In [32]: [i.text.strip() for i in soup.find_all("span",class_="greyText smallText")]
    Out[32]:
    ['avg rating 4.02 —\n                132,623 ratings  —\n                published 2014',
     'avg rating 4.03 —\n                82,319 ratings  —\n                published 2015']
    

    The rating as a separate value:

    In [48]: [i.text.strip().split("\n")[0] for i in soup.find_all("span",class_="greyText smallText")]
    Out[48]: ['avg rating 4.02 —', 'avg rating 4.03 —']
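
    The values above still carry the "avg rating" label and the trailing dash. One possible way to get clean, typed values without regular expressions is to split each span's text on the dash separator. This is only a sketch: split_fields is a hypothetical helper, and it assumes every span has exactly three dash-separated parts in the order shown in the question:

```python
from bs4 import BeautifulSoup

html = '''<span class="greyText smallText">
                avg rating 4.02 —
                132,623 ratings  —
                published 2014
              </span>
<span class="greyText smallText">
                avg rating 4.03 —
                82,319 ratings  —
                published 2015
              </span>'''


def split_fields(markup):
    """Return one dict per span with avg_rating, ratings and published."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for span in soup.find_all("span", class_="greyText smallText"):
        # Split on the em dash separator and trim surrounding whitespace.
        parts = [part.strip() for part in span.get_text().split("—")]
        if len(parts) != 3:
            continue  # unexpected layout, skip this span
        avg_text, count_text, year_text = parts
        rows.append({
            "avg_rating": float(avg_text.replace("avg rating", "").strip()),
            "ratings": int(count_text.replace("ratings", "").replace(",", "").strip()),
            "published": int(year_text.replace("published", "").strip()),
        })
    return rows


for row in split_fields(html):
    print(row["avg_rating"], row["ratings"], row["published"])
```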