
Regular expressions on Beautiful Soup output

  •  1
  • LetzerWille  · Tech Community  · 6 years ago

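    I am trying to get the lines that contain the word "billion" from an
    HTML page processed by BS, but I get an empty list. By the way, these
    lines sit between <li> tags, and I also tried
    soup.findAll("<li>", {"class": "tabcontent"})

    but that gave me an empty list as well.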

    import requests
    from bs4 import BeautifulSoup
    import re
    
    url = 'http://www.worldstopexports.com/united-states-top-10-exports/'
    
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    
    page = requests.get (url, headers=header)
    
    soup = BeautifulSoup (page.text, 'lxml')
    
    table = soup.find_all (class_='tabcontent')[0].text
    
    print(re.findall(r'^.*? billion', table))
    
    print(table)
    
    
    
    Machinery including computers: US$201.7 billion (13% of total exports)
    Electrical machinery, equipment: $174.2 billion (11.3%)
    Mineral fuels including oil: $138 billion (8.9%)
    Aircraft, spacecraft: $131.2 billion (8.5%)
    Vehicles: $130.1 billion (8.4%)
    Optical, technical, medical apparatus: $83.6 billion (5.4%)
    Plastics, plastic articles: $61.5 billion (4%)
    Gems, precious metals: $60.4 billion (3.9%)
    Pharmaceuticals: $45.1 billion (2.9%)
    Organic chemicals: $36.2 billion (2.3%)
    
    3 replies  |  latest 6 years ago
        1
  •  3
  •   Jan    6 years ago

    You can use select() to get the tab first, then pick its li children by their text:

    # ... right under soup = BeautifulSoup (page.text, 'lxml') ...
    # select the first tab
    tab = soup.select('div.tabcontent')[0]
    
    # select its items
    items = [text 
        for item in tab.select('li') 
        for text in [item.text] 
        if "billion" in text]
    print(items)
    

    This yields

    ['Machinery including computers: US$201.7 billion (13% of total exports)', 'Electrical machinery, equipment: $174.2 billion (11.3%)', 'Mineral fuels including oil: $138 billion (8.9%)', 'Aircraft, spacecraft: $131.2 billion (8.5%)', 'Vehicles: $130.1 billion (8.4%)', 'Optical, technical, medical apparatus: $83.6 billion (5.4%)', 'Plastics, plastic articles: $61.5 billion (4%)', 'Gems, precious metals: $60.4 billion (3.9%)', 'Pharmaceuticals: $45.1 billion (2.9%)', 'Organic chemicals: $36.2 billion (2.3%)']
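
    As for why the attempt in the question came back empty, here is a minimal sketch on a hypothetical, simplified stand-in for the page's markup (the real page is certainly more complex). It assumes, as the div.tabcontent selector above suggests, that the tabcontent class sits on the enclosing div rather than on the li elements. Tag names are also passed without angle brackets.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real page.
html = """
<div class="tabcontent">
  <ol>
    <li>Vehicles: $130.1 billion (8.4%)</li>
    <li>Pharmaceuticals: $45.1 billion (2.9%)</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# No tag is literally named "<li>", so this finds nothing.
with_brackets = soup.find_all("<li>", {"class": "tabcontent"})

# Correct tag name, but here the class belongs to the div, not the li.
wrong_class = soup.find_all("li", {"class": "tabcontent"})

# Select the container first, then filter its li children by text.
tab = soup.select('div.tabcontent')[0]
items = [li.text for li in tab.select('li') if "billion" in li.text]

print(with_brackets)
print(wrong_class)
print(items)
```

    The parser is swapped to the built-in 'html.parser' only so the sketch runs without lxml installed.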
    
        2
  •  2
  •   Martijn Pieters    6 years ago

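    Your mistake is using .* here: the dot does not normally match newlines, and the table string contains newlines between its start and the word billion. Without a flag, ^ anchors only at the very start of the string, so the pattern can never reach a matching line. If you want to use a regular expression, at the very least use the re.MULTILINE flag so that ^ also matches right after each newline: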

    >>> re.findall(r'^.*billion', table, flags=re.MULTILINE)
    ['Machinery including computers: US$201.7 billion',
     'Electrical machinery, equipment: $174.2 billion',
     'Mineral fuels including oil: $138 billion',
     'Aircraft, spacecraft: $131.2 billion',
     'Vehicles: $130.1 billion',
     'Optical, technical, medical apparatus: $83.6 billion',
     'Plastics, plastic articles: $61.5 billion',
     'Gems, precious metals: $60.4 billion',
     'Pharmaceuticals: $45.1 billion',
     'Organic chemicals: $36.2 billion']
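
    To see the difference without fetching the page, the following sketch runs the question's original pattern on a short hypothetical stand-in for the table string. The three lines are copied from the output in the question; the leading newline is an assumption about how the scraped text begins.

```python
import re

# Hypothetical stand-in for the scraped text; note the leading newline.
table = (
    "\n"
    "Machinery including computers: US$201.7 billion (13% of total exports)\n"
    "Electrical machinery, equipment: $174.2 billion (11.3%)\n"
    "Mineral fuels including oil: $138 billion (8.9%)\n"
)

# Without re.MULTILINE, ^ anchors only at the very start of the string,
# and . cannot cross the leading newline, so nothing matches.
without_flag = re.findall(r'^.*? billion', table)

# With re.MULTILINE, ^ also matches immediately after each newline.
with_flag = re.findall(r'^.*? billion', table, flags=re.MULTILINE)

print(without_flag)
print(with_flag)
```

    The first call prints an empty list, while the second returns one entry per line, each ending at the word billion.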
    

    However, since you are looking for text inside li elements, why not select those elements directly?

    soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
    

    Passing a regular expression pattern to string filters the elements by their content. This gives you the matching elements:

    >>> soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
    [<li>Machinery including computers: US$201.7 billion (13% of total exports)</li>,
     <li>Electrical machinery, equipment: $174.2 billion (11.3%)</li>,
     <li>Mineral fuels including oil: $138 billion (8.9%)</li>,
     <li>Aircraft, spacecraft: $131.2 billion (8.5%)</li>,
     <li>Vehicles: $130.1 billion (8.4%)</li>,
     <li>Optical, technical, medical apparatus: $83.6 billion (5.4%)</li>,
     <li>Plastics, plastic articles: $61.5 billion (4%)</li>,
     <li>Gems, precious metals: $60.4 billion (3.9%)</li>,
     <li>Pharmaceuticals: $45.1 billion (2.9%)</li>,
     <li>Organic chemicals: $36.2 billion (2.3%)</li>]
    

    You can always call .get_text() on these elements if you only want their content.
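
    A minimal sketch of that last step, on hypothetical inline markup rather than the live page (the third li is invented to show the filter dropping a non-matching item), with 'html.parser' swapped in so lxml is not required:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real page.
html = """
<div class="tabcontent">
  <ol>
    <li>Vehicles: $130.1 billion (8.4%)</li>
    <li>Pharmaceuticals: $45.1 billion (2.9%)</li>
    <li>See the notes below</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Keep only the li elements whose text matches the pattern.
matches = soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))

# Convert the matching elements to plain strings.
texts = [li.get_text() for li in matches]
print(texts)
```

    Note that the string filter looks at an element's own single text child, so an li that wraps its text in further tags may need a different test.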

        3
  •  1
  •   SIM    6 years ago

    Another approach is as follows:

    import requests
    from bs4 import BeautifulSoup
    
    URL = 'http://www.worldstopexports.com/united-states-top-10-exports/'
    soup = BeautifulSoup(requests.get(URL,headers={"User-Agent":"Mozilla/5.0"}).text, 'lxml')
    table = soup.find(class_='tabcontent')
    data =  '\n'.join([item.text for item in table.find_all("li")])
    print(data)
    

    Output:

    Machinery including computers: US$201.7 billion (13% of total exports)
    Electrical machinery, equipment: $174.2 billion (11.3%)
    Mineral fuels including oil: $138 billion (8.9%)
    Aircraft, spacecraft: $131.2 billion (8.5%)
    Vehicles: $130.1 billion (8.4%)
    Optical, technical, medical apparatus: $83.6 billion (5.4%)
    Plastics, plastic articles: $61.5 billion (4%)
    Gems, precious metals: $60.4 billion (3.9%)
    Pharmaceuticals: $45.1 billion (2.9%)
    Organic chemicals: $36.2 billion (2.3%)