
How can I use Python and BeautifulSoup to return text from HTML without the tags?

  • Justin Hill  ·  7 years ago

    I have been trying to return text from a website. I am trying to return ownerId and unitId from the example below. Any help is greatly appreciated.

    <script>
        h1.config.days = "7";
        h1.config.hours = "24";
        h1.config.color = "blue";
        h1.config.ownerId = 7321;
        h1.config.locationId = 1258;
        h1.config.unitId = "164";
    </script>
    
    1 Answer
  •   coder    7 years ago

    You can use Beautiful Soup like this:

    #!/usr/bin/env python
    
    from bs4 import BeautifulSoup
    
    html = '''
    <script>
        h1.config.days = "7";
        h1.config.hours = "24";
        h1.config.color = "blue";
        h1.config.ownerId = 7321;
        h1.config.locationId = 1258;
        h1.config.unitId = "164";
    </script>
    '''
    
    soup = BeautifulSoup(html, "html.parser")
    jsinfo = soup.find("script")
    
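    # Collect every "h1.config.<name> = <value>;" assignment into a dict.
    d = {}
    for line in jsinfo.text.split('\n'):
        try:
            key, value = line.split('=')[0], line.split('=')[1]
            d[key.strip().replace('h1.config.', '')] = value.lstrip().rstrip(';')
        except IndexError:
            # Lines without '=' (for example blank lines) are skipped.
            pass
    
    print('OwnerId:  {}'.format(d['ownerId']))
    print('UnitId:   {}'.format(d['unitId']))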
    

    OwnerId:  7321
    UnitId:   "164"
    

    Any other value can be looked up the same way, with d['variable'].
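
    The values stored in d keep their JavaScript quoting, which is why UnitId prints as "164" above. If that is not wanted, the following is a minimal sketch of a regex-based variant that strips the quotes. It assumes every assignment follows the h1.config.name = value; pattern from the question, and parse_config is a hypothetical helper name, not part of Beautiful Soup. All values stay strings.

```python
import re

# Matches assignments such as: h1.config.unitId = "164";
ASSIGNMENT = re.compile(r'h1\.config\.(\w+)\s*=\s*(.+?);')

def parse_config(script_text):
    """Return {name: value}, with surrounding double quotes removed."""
    return {name: value.strip().strip('"')
            for name, value in ASSIGNMENT.findall(script_text)}

# Sample text taken from the <script> block in the question.
script_text = '''
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.unitId = "164";
'''

config = parse_config(script_text)
print('OwnerId:  {}'.format(config['ownerId']))
print('UnitId:   {}'.format(config['unitId']))
```

    In the answer's code, the same function could be called on jsinfo.text instead of the inline sample.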

    Update

    Now, if you have to deal with multiple <script> tags, you can collect all of them so that you can iterate over them:

    jsinfo = soup.find_all("script")
    

    jsinfo is now of type <class 'bs4.element.ResultSet'>, which you can iterate over like a normal list.
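
    Because the result behaves like a list, the script tag of interest can be picked by its content rather than by its position. This is only a sketch under assumptions: the two-tag HTML sample is made up for illustration, and "h1.config" is used as the marker because it appears in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one unrelated script tag and one config script tag.
html = '''
<script>var unrelated = true;</script>
<script>
    h1.config.ownerId = 7321;
    h1.config.unitId = "164";
</script>
'''

soup = BeautifulSoup(html, "html.parser")

# tag.string is None for a script tag with no inline text (for example one
# that only has a src attribute), so fall back to an empty string.
matches = [tag for tag in soup.find_all("script")
           if "h1.config" in (tag.string or "")]

print(len(matches))
```

    Filtering this way avoids depending on how many script tags come before the one you need.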

    For example, to extract lat and lon, you could do something like this:

    #!/usr/bin/env python
    
    from bs4 import BeautifulSoup
    import requests
    
    url = 'https://www.your_url'
    # the user-agent you specified in the comments
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}
    
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, "html.parser")
    jsinfo = soup.find_all("script")
    
    list_of_interest = ['hl.config.lat', 'hl.config.lon']
    
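    d = {}
    # Index 9 is specific to the target page: it selects the tenth <script> tag.
    for line in jsinfo[9].text.split('\n'):
        # Only parse lines that mention one of the variables of interest.
        if any(word in line for word in list_of_interest):
            k, v = line.strip().replace('hl.config.', '').split(' = ')
            d[k] = v.strip(';')
    
    print('Lat => {}'.format(d['lat']))
    print('Lon => {}'.format(d['lon']))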
    

    Lat => "28.06794"
    Lon => "-81.754349"
    

    You can extract more variables by adding them to list_of_interest.