
Scraping a table with BeautifulSoup4: missing cells

addohm · asked 5 years ago

    I'm running into some odd behavior with BS4. I mirrored 20 pages of a site, and this code works fine against that copy on my private web server. When I run it against the real site, though, it randomly misses the 8th column of a row. I haven't run into this before and can't find any other posts about it. The 8th column is 'frequency rank'. Why does this happen only on the last column, and how can I fix it?
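
    A quick way to see which rows come back short, reusing the soup_the_page() helper from the script below (the expectation of 8 cells per row is an assumption about this table's layout):

    # Count the <td> cells BeautifulSoup parses in each row of page 1;
    # rows with fewer than 8 cells are the ones losing 'frequency rank'.
    soup = soup_the_page(1)
    for n, trow in enumerate(soup.find_all('tr')[1:], start=1):
        cells = trow.find_all('td')
        if len(cells) != 8:
            print('row', n, 'parsed with only', len(cells), 'cells')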

    import requests
    import json
    from bs4 import BeautifulSoup
    
    base_url = 'http://hanzidb.org'
    
    
    def soup_the_page(page_number):
        url = base_url + '/character-list/by-frequency?page=' + str(page_number)    
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    
    
    def get_max_page(soup):
        paging = soup.find_all("p", {'class': 'rigi'})
        # Isolate the first paging link
        paging_link = paging[0].find_all('a')
        # Extract the last page number of the series
        max_page_num = int([item.get('href').split('=')[-1] for item in paging_link][-1])
        return max_page_num
    
    
    def crawl_hanzidb():
        result = {}
    
        # Get the page scrape data
        page_content = soup_the_page(1)
        # Get the page number of the last page
        last_page = get_max_page(page_content)
        # Get the table data
        for p in range(1, last_page + 1):
            page_content = soup_the_page(p)
            for trow in page_content.find_all('tr')[1:]:
                char_dict = {}
                i = 0
                # Set the character as the dict key
                character = trow.contents[0].text
                # Initialize list on dict key
                result[character] = []
                # Return list of strings from trow.children to parse urls
                for tcell in trow.children:
                    char_position = 0
                    radical_position = 3
                    if i == char_position or i == radical_position:
                        for content in tcell.children:
                            if type(content).__name__ == 'Tag':
                                if 'href' in content.attrs:
                                    url = base_url + content.attrs.get('href')
                                    if i == char_position:
                                        char_dict['char_url'] = url
                                    if i == radical_position:
                                        char_dict['radical_url'] = url
                    i += 1
                char_dict['radical'] = trow.contents[3].text[:1]
                char_dict['pinyin'] = trow.contents[1].text
                char_dict['definition'] = trow.contents[2].text
                char_dict['hsk_level'] = trow.contents[5].text[:1] if trow.contents[5].text[:1].isdigit() else ''
                char_dict['frequency_rank'] = trow.contents[7].text if trow.contents[7].text.isdigit() else ''
                result[character].append(char_dict)
            print('Progress: page ' + str(p) + ' of ' + str(last_page) + '.')
        return result
    
    
    crawl_data = crawl_hanzidb()
    with open('hanzidb.json', 'w') as f:
        json.dump(crawl_data, f, indent=2, ensure_ascii=False)
    
    1 Answer
  JoshG · answered 5 years ago

    The problem appears to be that the site's HTML is malformed. If you look at the source of the page you posted, there are two closing </td> tags before the 'frequency rank' column. Example:

    <tr>
        <td><a href="/character/的">的</a></td>
        <td>de</td><td><span class="smmr">possessive, adjectival suffix</span></td>
        <td><a href="/character/白" title="Kangxi radical 106">白</a>&nbsp;106.3</td>
        <td>8</td><td>1</td>
        <td>1155</td></td>
        <td>1</td>
     </tr>
    

    I think this causes problems for the parser you are using ( html.parser ). It seems to work if you install the lxml parser instead.
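
    For illustration, here is a minimal reproduction comparing how the two parsers handle the stray closing tag. The exact tree each parser builds can vary with your BeautifulSoup and lxml versions, so treat the output as indicative:

    from bs4 import BeautifulSoup

    # A single row containing the extra </td> seen in the page source above.
    broken_row = '<table><tr><td>1155</td></td><td>1</td></tr></table>'

    # Compare what each parser recovers; lxml must be installed for the second pass.
    for parser in ('html.parser', 'lxml'):
        cells = BeautifulSoup(broken_row, parser).find_all('td')
        print(parser, '->', [cell.text for cell in cells])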

    Try this:

    First , install the lxml parser...

    pip install lxml
    

    Then , change this line in your soup_the_page() function:

    soup = BeautifulSoup(response.content, 'lxml')
    

    Then run the script. It appears to work: print(trow.contents[7].text) no longer raises an index out of range error.
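
    If you want the script to degrade gracefully when lxml is not installed, one possible sketch (bs4.FeatureNotFound is the exception BeautifulSoup raises when a requested parser is unavailable; base_url and requests come from the question's script, and the html.parser fallback will reintroduce the missing-column behavior on this site):

    from bs4 import BeautifulSoup, FeatureNotFound

    def soup_the_page(page_number):
        url = base_url + '/character-list/by-frequency?page=' + str(page_number)
        response = requests.get(url, timeout=5)
        try:
            # lxml tolerates the stray </td> tags in this site's markup
            return BeautifulSoup(response.content, 'lxml')
        except FeatureNotFound:
            # lxml not installed: fall back to html.parser, which reintroduces
            # the missing 'frequency rank' cells on the malformed rows
            return BeautifulSoup(response.content, 'html.parser')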