
BeautifulSoup output does not match Chrome's inspector when web scraping with Python

  •  2
  • user9694066  · 6 years ago

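    I am currently trying to scrape protein sequences from the NCBI protein database. At this point, the user can search for a protein and I can get the link to the first result the database returns. However, when I run that page through Beautiful Soup, the soup does not match what Chrome's Inspect Element shows, and it does not contain the sequence.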

    This is my current code:

    import string
    import requests
    from bs4 import BeautifulSoup
    
    def getSequence():
        searchProt = input("Enter a Protein Name!:")
        if searchProt != '':
            searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
            page = requests.get(searchString)
            soup = BeautifulSoup(page.text, 'html.parser')
            soup = str(soup)
            accIndex = soup.find("a")
            accessionStart = soup.find('<dd>',accIndex)
            accessionEnd = soup.find('</dd>', accessionStart + 4)
            accession = soup[accessionStart + 4: accessionEnd]
            newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
            try:
                newPage = requests.get(newSearchString)
                #This is where it fails
                newSoup = BeautifulSoup(newPage.text, 'html.parser')
                aaList = []
                spaceCount = newSoup.count("ff_line")
                print(spaceCount)
                for i in range(spaceCount):
                    startIndex = newSoup.find("ff_line")
                    startIndex = newSoup.find(">", startIndex) + 2
                    nextAA = newSoup[startIndex]
                    while nextAA in string.ascii_lowercase:
                        aaList.append(nextAA)
                        startIndex += 1
                        nextAA = newSoup[startIndex]
                return aaList        
            except:
                print("Please Enter a Valid Protein")
    

    I have been testing it with the search term "p53", which gives the following link: here

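    I have gone through a long list of web scraping posts on this site and tried many things, including installing Selenium and using different parsers. I still don't understand why these don't match. (Sorry if this is a duplicate question; I am new to web scraping and currently have a concussion, so I am looking for some case-specific feedback.)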

    1 answer
  •  1
  •   Mihai Chelaru klin  · 6 years ago

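    This code uses Selenium to extract the protein sequence you want. I modified your original code to get the result you were after. The key change is that the record page is loaded in a real browser (driver.get) instead of with requests, presumably because the sequence is filled in by JavaScript after the initial HTML arrives, which would explain why the requests response lacks what Chrome's inspector shows.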

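    from bs4 import BeautifulSoup
    from selenium import webdriver
    import requests
    
    # Start a real browser so the page's JavaScript can run.
    driver = webdriver.Firefox()
    
    def getSequence():
        searchProt = input("Enter a Protein Name!:")
        if searchProt != '':
            # requests is kept for the search results page, as in the original code.
            searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
            page = requests.get(searchString)
            soup = BeautifulSoup(page.text, 'html.parser')
            # Work on the markup as a plain string; find() below is str.find.
            soup = str(soup)
            accIndex = soup.find("a")
            # The accession is taken from the first <dd>...</dd> pair.
            accessionStart = soup.find('<dd>',accIndex)
            accessionEnd = soup.find('</dd>', accessionStart + 4)
            accession = soup[accessionStart + 4: accessionEnd]
            newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
            try:
                # Load the record page in the browser and read the rendered DOM.
                driver.get(newSearchString)
                html = driver.page_source
                newSoup = BeautifulSoup(html, "lxml")
                # Each sequence line sits in an element with class "ff_line".
                ff_tags = newSoup.find_all(class_="ff_line")
                aaList = []
                for tag in ff_tags:
                    # Drop the spaces that separate the blocks of residues.
                    aaList.append(tag.text.strip().replace(" ",""))
                protSeq = "".join(aaList)
                return protSeq
            except Exception:
                print("Please Enter a Valid Protein")
    
    sequence = getSequence()
    print(sequence)
    # Close the browser once the sequence has been printed.
    driver.quit()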
    

    An input of "p53" produces the following output:

    meepqsdlsielplsqetfsdlwkllppnnvlstlpssdsieelflsenvtgwledsggalqgvaaaaastaedpvtetpapvasapatpwplsssvpsyktfqgdygfrlgflhsgtaksvtctyspslnklfcqlaktcpvqlwvnstpppgtrvramaiykklqymtevvrrcphherssegdslappqhlirvegnlhaeylddkqtfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledpsgnllgrnsfevricacpgrdrrteeknfqkkgepcpelppksakralptntssspppkkktldgeyftlkirgherfkmfqelnealelkdaqaskgsedngahssylkskkgqsasrlkklmikregpdsd
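
    The extraction step above can also be checked offline, without a browser or network access. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party packages, and the names FFLineCollector and extract_sequence, as well as both HTML fragments, are made up for illustration (the fragments mimic the structure the answer relies on and are not copied from NCBI). It assumes the ff_line spans are not nested inside one another.

```python
from html.parser import HTMLParser


class FFLineCollector(HTMLParser):
    """Collect the text inside every <span class="ff_line"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside an ff_line span
        self.chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ff_line" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_sequence(html):
    """Return the concatenated ff_line text with all whitespace removed."""
    parser = FFLineCollector()
    parser.feed(html)
    parser.close()
    return "".join("".join(parser.chunks).split())


# Hypothetical fragments: the first mimics a page before its scripts have run,
# the second mimics the rendered DOM that the browser inspector shows.
static_html = '<div id="placeholder"></div>'
rendered_html = (
    '<pre>1 <span class="ff_line" id="x_1">meepqsdlsi elplsqetfs</span>\n'
    '21 <span class="ff_line" id="x_21">dlwkllppnn</span></pre>'
)

print(repr(extract_sequence(static_html)))  # ''
print(extract_sequence(rendered_html))      # meepqsdlsielplsqetfsdlwkllppnn
```

    If extract_sequence returns an empty string for the text that requests.get gives you but a sequence for driver.page_source, that suggests the sequence is added by scripts after the initial HTML is delivered.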