
Unable to modify my script to limit the number of requests while scraping

  •  2
  •  SIM  ·  6 years ago

    I've written a script in Python using Thread to handle multiple requests concurrently and speed up the scraping process. The script is performing its task accordingly.

    In short, what the scraper does: it parses all the links from the landing page leading to each item's main page (where the information is stored), and then scrapes the happy hours and featured special from there. The scraper keeps going until all 29 pages have been crawled.

    Since there may be a lot of links to work through, I would like to limit the number of requests. However, as I don't know much about this, I can't find a way to modify my existing script to achieve that.

    Any help will be greatly appreciated.

    This is my attempt so far:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import threading
    
    url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"
    
    def get_info(link):
        for mlink in [link.format(page) for page in range(1,30)]:
            response = requests.get(mlink)
            soup = BeautifulSoup(response.text,"lxml")
            itemlinks = [urljoin(link,container.select_one("h2.name a").get("href")) for container in soup.select(".profile")]
            threads = []
            for ilink in itemlinks:
                thread = threading.Thread(target=fetch_info,args=(ilink,))
                thread.start()
                threads+=[thread]
    
            for thread in threads:
                thread.join()
    
    def fetch_info(nlink):
        response = requests.get(nlink)
        soup = BeautifulSoup(response.text,"lxml")
        for container in soup.select(".specials"):
            try:
                hours = container.select_one("h3").text
            except Exception: hours = ""
            try:
                fspecial = ' '.join([item.text for item in container.select(".special")])
            except Exception: fspecial = ""
            print(f'{hours}---{fspecial}')
    
    if __name__ == '__main__':
        get_info(url)
    
    3 Answers  |  6 years ago
        1
  •  2
  •   SocketPlayer    6 years ago

    You should take a look at asyncio; it's very simple and can help you get the job done faster!

    Also, multiprocessing.Pool can simplify your code (in case you don't want to use asyncio). multiprocessing.pool also has a ThreadPool equivalent if you prefer to use threads.

    Regarding the request limit, I recommend using threading.Semaphore (or any other semaphore, in case you switch away from threading).

    Threading approach:

    from multiprocessing.pool import ThreadPool as Pool
    from threading import Semaphore
    from time import sleep
    
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_THREADS = 10
    
    sm = Semaphore(MAX_RUN_AT_ONCE)
    
    
    def do_task(number):
        with sm:
            print(f"run with {number}")
            sleep(3)
            return number * 2
    
    
    def main():
    
        p = Pool(NUMBER_OF_THREADS)
        results = p.map(do_task, range(10))
        print(results)
    
    
    if __name__ == '__main__':
        main()
    

    Multiprocessing approach:

    from multiprocessing import Pool
    from multiprocessing import Semaphore
    from time import sleep
    
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_PROCESS = 10
    
    semaphore = None
    
    def initializer(sm):
        """init the semaphore for the child process"""
        global semaphore
        semaphore = sm
    
    
    def do_task(number):
        with semaphore:
            print(f"run with {number}\n")
            sleep(3)
            return number * 2
    
    
    def main():
        sm = Semaphore(MAX_RUN_AT_ONCE)
        p = Pool(NUMBER_OF_PROCESS, initializer=initializer,
                 initargs=[sm])
    
        results = p.map(do_task, range(10))
        print(results)
    
    
    if __name__ == '__main__':
        main()
    

    Asyncio approach:

    import asyncio
    
    
    MAX_RUN_AT_ONCE = 5
    sm = asyncio.Semaphore(MAX_RUN_AT_ONCE)
    
    async def do_task(number):
        async with sm:
            print(f"run with {number}\n")
            await asyncio.sleep(3)
            return number * 2
    
    async def main():
        # wrap the coroutines in tasks before waiting on them
        tasks = [asyncio.ensure_future(do_task(number)) for number in range(10)]
        finished, _ = await asyncio.wait(tasks)
        print([fut.result() for fut in finished])
    
    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main())
    

    For making HTTP requests with asyncio you should use aiohttp; you could also use requests with loop.run_in_executor, but then there is little point in using asyncio, since all your code would still just be blocking requests.
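
    As a rough illustration of that point, here is a minimal sketch of the same semaphore-limited pattern using aiohttp instead of the sleep-based do_task; the example URLs and the MAX_RUN_AT_ONCE value are placeholders, and it assumes aiohttp is installed.

    import asyncio
    import aiohttp


    MAX_RUN_AT_ONCE = 5  # at most 5 requests in flight at any time


    async def fetch(sm, session, link):
        # the semaphore caps how many coroutines may hold a connection at once
        async with sm:
            async with session.get(link) as response:
                return await response.text()


    async def main(links):
        sm = asyncio.Semaphore(MAX_RUN_AT_ONCE)
        async with aiohttp.ClientSession() as session:
            pages = await asyncio.gather(*(fetch(sm, session, link) for link in links))
        print([len(page) for page in pages])


    if __name__ == '__main__':
        # placeholder URLs; substitute the item links your scraper actually collects
        links = ["https://www.example.com/?page={}".format(n) for n in range(1, 30)]
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main(links))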

    Output of the do_task examples above:

    run with 0
    run with 1
    run with 2
    run with 3
    run with 4

    (there is a pause here due to the semaphore and the sleep)

    run with 5
    run with 6
    run with 7
    run with 8
    run with 9

    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

    You can also check out ThreadPoolExecutor.
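
    For completeness, here is a minimal sketch of the same idea with concurrent.futures.ThreadPoolExecutor, where max_workers alone acts as the request limit; the URL list is only a placeholder.

    from concurrent.futures import ThreadPoolExecutor
    import requests


    MAX_RUN_AT_ONCE = 5


    def fetch(link):
        # each worker thread runs one blocking request at a time,
        # so at most MAX_RUN_AT_ONCE requests are in flight concurrently
        return requests.get(link).status_code


    if __name__ == '__main__':
        # placeholder URLs; substitute the item links your scraper actually collects
        links = ["https://www.example.com/?page={}".format(n) for n in range(1, 30)]
        with ThreadPoolExecutor(max_workers=MAX_RUN_AT_ONCE) as executor:
            print(list(executor.map(fetch, links)))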

        2
  •  2
  •   SIM    6 years ago

    Since I'm a newbie to multiprocessing, I was hoping for a real-life script in order to understand the logic very clearly. The site used in my script has some bot-protection mechanism. However, I found a very similar webpage to apply multiprocessing to.

    import requests
    from multiprocessing import Pool
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    
    url = "http://srar.com/roster/index.php?agent_search={}"
    
    def get_links(link):
        completelinks = []
        for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
            res = requests.get(link.format(ilink))  
            soup = BeautifulSoup(res.text,'lxml')
            for items in soup.select("table.border tr"):
                if not items.select("td a[href^='index.php?agent']"):continue
                data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
                completelinks.extend(data)
        return completelinks
    
    def get_info(nlink):
        req = requests.get(nlink)
        sauce = BeautifulSoup(req.text,"lxml")
        for tr in sauce.select("table[style$='1px;'] tr"):
            table = [td.get_text(strip=True) for td in tr.select("td")]
            print(table)
    
    if __name__ == '__main__':
        allurls = get_links(url)
        with Pool(10) as p:  ## this is the number responsible for limiting the number of requests
            p.map(get_info,allurls)
    
        3
  •  0
  •   MITHU    6 years ago

    Although I'm not sure whether I could implement ThreadPool in the script below the way it's described in SocketPlayer's answer, it seems to work flawlessly. Feel free to correct me if I've gone wrong anywhere.

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    from multiprocessing.pool import ThreadPool as Pool
    from threading import Semaphore
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_THREADS = 10
    
    sm = Semaphore(MAX_RUN_AT_ONCE)
    
    url = "http://srar.com/roster/index.php?agent_search={}"
    
    def get_links(link):
        with sm:
            completelinks = []
            for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
                res = requests.get(link.format(ilink))  
                soup = BeautifulSoup(res.text,'lxml')
                for items in soup.select("table.border tr"):
                    if not items.select("td a[href^='index.php?agent']"):continue
                    data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
                    completelinks.extend(data)
            return completelinks
    
    def get_info(nlink):
        req = requests.get(nlink)
        sauce = BeautifulSoup(req.text,"lxml")
        for tr in sauce.select("table[style$='1px;'] tr")[1:]:
            table = [td.get_text(strip=True) for td in tr.select("td")]
            print(table)
    
    if __name__ == '__main__':
        p = Pool(NUMBER_OF_THREADS)
        p.map(get_info, get_links(url))