
Unable to modify my script to limit the number of requests while scraping

  •  2
  •  SIM  ·  6 years ago

    I've written a script in Python using Thread to handle multiple requests concurrently and speed up the scraping process. The script is performing its task accordingly.

    In short, what the scraper does: it parses all the links from the landing page leading to each item's main page (where the information is stored), and then scrapes the happy hours and featured special from there. The scraper keeps going until all 29 pages have been crawled.

    Since there may be a lot of links to work through, I would like to limit the number of requests. However, as I don't know much about this, I can't find a way to modify my existing script to achieve that.

    Any help will be greatly appreciated.

    This is my attempt so far:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import threading
    
    url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"
    
    def get_info(link):
        for mlink in [link.format(page) for page in range(1,30)]:
            response = requests.get(mlink)
            soup = BeautifulSoup(response.text,"lxml")
            itemlinks = [urljoin(link,container.select_one("h2.name a").get("href")) for container in soup.select(".profile")]
            threads = []
            for ilink in itemlinks:
                thread = threading.Thread(target=fetch_info,args=(ilink,))
                thread.start()
                threads+=[thread]
    
            for thread in threads:
                thread.join()
    
    def fetch_info(nlink):
        response = requests.get(nlink)
        soup = BeautifulSoup(response.text,"lxml")
        for container in soup.select(".specials"):
            try:
                hours = container.select_one("h3").text
            except Exception: hours = ""
            try:
                fspecial = ' '.join([item.text for item in container.select(".special")])
            except Exception: fspecial = ""
            print(f'{hours}---{fspecial}')
    
    if __name__ == '__main__':
        get_info(url)
    
    3 Answers  |  6 years ago
        1
  •  2
  •   SocketPlayer    6 years ago

    You should take a look at asyncio; it's very simple and can help you get the job done faster!

    Also, multiprocessing.Pool can simplify your code (in case you don't want to use asyncio). multiprocessing.pool also has a ThreadPool equivalent if you prefer to use threads.

    Regarding the request limit, I recommend using threading.Semaphore (or any other semaphore, in case you switch away from threading).

    Threading approach:

    from multiprocessing.pool import ThreadPool as Pool
    from threading import Semaphore
    from time import sleep
    
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_THREADS = 10
    
    sm = Semaphore(MAX_RUN_AT_ONCE)
    
    
    def do_task(number):
        with sm:
            print(f"run with {number}")
            sleep(3)
            return number * 2
    
    
    def main():
    
        p = Pool(NUMBER_OF_THREADS)
        results = p.map(do_task, range(10))
        print(results)
    
    
    if __name__ == '__main__':
        main()
    

    Multiprocessing approach:

    from multiprocessing import Pool
    from multiprocessing import Semaphore
    from time import sleep
    
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_PROCESS = 10
    
    semaphore = None
    
    def initializer(sm):
        """init the semaphore for the child process"""
        global semaphore
        semaphore = sm
    
    
    def do_task(number):
        with semaphore:
            print(f"run with {number}\n")
            sleep(3)
            return number * 2
    
    
    def main():
        sm = Semaphore(MAX_RUN_AT_ONCE)
        p = Pool(NUMBER_OF_PROCESS, initializer=initializer,
                 initargs=[sm])
    
        results = p.map(do_task, range(10))
        print(results)
    
    
    if __name__ == '__main__':
        main()
    

    Asyncio approach:

    import asyncio
    
    
    MAX_RUN_AT_ONCE = 5
    sm = asyncio.Semaphore(MAX_RUN_AT_ONCE)
    
    async def do_task(number):
        async with sm:
            print(f"run with {number}\n")
            await asyncio.sleep(3)
            return number * 2
    
    async def main():
        # wrap the coroutines in tasks before waiting on them
        tasks = [asyncio.ensure_future(do_task(number)) for number in range(10)]
        finished, _ = await asyncio.wait(tasks)
        print([fut.result() for fut in finished])
    
    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main())
    

    For making HTTP requests with asyncio you should use aiohttp; you could also use requests with loop.run_in_executor, but then there is little point in using asyncio, since all your code would still just be blocking requests.
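
    As a rough illustration of that point, here is a minimal sketch of the same semaphore-limited pattern using aiohttp instead of the sleep-based do_task; the example URLs and the MAX_RUN_AT_ONCE value are placeholders, and it assumes aiohttp is installed.

    import asyncio
    import aiohttp


    MAX_RUN_AT_ONCE = 5  # at most 5 requests in flight at any time


    async def fetch(sm, session, link):
        # the semaphore caps how many coroutines may hold a connection at once
        async with sm:
            async with session.get(link) as response:
                return await response.text()


    async def main(links):
        sm = asyncio.Semaphore(MAX_RUN_AT_ONCE)
        async with aiohttp.ClientSession() as session:
            pages = await asyncio.gather(*(fetch(sm, session, link) for link in links))
        print([len(page) for page in pages])


    if __name__ == '__main__':
        # placeholder URLs; substitute the item links your scraper actually collects
        links = ["https://www.example.com/?page={}".format(n) for n in range(1, 30)]
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main(links))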

    Output of the do_task examples above:

    run with 0
    run with 1
    run with 2
    run with 3
    run with 4

    (there is a pause here due to the semaphore and the sleep)

    run with 5
    run with 6
    run with 7
    run with 8
    run with 9

    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

    You can also check out ThreadPoolExecutor.
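
    For completeness, here is a minimal sketch of the same idea with concurrent.futures.ThreadPoolExecutor, where max_workers alone acts as the request limit; the URL list is only a placeholder.

    from concurrent.futures import ThreadPoolExecutor
    import requests


    MAX_RUN_AT_ONCE = 5


    def fetch(link):
        # each worker thread runs one blocking request at a time,
        # so at most MAX_RUN_AT_ONCE requests are in flight concurrently
        return requests.get(link).status_code


    if __name__ == '__main__':
        # placeholder URLs; substitute the item links your scraper actually collects
        links = ["https://www.example.com/?page={}".format(n) for n in range(1, 30)]
        with ThreadPoolExecutor(max_workers=MAX_RUN_AT_ONCE) as executor:
            print(list(executor.map(fetch, links)))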

        2
  •  2
  •   SIM    6 years ago

    Since I'm a newbie to multiprocessing, I was hoping for a real-life script in order to understand the logic very clearly. The site used in my script has some bot-protection mechanism. However, I found a very similar webpage to apply multiprocessing to.

    import requests
    from multiprocessing import Pool
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    
    url = "http://srar.com/roster/index.php?agent_search={}"
    
    def get_links(link):
        completelinks = []
        for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
            res = requests.get(link.format(ilink))  
            soup = BeautifulSoup(res.text,'lxml')
            for items in soup.select("table.border tr"):
                if not items.select("td a[href^='index.php?agent']"):continue
                data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
                completelinks.extend(data)
        return completelinks
    
    def get_info(nlink):
        req = requests.get(nlink)
        sauce = BeautifulSoup(req.text,"lxml")
        for tr in sauce.select("table[style$='1px;'] tr"):
            table = [td.get_text(strip=True) for td in tr.select("td")]
            print(table)
    
    if __name__ == '__main__':
        allurls = get_links(url)
        with Pool(10) as p:  ## this is the number responsible for limiting the number of requests
            p.map(get_info,allurls)
    
        3
  •  0
  •   MITHU    6 years ago

    Although I'm not sure whether I could implement ThreadPool in the script below the way it's described in SocketPlayer's answer, it seems to work flawlessly. Feel free to correct me if I've gone wrong anywhere.

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    from multiprocessing.pool import ThreadPool as Pool
    from threading import Semaphore
    
    MAX_RUN_AT_ONCE = 5
    NUMBER_OF_THREADS = 10
    
    sm = Semaphore(MAX_RUN_AT_ONCE)
    
    url = "http://srar.com/roster/index.php?agent_search={}"
    
    def get_links(link):
        with sm:
            completelinks = []
            for ilink in [chr(i) for i in range(ord('a'),ord('d')+1)]:
                res = requests.get(link.format(ilink))  
                soup = BeautifulSoup(res.text,'lxml')
                for items in soup.select("table.border tr"):
                    if not items.select("td a[href^='index.php?agent']"):continue
                    data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
                    completelinks.extend(data)
            return completelinks
    
    def get_info(nlink):
        req = requests.get(nlink)
        sauce = BeautifulSoup(req.text,"lxml")
        for tr in sauce.select("table[style$='1px;'] tr")[1:]:
            table = [td.get_text(strip=True) for td in tr.select("td")]
            print(table)
    
    if __name__ == '__main__':
        p = Pool(NUMBER_OF_THREADS)
        p.map(get_info, get_links(url))