
Scrapy: running multiple spiders via cmdline from the same Python process fails

  •  2
  •  KAs  ·  8 years ago

    Here is the code:

    from scrapy import cmdline

    if __name__ == '__main__':
        cmdline.execute("scrapy crawl spider_a -L INFO".split())
        cmdline.execute("scrapy crawl spider_b -L INFO".split())
    

    I intended to run these two spiders one after the other within the same Scrapy process, but it turns out that only the first spider runs successfully while the second one seems to be ignored. Any suggestions?

    2 Replies  |  8 years ago
        1
  •  2
  •   C. Feenstra    8 years ago

    Just do this:

    import subprocess

    # Run each spider in its own `scrapy crawl` process, one after the other, via a shell loop
    subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
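
    If you would rather stay in Python (the shell `for` loop above only works in a POSIX shell), the same idea can be written as a plain loop over subprocess calls; a minimal sketch, assuming the spiders are still named spider_a and spider_b:

    import subprocess

    # Launch one `scrapy crawl` subprocess per spider, waiting for each to finish
    for spider in ['spider_a', 'spider_b']:
        subprocess.call(['scrapy', 'crawl', spider, '-L', 'INFO'])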
    
        2
  •  0
  •   Clément Denoix    8 years ago

    From the Scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

    from scrapy.crawler import CrawlerProcess

    from .spiders import Spider1, Spider2

    process = CrawlerProcess()
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # the script will block here until all crawling jobs are finished
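
    Note that a bare CrawlerProcess() does not load your project's settings.py, so pipelines and middlewares defined there will not be applied. A small sketch of the same example with the project settings passed in, assuming the spiders live in your project's spiders module:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from .spiders import Spider1, Spider2

    # get_project_settings() reads settings.py so the usual pipelines/middlewares apply
    process = CrawlerProcess(get_project_settings())
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # blocks until both crawls have finished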
    

    Edit: if you want to run multiple spiders one after another, you can do the following:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()

    # Spider1..Spider4 stand in for your own spider classes
    spiders = [Spider1, Spider2, Spider3, Spider4]

    @defer.inlineCallbacks
    def join_spiders(spiders):
        """Set up a new runner for the provided spiders and wait until they finish."""
        runner = CrawlerRunner()

        # Add each spider to the current runner
        for spider in spiders:
            runner.crawl(spider)

        # This will fire once all the spiders inside the runner have finished
        yield runner.join()

    @defer.inlineCallbacks
    def crawl(group_by=2):
        # Run the spiders in groups of `group_by`, waiting for each group to finish
        for i in range(0, len(spiders), group_by):
            yield join_spiders(spiders[i:i + group_by])

        # When all the spiders have finished, stop the twisted reactor
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished
    

    I haven't tested all of this, though, so let me know whether it works!
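
    If you simply want the spiders to run strictly one after another (no grouping), the same documentation page shows a shorter pattern; a sketch, assuming Spider1 and Spider2 are your spider classes:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # Each yield waits for the previous crawl to finish before starting the next
        yield runner.crawl(Spider1)
        yield runner.crawl(Spider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl is finished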