
Scrapy: running multiple spiders via cmdline from the same Python process fails

  •  2
  •  KAs  ·  8 years ago

    Here is the code:

    from scrapy import cmdline

    if __name__ == '__main__':
        cmdline.execute("scrapy crawl spider_a -L INFO".split())
        cmdline.execute("scrapy crawl spider_b -L INFO".split())
    

    I intended to run these two spiders one after the other within the same Scrapy process, but it turns out that only the first spider runs successfully while the second one seems to be ignored. Any suggestions?

    2 Replies  |  8 years ago
        1
  •  2
  •   C. Feenstra    8 years ago

    Just do this:

    import subprocess

    # Run each spider in its own `scrapy crawl` process, one after the other, via a shell loop
    subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
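
    If you would rather stay in Python (the shell `for` loop above only works in a POSIX shell), the same idea can be written as a plain loop over subprocess calls; a minimal sketch, assuming the spiders are still named spider_a and spider_b:

    import subprocess

    # Launch one `scrapy crawl` subprocess per spider, waiting for each to finish
    for spider in ['spider_a', 'spider_b']:
        subprocess.call(['scrapy', 'crawl', spider, '-L', 'INFO'])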
    
        2
  •  0
  •   Clément Denoix    8 years ago

    From the Scrapy documentation: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

    from scrapy.crawler import CrawlerProcess

    from .spiders import Spider1, Spider2

    process = CrawlerProcess()
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # the script will block here until all crawling jobs are finished
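
    Note that a bare CrawlerProcess() does not load your project's settings.py, so pipelines and middlewares defined there will not be applied. A small sketch of the same example with the project settings passed in, assuming the spiders live in your project's spiders module:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from .spiders import Spider1, Spider2

    # get_project_settings() reads settings.py so the usual pipelines/middlewares apply
    process = CrawlerProcess(get_project_settings())
    process.crawl(Spider1)
    process.crawl(Spider2)
    process.start()  # blocks until both crawls have finished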
    

    Edit: if you want to run multiple spiders one after another, you can do the following:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()

    # Spider1..Spider4 stand in for your own spider classes
    spiders = [Spider1, Spider2, Spider3, Spider4]

    @defer.inlineCallbacks
    def join_spiders(spiders):
        """Set up a new runner for the provided spiders and wait until they finish."""
        runner = CrawlerRunner()

        # Add each spider to the current runner
        for spider in spiders:
            runner.crawl(spider)

        # This will fire once all the spiders inside the runner have finished
        yield runner.join()

    @defer.inlineCallbacks
    def crawl(group_by=2):
        # Run the spiders in groups of `group_by`, waiting for each group to finish
        for i in range(0, len(spiders), group_by):
            yield join_spiders(spiders[i:i + group_by])

        # When all the spiders have finished, stop the twisted reactor
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished
    

    I haven't tested all of this, though, so let me know whether it works!
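
    If you simply want the spiders to run strictly one after another (no grouping), the same documentation page shows a shorter pattern; a sketch, assuming Spider1 and Spider2 are your spider classes:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # Each yield waits for the previous crawl to finish before starting the next
        yield runner.crawl(Spider1)
        yield runner.crawl(Spider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl is finished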