
How to save downloaded files when running Scrapy on Scrapinghub?

  showkey · asked 5 years ago

    The spider stockInfo.py contains:

    import scrapy
    import re
    import pkgutil
    
    class QuotesSpider(scrapy.Spider):
        name = "stockInfo"
        # load the URL list bundled inside the "tutorial" package
        data = pkgutil.get_data("tutorial", "resources/urls.txt")
        data = data.decode()
        start_urls = data.split("\r\n")
    
        def parse(self, response):
            # use the six-digit stock code from the URL as the file name
            company = re.findall("[0-9]{6}", response.url)[0]
            filename = '%s_info.html' % company
            with open(filename, 'wb') as f:
                f.write(response.body)
    

    The stockInfo spider is run from the Windows command prompt:

    d:
    cd  tutorial
    scrapy crawl stockInfo
    

    Now the pages for all of the URLs in resources/urls.txt are downloaded into the directory d:/tutorial.
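
    For reference, resources/urls.txt is assumed to hold one URL per line, joined by Windows line endings (which is why start_urls splits on "\r\n"), each URL containing the six-digit code that parse() extracts. A hypothetical example:

    https://example.com/stock/600000.html
    https://example.com/stock/000001.html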

    Then the spider is deployed to Scrapinghub as the stockInfo spider.



    How do the following lines behave when the spider runs on Scrapinghub?

            with open(filename, 'wb') as f:
                f.write(response.body)
    

    How can I save the data on Scrapinghub, and download it from Scrapinghub after the job is finished?

    First, install the scrapinghub library:

    pip install scrapinghub[msgpack]
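
    As a quick local sanity check (the API key "XXXX" below is a placeholder; client.projects.list() simply returns the ids of the projects the key can access), the install can be verified like this:

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient('XXXX')   # your Scrapy Cloud API key
    print(client.projects.list())        # e.g. [123456]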
    

    I rewrote the spider as Thiago Curvelo suggested and deployed it to my Scrapinghub project:

    Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_yzstvtj8.log
    Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
        _get_apisettings, commands_module='sh_scrapy.commands')
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
        _run(args, settings)
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
        _run_scrapy(args, settings)
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
        execute(settings=settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
        cmd.crawler_process = CrawlerProcess(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
        super(CrawlerProcess, self).__init__(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
        self.spider_loader = _get_spider_loader(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
        return loader_cls.from_settings(settings.frozencopy())
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
        return cls(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
        self._load_all_spiders()
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
        for module in walk_modules(name):
      File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
        submod = import_module(fullpath)
      File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
        __import__(name)
      File "/app/__main__.egg/mySpider/spiders/stockInfo.py", line 4, in <module>
    ImportError: cannot import name ScrapinghubClient
    {"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
    {"status": "error", "message": "Internal error"}
    

    The requirements.txt contains only one line:

    scrapinghub[msgpack]
    

    The scrapinghub.yml contains:

    project: 123456
    requirements:
      file: requirements.tx
    

    Now deploy it:

    D:\mySpider>shub deploy 123456
    Packing version 1.0
    Deploying to Scrapy Cloud project "123456"
    Deploy log last 30 lines:
    
    Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_4u7kb9ml.log
    Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
        _run(args, settings)
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
        _run_scrapy(args, settings)
      File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
        execute(settings=settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
        cmd.crawler_process = CrawlerProcess(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
        super(CrawlerProcess, self).__init__(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
        self.spider_loader = _get_spider_loader(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
        return loader_cls.from_settings(settings.frozencopy())
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
        return cls(settings)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
        self._load_all_spiders()
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
        for module in walk_modules(name):
      File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
        submod = import_module(fullpath)
      File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
        __import__(name)
      File "/tmp/unpacked-eggs/__main__.egg/mySpider/spiders/stockInfo.py", line 5, in <module>
        from scrapinghub import ScrapinghubClient
    ImportError: cannot import name ScrapinghubClient
    {"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
    {"status": "error", "message": "Internal error"}     
    

    1. The problem persists:

    ImportError: cannot import name ScrapinghubClient
    

    2. My local machine runs Windows 7 with only Python 3.7 installed, so why does the error message show:

    File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    

    Is the error message generated on Scrapinghub (the remote side) and just sent back to my local machine for display?

    1 Answer

      Thiago Curvelo · 3 · 5 years ago

    Writing data to disk in a cloud environment isn't reliable nowadays, because everything runs in containers and containers are ephemeral.

    But you can save your data using Scrapinghub's Collections API. You can use it directly through its endpoints, or via the following wrapper: https://python-scrapinghub.readthedocs.io/en/latest/

    With python-scrapinghub, your code would look like this:

    from scrapinghub import ScrapinghubClient
    from contextlib import closing
    
    project_id = '12345'
    apikey = 'XXXX'
    client = ScrapinghubClient(apikey)
    store = client.get_project(project_id).collections.get_store('mystuff')
    
    #...
    
        def parse(self, response):
            company = re.findall("[0-9]{6}",response.url)[0]
            with closing(store.create_writer()) as writer:
                writer.write({
                    '_key': company, 
                    'body': response.body}
                )        
    

    After you save something to a collection, a link will show up in your dashboard:

    (screenshot: the collection link in the project dashboard)
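
    To pull the data back down after the job has finished, the collection can also be read with the same client. A minimal sketch, assuming the project_id, apikey and "mystuff" store from the snippet above (the stored body may come back as str or bytes depending on how it was serialized):

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient('XXXX')
    store = client.get_project('12345').collections.get_store('mystuff')

    for item in store.iter():               # iterate over every record in the collection
        body = item['body']
        if isinstance(body, str):           # normalize to bytes before writing
            body = body.encode('utf-8')
        with open('%s_info.html' % item['_key'], 'wb') as f:
            f.write(body)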

    Edit:

    To make sure the dependencies (scrapinghub[msgpack]) get installed in the cloud, add them to a requirements.txt or a Pipfile and reference it in the scrapinghub.yml file. For example:

    # project_directory/scrapinghub.yml
    
    projects:
      default: 12345
    
    stacks:
      default: scrapy:1.5-py3
    
    requirements:
      file: requirements.txt
    

    ( https://shub.readthedocs.io/en/stable/deploying.html#deploying-dependencies )

    That way, Scrapinghub (the cloud service) will install scrapinghub (the Python library). :)

    I hope this helps.