
How to save a screenshot of a form with Splash on Scrapinghub

Luis Ramon Ramirez Rodriguez · asked 5 years ago

    I have a Splash subscription on Scrapinghub and I want to use it from a script running on my local machine. The instructions I have followed so far are:

    1) Edit the settings file:

    # This value comes from my Scrapinghub account
    SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'
    
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    

    This is where my doubt comes in: when I open the Splash server in a browser it asks me for a username and password, and I don't know where to set those in Scrapy.


    import scrapy
    import json
    from scrapy import Request
    from scrapy_splash import SplashRequest

    data_dir = 'data/'  # directory where scraped pages are saved


    class ListSpider(scrapy.Spider):

        name = 'list'
        allowed_domains = ['medium.com']  # domains only, not full URLs
        start_urls = ['https://medium.com/']

        def parse(self, response):
            print(response.body)
            with open('data/cookies_file.json') as f:
                cookies_data = json.loads(f.read())[0]
            url = 'https://medium.com/'
            # cookies=cookies_data could also be passed to the request
            yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})

        def afterlogin(self, response):
            with open(data_dir + 'after_login_page.html', 'w') as f:
                f.write(response.text)  # decoded body instead of str(bytes)
    

    I don't get any errors, but I can't tell whether Splash is actually doing anything. Also, besides the server URL, Scrapinghub provides a password (the API key), and I don't know where to use it in this script.
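
    One way to check whether Splash itself is reachable and the credentials work is to call the hosted instance directly; a minimal sketch, assuming Scrapinghub's hosted Splash accepts the API key as the HTTP Basic-auth username with an empty password:

    import requests

    SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'  # value from settings.py above
    API_KEY = '<your-scrapinghub-api-key>'              # hypothetical placeholder

    # Ask Splash to render a page to PNG; a 200 response with image bytes
    # confirms both connectivity and authentication.
    resp = requests.get(SPLASH_URL + '/render.png',
                        params={'url': 'https://medium.com/'},
                        auth=(API_KEY, ''))
    resp.raise_for_status()
    with open('splash_test.png', 'wb') as f:
        f.write(resp.content)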

    After switching to SplashRequest and adding the API key, this is the log output I get; the site's content still doesn't load.

    2019-07-17 10:10:08 [scrapy.core.engine] INFO: Spider opened
    2019-07-17 10:10:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-07-17 10:10:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2019-07-17 10:10:09 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
    2019-07-17 10:10:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
    2019-07-17 10:10:13 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
    2019-07-17 10:10:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
    2019-07-17 10:10:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
    2019-07-17 10:10:21 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
    2019-07-17 10:10:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
    2019-07-17 10:10:26 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-07-17 10:10:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'downloader/request_bytes': 2952,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 1,
     'downloader/request_method_count/POST': 3,
     'downloader/response_bytes': 28104,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 7, 17, 14, 10, 26, 292646),
     'log_count/DEBUG': 5,
     'log_count/INFO': 8,
     'log_count/WARNING': 3,
     'memusage/max': 54104064,
     'memusage/startup': 54104064,
     'request_depth_max': 2,
     'response_received_count': 3,
     'retry/count': 1,
     'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6,
     'splash/render.html/request_count': 2,
     'splash/render.html/response_count/200': 2,
     'start_time': datetime.datetime(2019, 7, 17, 14, 10, 8, 200073)}
    2019-07-17 10:10:26 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Edit:

    This is the full log I get:

    INFO: Scrapy 1.5.2 started (bot: meetmindfull)
    INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Mar 27 2019, 22:11:17) - [GCC 7.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.6.1, Platform Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
    INFO: Overridden settings: {'BOT_NAME': 'meetmindfull', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'LOG_FILE': 'log.txt', 'LOG_FORMAT': '%(levelname)s: %(message)s', 'NEWSPIDER_MODULE': 'meetmindfull.spiders', 'SPIDER_MODULES': ['meetmindfull.spiders']}
    INFO: Telnet Password: 4a122ec20dcf75e1
    INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats']
    INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy_splash.SplashCookiesMiddleware',
     'scrapy_splash.SplashMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy_splash.SplashDeduplicateArgsMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    INFO: Enabled item pipelines:
    []
    INFO: Spider opened
    INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    DEBUG: Telnet console listening on 127.0.0.1:6023
    WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
    DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
    WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
    DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
    DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
    WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
    DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
    INFO: Closing spider (finished)
    INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'downloader/request_bytes': 2952,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 1,
     'downloader/request_method_count/POST': 3,
     'downloader/response_bytes': 28096,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 7, 17, 14, 47, 46, 604347),
     'log_count/DEBUG': 5,
     'log_count/INFO': 8,
     'log_count/WARNING': 3,
     'memusage/max': 54267904,
     'memusage/startup': 54267904,
     'request_depth_max': 2,
     'response_received_count': 3,
     'retry/count': 1,
     'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6,
     'splash/render.html/request_count': 2,
     'splash/render.html/response_count/200': 2,
     'start_time': datetime.datetime(2019, 7, 17, 14, 47, 28, 791792)}
    INFO: Spider closed (finished)
    
Answer · Tarun Lalwani · 5 years ago

    If you look at their example file, they have already shown how to use it:

    https://github.com/scrapy-plugins/scrapy-splash/blob/e40ca4f3b367ab463273bee1357d3edfe0601f0d/example/scrashtest/spiders/quotes.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    
    from scrapy_splash import SplashRequest
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["toscrape.com"]
        start_urls = ['http://quotes.toscrape.com/']
    
        # http_user = 'splash-user'
        # http_pass = 'splash-password'
    
        def parse(self, response):
            ...
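
    The commented-out http_user / http_pass lines are the part you were missing: scrapy-splash relies on Scrapy's built-in HttpAuthMiddleware, so the Splash credentials are set as class attributes on the spider. A minimal sketch, assuming Scrapinghub's hosted Splash takes your API key as the Basic-auth username with an empty password:

    class ListSpider(scrapy.Spider):
        name = 'list'
        # Assumption: the hosted Splash instance authenticates with the API
        # key from your Scrapinghub account as the username and an empty
        # password. Replace the placeholder with your real key.
        http_user = '<your-scrapinghub-api-key>'
        http_pass = ''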
    

    You also need to yield a SplashRequest instead of a plain Request; as written, your code never actually goes through Splash:

    yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})

    should become:

    yield SplashRequest(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})
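
    Finally, since the question title asks about saving the screenshot: render.html only returns HTML, so to get the PNG back, request the render.json endpoint and decode the base64-encoded 'png' field that scrapy-splash exposes as response.data. A minimal sketch following the scrapy-splash README (the spider name and output paths are hypothetical):

    import base64

    import scrapy
    from scrapy_splash import SplashRequest


    class ScreenshotSpider(scrapy.Spider):
        name = 'screenshot'                      # hypothetical spider name
        start_urls = ['https://medium.com/']

        def parse(self, response):
            # render.json returns a JSON object; html=1 / png=1 ask Splash to
            # include the rendered HTML and a base64-encoded screenshot in it.
            yield SplashRequest(response.url, callback=self.afterlogin,
                                endpoint='render.json',
                                args={'html': 1, 'png': 1})

        def afterlogin(self, response):
            # For render.json, scrapy-splash exposes the parsed JSON body as
            # response.data; the 'png' field holds base64-encoded image data.
            png_bytes = base64.b64decode(response.data['png'])
            with open('after_login_page.png', 'wb') as f:  # hypothetical path
                f.write(png_bytes)
            with open('after_login_page.html', 'w') as f:
                f.write(response.data['html'])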
    