
scrapy-splash dynamic content selector works in shell but not with spider

  •  2
  • Stefan  ·  6 years ago

    I just started using scrapy-splash to retrieve the number of bookings from opentable.com. In the shell, the following works fine:

    $ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'    
    ...
    
    In [1]: response.css('div.booking::text').extract()
    Out[1]: 
    ['Booked 59 times today',
     'Booked 20 times today',
     'Booked 17 times today',
     'Booked 29 times today',
     'Booked 29 times today',
      ... 
    ]
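
    For reference, the Splash render URL used in the shell command above can also be built programmatically. A minimal sketch using only the standard library (the `splash_render_url` helper name is mine; the endpoint and parameters simply mirror the shell command):

```python
from urllib.parse import urlencode

def splash_render_url(target_url, splash='http://localhost:8050',
                      timeout=10, wait=0.5):
    """Build the same render.html URL used in the scrapy shell example."""
    qs = urlencode({'url': target_url, 'timeout': timeout, 'wait': wait})
    return '{}/render.html?{}'.format(splash, qs)
```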
    

    However, this simple spider returns an empty list:

    import scrapy
    from scrapy_splash import SplashRequest

    class TableSpider(scrapy.Spider):
        name = 'opentable'
        start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url=url,
                                    callback=self.parse,
                                    endpoint='render.html',
                                    args={'wait': 1.5},
                                    )

        def parse(self, response):
            yield {'bookings': response.css('div.booking::text').extract()}
    

    Invoked with:

    $ scrapy crawl opentable
    ...
    DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
    {'bookings': []}
    

    I have already tried, without success:

    docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
    

    and also increased the wait time.

    2 Answers  |  6 years ago
        1
  •  3
  •   Druta Ruslan    6 years ago

    I think your problem is the middlewares; first you need to add some settings:

    # settings.py
    
    # uncomment `DOWNLOADER_MIDDLEWARES` and add this settings to it
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    # url of splash server
    SPLASH_URL = 'http://localhost:8050'
    
    # and some splash variables
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    

    Now run Docker:

    sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
    

    After doing all these steps, it returns:

    scrapy crawl opentable
    
    ...
    
    2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
    2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
    {'bookings': [
        'Booked 44 times today',
        'Booked 24 times today',
        'and many others Booked values'
    ]}
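
    As a side note, the same configuration can also be attached to a single spider via Scrapy's `custom_settings` class attribute instead of editing `settings.py`. A minimal sketch, with the values copied from the settings above:

```python
# Per-spider alternative to settings.py: Scrapy merges this dict into the
# project settings for this spider only. Values are copied from the
# scrapy-splash configuration shown above.
SPLASH_CUSTOM_SETTINGS = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPLASH_URL': 'http://localhost:8050',
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
}

# Usage inside the spider class (sketch):
# class TableSpider(scrapy.Spider):
#     custom_settings = SPLASH_CUSTOM_SETTINGS
```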
    
        2
  •  0
  •   Joaquim De la Cruz    6 years ago

    This doesn't work because the page's content is rendered with JavaScript.

    You have several possible solutions:

    1) Use Selenium.

    2) Look at the page's API: if you call the URL <GET https://www.opentable.com/injector/stats/v1/restaurants/<restaurant_id>/reservations> you get the current number of bookings for that specific restaurant (restaurant_id).
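
    The API approach from this answer can be sketched as a small URL-building helper. The endpoint path is copied from the answer above; whether it still works, and what the JSON response looks like, is not verified here, so the helper only constructs the URL:

```python
# Endpoint path taken from the answer above; <restaurant_id> is the only
# variable part.
STATS_URL = ('https://www.opentable.com/injector/stats/v1/'
             'restaurants/{restaurant_id}/reservations')

def build_stats_url(restaurant_id):
    """Build the per-restaurant reservations-stats URL."""
    return STATS_URL.format(restaurant_id=restaurant_id)
```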