
yield scrapy.Request() is not working to crawl the next page

  •  0
  •  Vivek Kumar Sinha · asked 7 years ago

    The same code works on a different site, but not on this one!

    Site: http://quotes.toscrape.com/

    It gives no errors and successfully crawls 8 pages (or however many pages count is set to):

        import scrapy

        count = 8
    
        class QuotesSpiderSpider(scrapy.Spider):
            name = 'quotes_spider'
            allowed_domains = ['quotes.toscrape.com']
            start_urls = ['http://quotes.toscrape.com/']
    
            def parse(self, response):
                quotes = response.xpath('//*[@class="quote"]')
    
                for quote in quotes:
                    text = quote.xpath('.//*[@class="text"]/text()').extract_first()
                    author = quote.xpath('.//*[@class="author"]/text()').extract_first()
    
                    yield {
                        'Text' : text,
                        'Author' : author
                    }
    
                global count
                count = count - 1
                if count > 0:
                    next_page = response.xpath('//*[@class="next"]/a/@href').extract_first()
                    absolute_next_page = response.urljoin(next_page)
                    yield scrapy.Request(absolute_next_page)
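
    As an aside, the module-level counter works, but Scrapy also ships a built-in
    way to cap a crawl: the CloseSpider extension's CLOSESPIDER_PAGECOUNT setting.
    Below is a minimal sketch of the same spider without the global counter
    (assuming a Scrapy version from the same era; only the custom_settings entry
    and the guard on next_page are new):

        import scrapy

        class QuotesSpiderSpider(scrapy.Spider):
            name = 'quotes_spider'
            allowed_domains = ['quotes.toscrape.com']
            start_urls = ['http://quotes.toscrape.com/']
            # CloseSpider extension: stop the crawl after 8 downloaded responses
            custom_settings = {'CLOSESPIDER_PAGECOUNT': 8}

            def parse(self, response):
                for quote in response.xpath('//*[@class="quote"]'):
                    yield {
                        'Text': quote.xpath('.//*[@class="text"]/text()').extract_first(),
                        'Author': quote.xpath('.//*[@class="author"]/text()').extract_first(),
                    }
                # always follow the "Next" link; the page-count cap ends the crawl
                next_page = response.xpath('//*[@class="next"]/a/@href').extract_first()
                if next_page:
                    yield scrapy.Request(response.urljoin(next_page))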
    

    But it only crawls page 1 of this site:

    Site: https://www.goodreads.com/list/show/7

    # -*- coding: utf-8 -*-
    import scrapy
    
    count = 5
    
    class BooksSpider(scrapy.Spider):
        name = 'books'
        allowed_domains = ["goodreads.com/list/show/7"]
        start_urls = ["https://goodreads.com/list/show/7/"]
    
        def parse(self, response):
            books = response.xpath('//tr/td[3]')
    
            for book in books:
                bookTitle = book.xpath('.//*[@class="bookTitle"]/span/text()').extract_first()
                authorName = book.xpath('.//*[@class="authorName"]/span/text()').extract_first()
    
                yield {
                    'BookTitle' : bookTitle,
                    'AuthorName' : authorName
                }
    
            global count
            count = count - 1
    
            if count > 0:
                next_page_url = response.xpath('//*[@class="pagination"]/a[@class="next_page"]/@href').extract_first()
                absolute_next_page_url = response.urljoin(next_page_url)
                yield scrapy.Request(url = absolute_next_page_url)
    

    I want to crawl a limited number of pages, or all of the pages, of the second site.

    1 Answer  |  7 years ago

  •  4
  •  Tarun Lalwani · answered 7 years ago

    You are using a domain with a path in allowed_domains:

    allowed_domains = ["goodreads.com/list/show/7"]
    

    It should be:

    allowed_domains = ["goodreads.com"]