Using Scrapy's LinkExtractor in Python

spyder scrapy python

RedCrusador · Tech community · 6 years ago

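I am trying to read an index page and scrape the quote categories from a quotes website, as a way to learn Scrapy. I am new to this!

My code can read the individual category pages, but I would like it to start from the index page and follow the links to the quote pages.

The def parse_item part works on the individual pages. However, I cannot get the LinkExtractor part to pick up the links.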

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['website.com']
    start_urls = [
        'https://www.website.com/topics'
    ]

    rules = (
        Rule(LinkExtractor(allow=('^\/topics.*', )), callback='parse_item')  
    )


    def parse_item(self, response):
        for quote in response.css('#quotesList .grid-item'):
            yield {
                'text': quote.css('a.oncl_q::text').extract_first(),
                'author': quote.css('a.oncl_a::text').extract_first(),
                'tags': quote.css('.kw-box a.oncl_list_kc::text').extract(),
                'category': response.css('title::text').re(r'(\w+).*')
            }

        next_page = response.css('div.bq_s.hideInfScroll > nav > ul > li:nth-last-child(1) a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

2 replies | up to 6 years ago

Иван Васильев 6 years ago

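Here is your mistake:

yield scrapy.Request(next_page, callback=self.parse)

Where is your parse method? The spider does not define one, so the callback should point at parse_item.

Change it like this:

 yield response.follow(url=next_page, callback=self.parse_item)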
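
A side note on why response.follow can take the raw href: as I understand it, it resolves relative links against the page URL in the same way as urllib.parse.urljoin, so no separate response.urljoin call is needed. A minimal sketch of that resolution using only the standard library; the page URL and hrefs below are made up for illustration and are not taken from the real site:

```python
from urllib.parse import urljoin

# Hypothetical page URL and pagination hrefs (assumptions for illustration).
page_url = "https://www.website.com/topics/love"
root_relative_href = "/topics/love_2"
sibling_href = "love_3"

# A root-relative href replaces the whole path of the page URL.
next_from_root = urljoin(page_url, root_relative_href)

# A bare relative href replaces only the last path segment.
next_from_sibling = urljoin(page_url, sibling_href)

print(next_from_root)     # https://www.website.com/topics/love_2
print(next_from_sibling)  # https://www.website.com/topics/love_3
```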

marc_s HarisH Sharma 6 years ago

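I have solved this. There may well be a way to do it with Rule(LinkExtractor(...)), but I used a cascade of response.css queries to follow the links on the topics page.

This is the final working version: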

import scrapy


class QuotesBrainy(scrapy.Spider):
    name = 'Quotes'

    start_urls = ['https://www.website.com/topics/']

    def parse(self, response):
        # follow links to topic pages
        for href in response.css('a.topicIndexChicklet::attr(href)'):
            yield response.follow(href, self.parse_item)

    def parse_item(self, response):
        # iterate through all quotes
        for quote in response.css('#quotesList .grid-item'):
            yield {
                'text': quote.css('a.oncl_q::text').extract_first(),
                'author': quote.css('a.oncl_a::text').extract_first(),
                'tags': quote.css('.kw-box a.oncl_list_kc::text').extract(),
                'category': response.css('title::text').re(r'(\w+).*')
            }

        # go through the pagination links to access infinite scroll
        next_page = response.css('div.bq_s.hideInfScroll > nav > ul > li:nth-last-child(1) a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_item)
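
On the Rule(LinkExtractor(...)) route from the question, a few things probably got in the way. Rules are only processed by scrapy.spiders.CrawlSpider, not by a plain scrapy.Spider. The rules attribute has to be an iterable, and parentheses without a trailing comma do not make a tuple. And, as far as I can tell, LinkExtractor searches its allow patterns against the absolute URL of each link, so a pattern anchored with ^ at /topics cannot match. The last two points can be checked with the standard library alone. This is only a sketch: the URL and the alternative pattern are assumptions for illustration, and the string "rule" stands in for a real Rule object.

```python
import re

# Hypothetical absolute URL, as a link extractor would see it after
# resolving a relative href such as "/topics/love" against the page URL.
url = "https://www.website.com/topics/love"

# Pattern from the question: anchored at the start of the string.
anchored = re.compile(r"^\/topics.*")
# Assumed alternative: no anchor, so it can match inside an absolute URL.
unanchored = re.compile(r"/topics/.+")

anchored_hit = anchored.search(url) is not None
unanchored_hit = unanchored.search(url) is not None
print(anchored_hit, unanchored_hit)  # False True

# Parentheses alone do not create a tuple; the trailing comma does.
# "rule" is a placeholder for a Rule(...) instance.
not_a_tuple = ("rule")
one_item_tuple = ("rule",)
print(type(not_a_tuple).__name__, type(one_item_tuple).__name__)  # str tuple
```

If you go this way, also avoid naming your callback parse in a CrawlSpider, since CrawlSpider uses that method internally to apply the rules.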
