
Scrapy: following the "previous year" links

scrapy-spider scrapy web-scraping python

Elgin Cahangirov · 6 years ago

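I am trying to use Scrapy to follow the previous-year links, starting from the URL 'https://umanity.jp/en/racedata/race_6.php'. On this page the current year is 2018 and there is a "previous" button. Clicking that button goes to 2017, then 2016, and so on back to 2000. However, the spider I wrote stops at 2017. My code: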

import scrapy


class RaceSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['umanity.jp']
    start_urls = ['https://umanity.jp/en/racedata/race_6.php']  # start to scrape from this url

    def parse(self, response):
        previous_year_btn = response.xpath('//div[@class="newslist_year_select m_bottom5"]/*[1]')
        if previous_year_btn.extract_first()[1] == 'a':
            href = previous_year_btn.xpath('./@href').extract_first()
            follow_link = response.urljoin(href)
            yield scrapy.Request(follow_link, self.parse_years)

    def parse_years(self, response):
        print(response.url)  # prints only year 2017

I don't understand why it stops at 2017 and doesn't go on to the earlier years. What is wrong?

2 answers | last activity 6 years ago

Zev 6 years ago

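Your parse_years callback only prints the URL. It never looks for more links, so the crawl ends after the first followed page.

Switch:
yield scrapy.Request(follow_link, self.parse_years)
to:
yield scrapy.Request(follow_link, self.parse)
so that the parse function keeps looking for links on every page it reaches.

If you want to keep parse_years, have it hand the response back to parse: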

def parse_years(self, response):
    print(response.url)  # now runs for every followed page, not just 2017
    yield from self.parse(response)
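
To see why the original spider stops after one hop, here is a minimal sketch that uses no Scrapy at all. It assumes a toy "engine" that calls each request's callback and queues whatever the callback yields. The names run, parse_once, parse_terminal and parse_recursive are hypothetical stand-ins for illustration, and the year range comes from the question.

```python
from collections import deque

FIRST_YEAR, LAST_YEAR = 2000, 2018  # year range described in the question


def run(start_callback):
    """Toy engine: call each request's callback and queue whatever it yields."""
    visited = []
    queue = deque([(LAST_YEAR, start_callback)])
    while queue:
        year, callback = queue.popleft()
        visited.append(year)
        # A callback may return None (no new requests) or an iterable of requests.
        queue.extend(callback(year) or [])
    return visited


def parse_terminal(year):
    """Like the question's parse_years: looks at the page, yields nothing."""
    return None


def parse_once(year):
    """Like the question's parse: follows one link into a dead-end callback."""
    if year > FIRST_YEAR:
        yield (year - 1, parse_terminal)


def parse_recursive(year):
    """Like the fixed parse: the followed page goes back to the same callback."""
    if year > FIRST_YEAR:
        yield (year - 1, parse_recursive)


print(run(parse_once))            # [2018, 2017]
print(len(run(parse_recursive)))  # 19
```

The crawl only continues as long as some callback keeps yielding new requests, which is what both fixes in this answer restore.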

SIM 6 years ago

Use self.parse instead of self.parse_years as the callback:

import scrapy


class RaceSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['umanity.jp']
    start_urls = ['https://umanity.jp/en/racedata/race_6.php']  # start to scrape from this url

    def parse(self, response):
        previous_year_btn = response.xpath('//div[contains(@class,"newslist_year_select")]/a')
        # default='' avoids a TypeError when the page has no matching image
        if 'race_prev.gif' in previous_year_btn.xpath('.//img/@src').extract_first(default=''):
            href = previous_year_btn.xpath('./@href').extract_first()
            yield scrapy.Request(response.urljoin(href), self.parse)
            print(response.url)

However, to keep the second method (parse_years) in use:

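def parse(self, response):
    # This is the fix: request the same URL again for parse_years.
    # dont_filter=True is needed because Scrapy's duplicate filter would
    # otherwise drop a request for a URL it has already scheduled.
    yield scrapy.Request(response.url, self.parse_years, dont_filter=True)

    previous_year_btn = response.xpath('//div[contains(@class,"newslist_year_select")]/a')
    # default='' avoids a TypeError when the page has no matching image
    if 'race_prev.gif' in previous_year_btn.xpath('.//img/@src').extract_first(default=''):
        href = previous_year_btn.xpath('./@href').extract_first()
        yield scrapy.Request(response.urljoin(href), self.parse)

def parse_years(self, response):
    print(response.url)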
