
Scrapy spider returning the same element over and over

scrapy-spider scrapy xpath python

Chris Jewell · 7 years ago

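I've run into a problem with a spider I'm putting together. I'm trying to scrape each line of the transcript from this site, and I've found some suitable selectors, but when I run the spider its output just repeats the same line. I've seen a couple of other people with similar problems (like this), but haven't found an answer that solves mine.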

(Note: I think this may be down to my basic Python coding and the way I've built the for loop, rather than Scrapy itself.)

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TalSpider(CrawlSpider):
    name = 'tal'
    allowed_domains = ['https://www.thisamericanlife.org/radio-archives/episode/']
    start_urls = ['https://www.thisamericanlife.org/radio-archives/episode/1/transcript/']

    def parse(self, response):

        for line in response.xpath('//div'):
            episode_num_text = line.xpath('//div[contains(@class, "radio-wrapper")]/@id').extract()
            radio_date_text = line.xpath('//div[contains(@class, "radio-date")]/text()').extract()
            episode_title = line.xpath('//h2').xpath('a[contains(@href, *)]/text()').extract()
            begin_timestamp = line.xpath('//p[contains(@begin, *)]/@begin').extract()
            speaker_class = line.xpath('//div/@class').extract()
            speaker_name = line.xpath('//h4/text()').extract()
            line_text = line.xpath('//p[contains(@begin, *)]/text()').extract()
            full_audio_link = line.xpath('//p[contains(@class, "full-audio")]/text()').extract()

            for item in zip(episode_num_text, radio_date_text, episode_title, begin_timestamp, speaker_class, speaker_name, line_text, full_audio_link):
                scraped_info = {
                    'episode_num_text': item[0],
                    'radio_date_text': item[1],
                    'episode_title': item[2],
                    'begin_timestamp': item[3],
                    'speaker_class': item[4],
                    'speaker_name': item[5],
                    'line_text': item[6],
                    'full_audio_link': item[7],
                }
                yield scraped_info

Here is a screenshot of the .csv output, which shows the repeated output.

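The problem seems to be in the loop. My thinking was: for each selector in this list of selectors, extract the parts of that element defined by the items inside the for loop. Instead, it seems to be doing this: for each of the 177 selectors in the list, return the first element of each item defined.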

I'm happy to clarify any of this, and would be very grateful for any help anyone can offer!

1 answer | 7 years ago

rojeeer · 7 years ago

Please pay attention to the difference between absolute XPath and relative XPath in Scrapy.

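In parse, you iterate over elements that were selected with an absolute XPath. Inside the loop, however, you are still using absolute XPaths (starting with //), which is wrong: an absolute XPath always searches from the document root and ignores the element you are currently on. Inside the loop they should be relative XPaths (starting with ./ or .//).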

Thanks
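A related detail may explain why each pass of the loop produces exactly one row. zip() stops at its shortest input, so if any one of the extracted lists has a single entry (the episode id is a likely candidate, but that is an assumption about the page), the inner loop yields only the first tuple. The sketch below uses made-up values and a hypothetical helper name, with no Scrapy involved.

```python
def rows_per_iteration(n_iterations):
    """Simulates the outer loop when every list is extracted from the whole document."""
    # Made-up data: one list has a single entry, the others have several.
    episode_ids = ['1']
    timestamps = ['00:00:00', '00:00:05', '00:00:09']
    texts = ['line one', 'line two', 'line three']

    rows = []
    for _ in range(n_iterations):
        # zip() truncates to the shortest list, so only the first tuple survives.
        for item in zip(episode_ids, timestamps, texts):
            rows.append(item)
    return rows


rows = rows_per_iteration(3)
print(rows)
```

Every iteration adds the same first row, which matches the repeated lines in the CSV. Once the XPaths are relative, each list only holds values from the current element, and building the dict directly from those values may be simpler than zipping them.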
