
Scrapy: remove duplicates and output the data as a single list?

  • Exam Orph  · 7 years ago

    import scrapy
    
    class testSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://www.website.com']
    
        def parse(self, response):
            urls = response.css('div.subject_wrapper > a::attr(href)').extract()
            for url in urls:
                url = response.urljoin(url)
                yield scrapy.Request(url=url, callback=self.getData)
    
        def getData(self, response):
            data = {'data': response.css('strong.data::text').extract()}
            yield data
    

    It works fine, but because it returns a separate list of data for each link, the CSV output looks like this:

    "dalegribel,Chad,Ninoovcov,dalegribel,Gotenks,sillydog22"
    
    "kaylachic,jmargerum,kaylachic"
    
    "Kempodancer,doctordbrew,Gotenks,dalegribel"
    
    "Gotenks,dalegribel,jmargerum"
    
    ...
    

    I would like to remove the duplicates and output everything as a single list, like this:

    dalegribel
    Chad
    Ninoovcov
    Gotenks
    ...
    

    1 answer

  • pythad  · 7 years ago

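    I'm not sure whether this can be done with Scrapy's built-in methods, but the Pythonic way is to keep a set of the elements already seen, check each new element against it, and yield only the ones that are not in the set yet: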

    import scrapy
    
    
    class testSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://www.website.com']
        # Values already yielded, shared across all responses
        unique_data = set()
    
        def parse(self, response):
            # Follow every subject link found on the start page
            urls = response.css('div.subject_wrapper > a::attr(href)').extract()
            for url in urls:
                url = response.urljoin(url)
                yield scrapy.Request(url=url, callback=self.getData)
    
        def getData(self, response):
            data_list = response.css('strong.data::text').extract()
            for elem in data_list:
                # Skip empty strings and anything seen on an earlier page
                if elem and (elem not in self.unique_data):
                    self.unique_data.add(elem)
                    # One item per value, so each lands on its own CSV row
                    yield {'data': elem}
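
The deduplication step above does not depend on Scrapy, so it can be tried in isolation. Below is a minimal sketch in plain Python. The names `unique_items` and `pages` are hypothetical and introduced only for illustration, and the sample rows are copied from the CSV output in the question.

```python
def unique_items(pages):
    """Yield one {'data': name} dict per distinct, non-empty name.

    `pages` stands in for the per-link lists that getData extracts;
    it is an assumption made for this sketch, not part of Scrapy.
    """
    seen = set()  # plays the role of unique_data on the spider
    for names in pages:
        for name in names:
            if name and name not in seen:
                seen.add(name)
                yield {'data': name}


# Rows copied from the CSV output shown in the question
pages = [
    ["dalegribel", "Chad", "Ninoovcov", "dalegribel", "Gotenks", "sillydog22"],
    ["kaylachic", "jmargerum", "kaylachic"],
    ["Kempodancer", "doctordbrew", "Gotenks", "dalegribel"],
    ["Gotenks", "dalegribel", "jmargerum"],
]

result = [item['data'] for item in unique_items(pages)]
for name in result:
    print(name)
```

Each name is printed once, in the order it is first seen. In a real crawl, responses may arrive in a different order from the one in which the requests were issued, so the order of the rows can vary between runs even though the set of names stays the same.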