Get all reviews from Amazon? Python 3

  •  2
  • CtrlAltF2 BoldX  · 7 years ago

    I am trying to read all the reviews for a product in Python. I have a script, but it doesn't work.

    parser = html.fromstring(page_response)
    XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
    XPATH_REVIEW_SECTION_1 = '//div[@data-hook="reviews-content"]'
    XPATH_REVIEW_SECTION_2 = '//div[@data-hook="review"]'
    
    XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
    XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
    XPATH_PRODUCT_PRICE  = '//span[@id="priceblock_ourprice"]/text()'
    
    raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
    product_price = ''.join(raw_product_price).replace(',','')
    
    raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
    product_name = ''.join(raw_product_name).strip()
    total_ratings  = parser.xpath(XPATH_AGGREGATE_RATING)
    reviews = parser.xpath(XPATH_REVIEW_SECTION_1)
    if not reviews:
        reviews = parser.xpath(XPATH_REVIEW_SECTION_2)
    

    The page is "https://www.amazon.com/productreviews/" + asin + "/", where asin is the product ID (for example, B0718Y23CQ). I get nothing back in reviews. Thanks for your help!

    1 answer  |  as of 7 years ago
  •  1
  •   Alex    6 years ago

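    Well, to be honest, I don't know where some of the XPaths you are using come from, because I couldn't find them on the page. I have reworked your code to try to help: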

    from lxml import html
    import requests
    import json
    asin = 'B0718Y23CQ'
    page_response = requests.get('https://www.amazon.com/product-reviews/' + asin)
    parser = html.fromstring(page_response.content)
    # Select every <div> whose class attribute is exactly "a-section review"
    reviews_html = parser.xpath('//div[@class="a-section review"]')
    reviews_arr = []
    for review in reviews_html:
        # The leading "." keeps each XPath relative to the current review element
        review_dic = {}
        review_dic['title'] = review.xpath('.//a[@data-hook="review-title"]/text()')
        review_dic['rating'] = review.xpath('.//a[@class="a-link-normal"]/@title')
        review_dic['author'] = review.xpath('.//a[@data-hook="review-author"]/text()')
        review_dic['date'] = review.xpath('.//span[@data-hook="review-date"]/text()')
        review_dic['purchase'] = review.xpath('.//span[@data-hook="avp-badge"]/text()')
        review_dic['review_text'] = review.xpath('.//span[@data-hook="review-body"]/text()')
        review_dic['helpful_votes'] = review.xpath('.//span[@data-hook="helpful-vote-statement"]/text()')
        reviews_arr.append(review_dic)
    # Every value is a list, because xpath() always returns a list of matches
    print(json.dumps(reviews_arr, indent=4))
    

    Each entry in the output looks like this:

        {
            "title": [
                "I find it very useful, I use for anything I need"
            ],
            "rating": [
                "5.0 out of 5 stars"
            ],
            "author": [
                "Nicoletta Delon"
            ],
            "date": [
                "on January 2, 2018"
            ],
            "purchase": [
                "Verified Purchase"
            ],
            "review_text": [
                "I like this a lot. I use it a lot. It's a medium to small size but it holds a lot."
            ],
            "helpful_votes": [
                "\n      One person found this helpful.\n    "
            ]
        }
    

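    Now you need to clean up the results: take each value out of its list and guard against empty elements, and I think you will have what you need. To get all the reviews, you have to iterate over the pages: append ?pageNumber=1 to the URL and keep incrementing that number. You can use proxies to avoid having your IP blocked in case you make many requests.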
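
    The cleanup and paging steps described above could be sketched roughly as follows. This is only an illustration, not a tested scraper: the helper names (first_or_none, clean_review, review_page_url, collect_all_reviews), the max_pages cap, and the fetch_reviews callable are all hypothetical. fetch_reviews stands in for the request-and-XPath code from the answer, taking a URL and returning that page's list of review_dic values. Stopping at the first page with no reviews is also an assumption about how the site behaves.

```python
def first_or_none(values):
    """Return the first non-empty, stripped string in an XPath result list, or None."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return None


def clean_review(raw_review):
    """Flatten one review_dic: each list value becomes a single string or None."""
    return {key: first_or_none(values) for key, values in raw_review.items()}


def review_page_url(asin, page_number):
    """Build the URL of one review page; pageNumber starts at 1."""
    return f"https://www.amazon.com/product-reviews/{asin}?pageNumber={page_number}"


def collect_all_reviews(asin, fetch_reviews, max_pages=50):
    """Walk pages 1, 2, 3, ... and stop at the first page that yields no reviews.

    fetch_reviews is a hypothetical callable: it takes a URL and returns the
    list of raw review_dic dictionaries scraped from that page.
    max_pages is an arbitrary safety cap so the loop always terminates.
    """
    all_reviews = []
    for page_number in range(1, max_pages + 1):
        raw_reviews = fetch_reviews(review_page_url(asin, page_number))
        if not raw_reviews:
            break
        all_reviews.extend(clean_review(raw) for raw in raw_reviews)
    return all_reviews
```

    Taking only the first non-empty string keeps the sketch short. If a review body is split across several text nodes, joining all of the pieces may be the better choice for review_text.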