
How to scrape text from a web page that requires interaction, in R

  • JHall651  ·  6 years ago

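    I am trying to scrape reviews from a web page to determine word frequency. However, when a review is long, only part of it is shown. You have to click "More" for the page to display the full review. Below is the code I use to extract the review text. How can I "click" More to get the full reviews?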

    library(rvest)
    
    tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
    
    webpage <- read_html(tripAdvisorURL)
    
    # Select every node whose class list contains "partial_entry"
    reviewData <- html_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')
    
    head(reviewData)
    
    html_text(reviewData[[1]])
    
    [1] "The rooms were clean and we slept so good we had room 10 and 12 we 
    didn’t use 12 but it joins 10 .kind of strange but loved the hotel ..me 
    personally I would take the hot tub out it was kinda old..the lady 
    that...More"
    
    1 Answer  |  last active 6 years ago
  •   Yifu Yan    6 years ago

    As mentioned in the comments, you can combine RSelenium with rvest to handle the interaction:

    library(RSelenium)
    library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
    
    rmDr <- rsDriver(browser = "chrome")
    
    myclient <- rmDr$client
    tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
    myclient$navigate(tripAdvisorURL)
    # Select every "More" link, then click each one to expand the reviews
    webEles <- myclient$findElements(using = "css",value = ".ulBlueLinks")
    for (webEle in webEles) {
        webEle$clickElement()
    }
    
    mypagesource <- myclient$getPageSource()
    
    read_html(mypagesource[[1]]) %>%
        html_nodes(".partial_entry") %>%
        html_text()
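
Once the review text has been extracted, the word-frequency step from the question could look roughly like the sketch below. It is only an illustration in base R: the `reviews` vector is a made-up stand-in for the output of `html_text()`, the helper names `is_truncated` and `word_frequency` are hypothetical, and the tokenizer assumes plain English text (anything other than a-z and the straight apostrophe is treated as a separator).

```r
# Hypothetical sample standing in for the character vector returned by html_text().
reviews <- c(
  "The rooms were clean and the hotel was quiet.",
  "Clean rooms, friendly staff, old hot tub...More"
)

# TRUE for entries that still end in the "...More" link text,
# i.e. reviews that were not expanded before the page source was read.
is_truncated <- function(x) grepl("\\.\\.\\.More$", x)

# Count word occurrences across all reviews, most frequent first.
word_frequency <- function(x) {
  x <- sub("\\.\\.\\.More$", "", x)                   # drop the trailing link text
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))   # split on non-letter characters
  words <- words[nzchar(words)]                       # discard empty tokens
  sort(table(words), decreasing = TRUE)
}

is_truncated(reviews)
head(word_frequency(reviews))
```

If `is_truncated()` still returns `TRUE` for some entries after running the RSelenium code, the corresponding "More" links were probably not clicked, so the counts for those reviews would be based on partial text.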