
How to scrape text in R from a web page that requires interaction

rvest web-scraping r

JHall651 · 6 years ago

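I am trying to scrape reviews from a web page to determine word frequency. However, when a review is long, only part of it is returned. You have to click "More" for the page to display the full review. Below is the code I use to extract the review text. How can I "click" More so that I get the full reviews?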

library(rvest)

tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"

# Download and parse the static HTML of the page
webpage <- read_html(tripAdvisorURL)

# Select every node whose class list contains "partial_entry"
reviewData <- html_nodes(webpage, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "partial_entry", " " ))]')

head(reviewData)

# Text of the first review; long reviews come back truncated
html_text(reviewData[[1]])

[1] "The rooms were clean and we slept so good we had room 10 and 12 we didn't use 12 but it joins 10 .kind of strange but loved the hotel ..me personally I would take the hot tub out it was kinda old..the lady that...More"

1 answer | as of 6 years ago

Yifu Yan · 6 years ago

As mentioned in the comments, you can combine RSelenium with rvest to handle pages that need interaction:

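library(RSelenium)
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

# Start a Selenium server and open a Chrome session
rmDr <- rsDriver(browser = "chrome")

myclient <- rmDr$client
tripAdvisorURL <- "https://www.tripadvisor.com/Hotel_Review-g33657-d85704-Reviews-Hotel_Bristol-Steamboat_Springs_Colorado.html#REVIEWS"
myclient$navigate(tripAdvisorURL)

# Select all "More" links and click each one to expand the reviews
webEles <- myclient$findElements(using = "css", value = ".ulBlueLinks")
for (webEle in webEles) {
    webEle$clickElement()
}

# Grab the rendered page source and hand it to rvest
mypagesource <- myclient$getPageSource()

read_html(mypagesource[[1]]) %>%
    html_nodes(".partial_entry") %>%
    html_text()

# Close the browser and stop the Selenium server when finished
myclient$close()
rmDr$server$stop()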
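
Once the full review text has been extracted, the word-frequency step the question mentions could be sketched roughly as below. This is only a minimal base-R illustration and not part of the original answer: the `reviews` vector is made-up sample data standing in for the `html_text()` result, and `word_freq` is a hypothetical helper name. It assumes plain English text and does no stop-word removal or stemming.

```r
# Hypothetical sample data standing in for the html_text() result
reviews <- c(
  "The rooms were clean and we slept so good",
  "Clean rooms, loved the hotel"
)

# Hypothetical helper: lowercase the text, split on anything that is not
# a letter or apostrophe, and count how often each word occurs
word_freq <- function(texts) {
  words <- unlist(strsplit(tolower(texts), "[^a-z']+"))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)  # most frequent words first
}

freq <- word_freq(reviews)
head(freq)
```

For real review data, a text-mining package may be a better fit than this hand-rolled split, since it would also handle stop words and punctuation edge cases.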
