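As you correctly pointed out, it is usually easier to use a unique HTML identifier. Every HTML element has its own specific attributes. Take the "Date created" button you want to click: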
<button class="_Button_6kisxq _FakeLink_6kisxq _facet-expand-button_13c61c" data-analytics-name="Filter facet toggle Date created" data-test-filter-facet-toggle="Date created" aria-controls="dateCreated-08686056605854187" title="Date created" type="button">
<span>Date created</span>
<svg class="svg-inline--fa fa-caret-down" data-prefix="fas" data-icon="caret-down" aria-hidden="true" focusable="false" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 320 512">
<path fill="currentColor" d="M31.3 192h257.3c17.8 0 26.7 21.5 14.1 34.1L174.1 354.8c-7.8 7.8-20.5 7.8-28.3 0L17.2 226.1C4.6 213.5 13.5 192 31.3 192z"></path>
</svg>
</button>
Its attributes include:
class="_Button_6kisxq _FakeLink_6kisxq _facet-expand-button_13c61c" data-analytics-name="Filter facet toggle Date created" data-test-filter-facet-toggle="Date created" aria-controls="dateCreated-08686056605854187"
Let's go with
data-test-filter-facet-toggle='Date created'
to click the "Date created" filter dropdown!
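To see why the data attribute is a better handle than the class, here is a minimal sketch that runs offline against a small hand-written HTML snippet. The second button ("Creator") is invented purely for illustration and is not taken from the real page.

```r
library(rvest)

# Two toggle buttons sharing the same class; the "Creator" button is made up.
page <- minimal_html('
  <button class="_Button_6kisxq" data-test-filter-facet-toggle="Date created"><span>Date created</span></button>
  <button class="_Button_6kisxq" data-test-filter-facet-toggle="Creator"><span>Creator</span></button>
')

# The class matches both buttons, so it cannot identify the one we want
by_class <- html_elements(page, "._Button_6kisxq")
length(by_class)

# The data attribute matches exactly one element
matches <- html_elements(page, "[data-test-filter-facet-toggle='Date created']")
length(matches)
html_text2(matches)
```

A selector that matches exactly one element is the safest thing to pass to a click.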
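By the way, you can extract this information yourself by clicking the small arrow icon in the top right corner and then selecting the button element.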
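Finally, we can click the button once it is available. We need to wait for the element to appear because the website does not load instantly. This is the first rule of web scraping: always wait for the element to appear. I have built something similar before, on a site where the HTML element names varied and their order changed as well. Load times are not constant either. So I strongly recommend always waiting for the element to appear first, and making your code as robust as possible against changes in the page design. This page of yours seems particularly slow at times!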
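The "always wait" rule can be packaged as a small polling helper. This is only a sketch: `wait_until()` is a hypothetical name, not part of rvest or chromote, and the example uses a timer as a stand-in for a slow page so that it runs without a browser.

```r
# Poll `condition()` until it returns TRUE or `timeout` seconds have passed.
# Returns TRUE on success and FALSE on timeout; errors count as "not yet".
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for a slow page: the condition becomes TRUE after about one second
ready_at <- Sys.time() + 1
loaded <- wait_until(function() Sys.time() >= ready_at, timeout = 5, interval = 0.2)

# A condition that never holds gives up after the timeout
never <- wait_until(function() FALSE, timeout = 1, interval = 0.2)
```

With a live session, the condition could be something like `function() length(html_elements(pr_sess, selector)) > 0`, which would also let you replace fixed `Sys.sleep()` pauses with a real check.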
By the way, you can even export the list of dates and then select any filter value...
Code:
library(rvest)
library(chromote)
library(purrr)
library(tibble)
# Step 1: Open the page in a live browser session
# (read_html_live() starts its own Chromote session, so no separate one is needed)
url <- "https://osf.io/search?resourceType=Registration%2CRegistrationComponent"
pr_sess <- read_html_live(url)
pr_sess$view()
# Step 2: Wait for the "Date created" button to load
timeout <- 10 # Maximum wait time in seconds
found <- FALSE
start_time <- Sys.time()
while (!found && as.numeric(difftime(Sys.time(), start_time, units = "secs")) < timeout) {
  # html_elements() returns an empty set while the button is missing,
  # so check the number of matches rather than testing for NULL
  found <- tryCatch(
    length(html_elements(pr_sess, "[data-test-filter-facet-toggle='Date created']")) > 0,
    error = function(e) FALSE
  )
  if (!found) Sys.sleep(0.5) # Check every 0.5 seconds
}
if (!found) {
  stop("Button did not appear within the timeout period.")
}
# Click the "Date created" dropdown
pr_sess$click("[data-test-filter-facet-toggle='Date created']")
# Step 3: Wait for the dropdown to load
Sys.sleep(2) # Adjust based on load time
# Step 4: Extract the list items of the ul list below the filter
facet_list <- pr_sess %>%
  html_elements("ul._facet-list_13c61c li._facet-value_13c61c")
# Step 5: Parse the extracted items into a data frame
facet_data <- facet_list %>%
  map_df(~ {
    year <- .x %>%
      html_element("button") %>%
      html_text2() %>%
      as.character()
    count <- .x %>%
      html_element("span._facet-count_13c61c") %>%
      html_text2() %>%
      as.integer()
    tibble(year = year, count = count)
  })
# Print the extracted data
print(facet_data)
# To filter, click any of the list values, e.g.:
# pr_sess$click("[data-test-filter-facet-value = '2024']")
This prints the list of available creation dates:
> # Print the extracted data
> print(facet_data)
# A tibble: 14 × 2
   year  count
   <chr> <int>
 1 2024  31020
 2 2023  31001
 3 2022  28099
 4 2021  25604
 5 2020  24456
 6 2019  17142
 7 2018  13833
 8 2017   8751
 9 2016   5688
10 2015   3314
11 2014    954
12 2013    717
13 2012     91
14 2011      2
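Once you have this table, you can build the selector for a filter value from the data instead of hard-coding the year. The sketch below retypes the first three rows by hand so it runs without a browser. `facet_value_selector()` is a hypothetical helper, and the `data-test-filter-facet-value` attribute is taken from the comment in the code above; I have not checked it against the live page here.

```r
# First rows of the output above, typed in by hand for illustration
facet_data <- data.frame(
  year  = c("2024", "2023", "2022"),
  count = c(31020L, 31001L, 28099L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: CSS attribute selector for one facet value
facet_value_selector <- function(value) {
  sprintf("[data-test-filter-facet-value='%s']", value)
}

# Pick the year with the most registrations and build its selector
top_year <- facet_data$year[which.max(facet_data$count)]
selector <- facet_value_selector(top_year)
selector
# [1] "[data-test-filter-facet-value='2024']"

# With the live session from the code above you would then call:
# pr_sess$click(selector)
```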
This should give you a good start! By the way, what do you want to do next?