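There's a wayback package (GL) (GH) that supports querying the Internet Archive and reading in the HTML of saved pages ("mementos"). You can read up on that terminology via http://www.mementoweb.org/guide/quick-intro/ and https://mementoweb.org/guide/rfc/ as starter resources.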
library(wayback) # devtools::install_git(one of the superscript'ed links above)
library(rvest) # for reading the resulting HTML contents
library(tibble) # mostly for prettier printing of data frames
There are a number of approaches one could take here. This is what I tend to do during forensic analysis of online content. YMMV.
First, we get the recorded mementos (basically a short list of relevant content):
(rss <- get_mementos("http://www.dailyecho.co.uk/news/district/winchester/rss/"))
## # A tibble: 7 x 3
## link rel ts
## <chr> <chr> <dttm>
## 1 http://www.dailyecho.co.uk/news/district/winchester/rss/ original NA
## 2 http://web.archive.org/web/timemap/link/http://www.dailyecho.co… timemap NA
## 3 http://web.archive.org/web/http://www.dailyecho.co.uk/news/dist… timegate NA
## 4 http://web.archive.org/web/20090517035444/http://www.dailyecho.… first me… 2009-05-17 03:54:44
## 5 http://web.archive.org/web/20180712045741/http://www.dailyecho.… prev mem… 2018-07-12 04:57:41
## 6 http://web.archive.org/web/20180812213013/http://www.dailyecho.… memento 2018-08-12 21:30:13
## 7 http://web.archive.org/web/20180812213013/http://www.dailyecho.… last mem… 2018-08-12 21:30:13
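The calendar-menu viewer at IA is really the "timemap". I like to work with it because it is the point-in-time memento list of all the crawls. It's the second link above, so we'll read it in: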
(tm <- get_timemap(rss$link[2]))
## # A tibble: 45 x 5
## rel link type from datetime
## <chr> <chr> <chr> <chr> <chr>
## 1 original http://www.dailyecho.co.uk:80/news/d… NA NA NA
## 2 self http://web.archive.org/web/timemap/l… applicatio… Sun, 17 May … NA
## 3 timegate http://web.archive.org NA NA NA
## 4 first memento http://web.archive.org/web/200905170… NA NA Sun, 17 May 20…
## 5 memento http://web.archive.org/web/200908130… NA NA Thu, 13 Aug 20…
## 6 memento http://web.archive.org/web/200911121… NA NA Thu, 12 Nov 20…
## 7 memento http://web.archive.org/web/201001121… NA NA Tue, 12 Jan 20…
## 8 memento http://web.archive.org/web/201007121… NA NA Mon, 12 Jul 20…
## 9 memento http://web.archive.org/web/201011271… NA NA Sat, 27 Nov 20…
## 10 memento http://web.archive.org/web/201106290… NA NA Wed, 29 Jun 20…
## # ... with 35 more rows
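Only the rows whose rel value mentions "memento" point at archived snapshots; the original, self, and timegate rows are bookkeeping. A minimal sketch of pulling those rows out, using a stand-in data frame shaped like the timemap above (the tm_demo name and the example.com links are placeholders for illustration, not real output):

```r
# Stand-in for the timemap tibble; the example.com links are placeholders.
tm_demo <- data.frame(
  rel = c("original", "self", "timegate",
          "first memento", "memento", "memento"),
  link = c(
    "http://example.com/",
    "http://web.archive.org/web/timemap/link/http://example.com/",
    "http://web.archive.org",
    "http://web.archive.org/web/20090517035444/http://example.com/",
    "http://web.archive.org/web/20090813000000/http://example.com/",
    "http://web.archive.org/web/20091112000000/http://example.com/"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose relation mentions "memento".
snapshots <- tm_demo[grepl("memento", tm_demo$rel, fixed = TRUE), ]

# The earliest snapshot is the first remaining row.
first_snapshot <- snapshots$link[1]
```

Filtering on rel rather than hard-coding a row number keeps the code working if the number of header rows ever differs.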
The content is in the mementos, and there should be as many mementos as you see in the calendar view. We'll read in the first one:
mem <- read_memento(tm$link[4]) # row 4 is the "first memento" in the timemap above
# Ideally use writeLines(), now, to save this to disk with a good
# filename. Alternatively, stick it in a data frame with metadata
# and saveRDS() it. But, that's not a format others (outside R) can
# use so perhaps do the data frame thing and stream it out as ndjson
# with jsonlite::stream_out() and compress it during save or afterwards.
Then convert it to something we can use programmatically with xml2::read_xml() or xml2::read_html() (RSS is sometimes better parsed as XML):
read_html(mem)
## {xml_document}
## <html>
## [1] <body><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daily Ec ...
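read_memento() has an as parameter to parse the result automatically, but I like to store the mementos locally (as noted in the comments) so as not to abuse the IA servers (i.e., if I ever need the data again, I don't have to hit their infrastructure).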
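The local-storage idea from the code comments might look something like this. It is a minimal sketch, not part of wayback: the save_mementos() helper, the column names, and the stand-in body text are assumptions for illustration; jsonlite::stream_out() and jsonlite::stream_in() are the only library calls involved.

```r
library(jsonlite)

# Hypothetical helper (not part of wayback): write one record per line
# as gzip-compressed ndjson so that tools outside R can read it too.
save_mementos <- function(records, path) {
  # stream_out() opens and closes the connection itself.
  stream_out(records, gzfile(path), verbose = FALSE)
  invisible(path)
}

# Stand-in data; in practice `body` would hold what read_memento() returned.
records <- data.frame(
  url = "http://www.dailyecho.co.uk/news/district/winchester/rss/",
  fetched = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
  body = "<rss version=\"2.0\"><channel><title>stand-in</title></channel></rss>",
  stringsAsFactors = FALSE
)

path <- save_mementos(records, tempfile(fileext = ".json.gz"))

# Reading the file back later avoids another request to the archive.
restored <- stream_in(gzfile(path), verbose = FALSE)
```

Keeping the URL and fetch time next to the body means the file remains self-describing when it is opened months later.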
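A big caveat: if you try to fetch too many resources from IA in a short period of time, you will get temporarily banned. They operate at scale, but it's a free service and they (rightfully) try to prevent abuse.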
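One simple way to stay on the polite side is to pause between requests. The sketch below is an assumption-laden illustration: fetch_politely() is a hypothetical helper, not part of wayback, and the 5-second default is an arbitrary guess, not a documented IA limit.

```r
# Hypothetical helper (not part of wayback): call `fetch` on each URL,
# pausing `delay` seconds between requests to go easy on the server.
fetch_politely <- function(urls, fetch, delay = 5) {
  out <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay) # no pause needed before the first request
    out[[i]] <- fetch(urls[i])
  }
  names(out) <- urls
  out
}

# Offline demonstration with a stand-in fetcher; in practice `fetch`
# could be wayback::read_memento with a delay of several seconds.
demo <- fetch_politely(c("a", "b", "c"), fetch = toupper, delay = 0.1)
```

Combined with local storage, this keeps the number of requests to the archive at one per memento.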