Besides having a flexible toolkit, data science often requires out-of-the-box thinking too (at least it does in my profession).
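But first, a word about PDF files.
I don't think they are what you think they are. "Bold" (or "italic") is not "metadata". You should spend some time reading up on PDF files, since they are complex, nasty, evil things that you will likely run into often when working with data. Read https://stackoverflow.com/a/19777953/1457051 to see what finding bold text actually entails (follow the link there to the 1.8.x pdfbox solution).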
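To make the point concrete: since a PDF has no "bold" attribute, tools usually have to infer boldness from hints such as the name of the font a run of text is set in. The following is only a rough sketch of that idea. The helper looks_bold() and the sample font names are hypothetical, and a name-based guess like this will miss text that is emboldened by other means (for example, by stroking the glyph outlines).

```r
# Hypothetical helper: guess from a font name whether the face is bold.
# This is a heuristic for illustration, not how any particular tool works.
looks_bold <- function(font_name) {
  grepl("bold|black|heavy", font_name, ignore.case = TRUE)
}

# Made-up font names in the style commonly seen inside PDF files
fonts <- c(
  "ABCDEF+TimesNewRoman,Bold",
  "Helvetica",
  "Arial-BoldItalicMT",
  "Times-Roman"
)

bold_guess <- looks_bold(fonts)
print(data.frame(font = fonts, bold = bold_guess))
```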
Back to our irregularly scheduled answer
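While I am one of R's most loyal supporters, not everything needs to be, or should be, done in R. Sure, we'll use R to eventually get your bold text, but we'll use a helper command-line utility to do the work.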
The pdftools package is based on the poppler library. It ships with the source, so "I'm just an R user" folks may not have the full poppler toolset on their system.
Mac users can install it with Homebrew (once you have Homebrew set up):
brew install poppler
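Linux folks know how to do this. Windows users are lost forever (there are Poppler binaries for you, but your time would be better spent switching to a real operating system).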
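Before going further, it may help to confirm from within R that the utility is actually visible on your PATH. This is a small sketch using base R's Sys.which(); the variable name pdftohtml_path is just for illustration.

```r
# Look up the pdftohtml binary on the PATH ("" means it was not found)
pdftohtml_path <- unname(Sys.which("pdftohtml"))

if (nzchar(pdftohtml_path)) {
  message("Found pdftohtml at ", pdftohtml_path)
} else {
  message("pdftohtml was not found; install the poppler utilities first.")
}
```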
Once you've done that, you can use the code below to achieve your goal.
First, we'll write a helper function with plenty of safety bumpers:
#' Uses the command-line pdftohtml utility from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] and [normalizePath()]
#'   will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#'   They should be supplied as you would supply arguments to the `args`
#'   parameter of [system2()].
read_pdf_as_html <- function(path, extra_args = character()) {

  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call. = FALSE)
  }

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # make the path absolute so it still resolves after we change directories
  path <- normalizePath(path)

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)

  # save our current working directory; on exit, restore it first and
  # only then remove the temp directory
  curwd <- getwd()
  on.exit(setwd(curwd), add = TRUE)
  on.exit(unlink(td, recursive = TRUE), add = TRUE)

  # move to the temp space
  setwd(td)
  file.copy(path, td)

  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args

  # with the output prefix "r-doc", the page content ends up in r-docs.html;
  # quote the file name in case it contains spaces
  args <- c(args, extra_args, shQuote(basename(path)), "r-doc")

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res

  # the last line of output has the form "Page-N"
  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")

  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")

}
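The temp-directory housekeeping in the helper is a pattern worth seeing on its own. Below is a minimal sketch of it, assuming nothing beyond base R; with_temp_workdir() is a hypothetical name, not part of any package. Handlers registered with on.exit(..., add = TRUE) run in the order they were registered, so this version restores the working directory before deleting the scratch directory.

```r
# Hypothetical helper: evaluate `code` inside a throwaway working directory
with_temp_workdir <- function(code) {
  td <- tempfile(fileext = "_dir")
  dir.create(td)

  curwd <- getwd()
  # handlers run in registration order: go back first, then clean up
  on.exit(setwd(curwd), add = TRUE)
  on.exit(unlink(td, recursive = TRUE), add = TRUE)

  setwd(td)
  # `code` is a lazy promise, so it is only evaluated here, after setwd()
  force(code)
}

before <- getwd()
scratch_dir <- with_temp_workdir({
  writeLines("scratch", "scratch.txt") # lands in the temp directory
  normalizePath(getwd())
})
```

After the call, the working directory is back where it started and the scratch directory (including scratch.txt) is gone, even if the code inside had thrown an error.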
Now, we'll use it:
doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")
bold_tags <- html_nodes(doc, xpath=".//b")
bold_words <- html_text(bold_tags)
head(bold_words, 20)
## [1] "Preamble"
## [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
## [3] "History"
## [4] "Ancient and Medieval Period"
## [5] "The Introduction of English Law Into India"
## [6] "Mofussal Courts"
## [7] "Legislation"
## [8] "The Indian Contract Act 1872"
## [9] "The Making of the Act"
## [10] "Law of Contract Until 1950"
## [11] "The Law of Contract after 1950"
## [12] "Amendments to This Act"
## [13] "Other Laws Affecting Contracts and Enforcement"
## [14] "Recommendations of the Indian Law Commission"
## [15] "Section 1."
## [16] "Short title"
## [17] "Extent, Commencement."
## [18] "Enactments Repealed."
## [19] "Applicability of the Act"
## [20] "Scheme of the Act"
length(bold_words)
## [1] 1939
No Java required at all, and you've got your bold words.
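To see the XPath step in isolation, here is a minimal sketch that runs the same selection on a hand-written HTML fragment. The fragment is made up and only loosely imitates converter output, and toy_doc and toy_bold are hypothetical names. It also shows one cleanup step you may want: replacing non-breaking spaces, which HTML converters often emit, with ordinary spaces.

```r
library(xml2)
library(rvest)

# A made-up fragment: two bold runs, one italic run, some plain text
toy_html <- paste0(
  "<html><body>",
  "<b>Preamble</b><br/>",
  "Plain text<br/>",
  "<b>Short&#160;title</b> <i>emphasis</i>",
  "</body></html>"
)

toy_doc <- read_html(toy_html)

# same selection as above: every <b> element anywhere in the document
toy_bold <- html_text(html_nodes(toy_doc, xpath = ".//b"))

# swap non-breaking spaces for regular ones and trim the ends
toy_bold <- trimws(gsub("\u00a0", " ", toy_bold, fixed = TRUE))
print(toy_bold)
```

Swapping ".//b" for ".//i" would pick up italic runs in the same way.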
If you do want to go the pdfbox-app route, as Ralf noted, this wrapper makes it easier to work with:
read_pdf_as_html_with_pdfbox <- function(path) {

  # make sure Java is available
  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call. = FALSE)
  }

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # download the pdfbox "app" if not installed
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    httr::GET(
      url = "http://central.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }

  # we're going to write the converted HTML to a temp file
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add = TRUE)

  # quote the paths in case they contain spaces
  c(
    "-jar",
    shQuote(path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar"))),
    "ExtractText",
    "-html",
    shQuote(path),
    shQuote(tf)
  ) -> args

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # system2() returns the exit status when stdout is not captured
  system2(
    command = java,
    args = args
  ) -> res

  if (res != 0) {
    stop("pdfbox-app exited with status ", res, call. = FALSE)
  }

  xml2::read_html(tf)

}