“剪切粘贴”是一种有严重缺陷的反模式,适用于可复制代码&分析或自动化(我们在数据科学工作流程中所追求的一切)。
这是让您了解数据元素的起始代码(但您仍有一些“卷起袖子”的工作要做)
library(xml2)
library(magrittr)
# temp holding space for the unzipped PPTX
td <- tempfile("dir")
# unzip it and keep file names
fils <- unzip(tempf.pptx, exdir = td)
# look for chart XML files
charts <- fils[grepl("chart.*\\.xml$", fils)]
# read in the first one
chart <- read_xml(charts[1])
现在,我们已经找到并读取了一个图表XML文件,让我们看看我们是否知道它是哪种图表:
# find charts in the XML (i don't know if there can be more than one per-XML file)
(embedded_charts <- xml_find_all(chart, ".//c:chart/c:plotArea"))
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# get the node root of the first one (again, i'm not sure if there can be more than one)
(first_embed <- embedded_charts[1])
## {xml_nodeset (1)}
## [1] <c:plotArea xmlns:c="http://schemas.openxmlformats.org/drawingml/200 ...
# use it to get the kind of chart so we can target the values with it
(xml_children(first_embed) %>%
xml_name() %>%
grep("Chart", ., value=TRUE) -> embed_kind)
## [1] "lineChart"
(target <- xml_find_first(first_embed, sprintf(".//c:%s", embed_kind)))
## {xml_nodeset (1)}
## [1] <c:lineChart>\n <c:grouping val="standard"/>\n <c:varyColors val=" ...
# extract "column" metadata
col_refs <- xml_find_all(target, ".//c:ser/c:tx/c:strRef")
(xml_find_all(col_refs, ".//c:f") %>%
sapply(xml_text) -> col_specs)
## [1] "sheet1!$B$1" "sheet1!$C$1" "sheet1!$D$1"
(xml_find_all(col_refs, ".//c:v") %>%
sapply(xml_text))
## [1] "setosa" "versicolor" "virginica"
提取“X”元数据&数据:
x_val_refs <- xml_find_all(target, ".//c:cat")
(lapply(x_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> x_val_specs)
## [1] "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36" "sheet1!$A$2:$A$36"
(lapply(x_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> x_vals)
## [,1] [,2] [,3]
## [1,] 4.3 4.3 4.3
## [2,] 4.4 4.4 4.4
## [3,] 4.5 4.5 4.5
## [4,] 4.6 4.6 4.6
## [5,] 4.7 4.7 4.7
## [6,] 4.8 4.8 4.8
## [7,] 4.9 4.9 4.9
## [8,] 5.0 5.0 5.0
## [9,] 5.1 5.1 5.1
## [10,] 5.2 5.2 5.2
## [11,] 5.3 5.3 5.3
## [12,] 5.4 5.4 5.4
## [13,] 5.5 5.5 5.5
## [14,] 5.6 5.6 5.6
## [15,] 5.7 5.7 5.7
## [16,] 5.8 5.8 5.8
## [17,] 5.9 5.9 5.9
## [18,] 6.0 6.0 6.0
## [19,] 6.1 6.1 6.1
## [20,] 6.2 6.2 6.2
## [21,] 6.3 6.3 6.3
## [22,] 6.4 6.4 6.4
## [23,] 6.5 6.5 6.5
## [24,] 6.6 6.6 6.6
## [25,] 6.7 6.7 6.7
## [26,] 6.8 6.8 6.8
## [27,] 6.9 6.9 6.9
## [28,] 7.0 7.0 7.0
## [29,] 7.1 7.1 7.1
## [30,] 7.2 7.2 7.2
## [31,] 7.3 7.3 7.3
## [32,] 7.4 7.4 7.4
## [33,] 7.6 7.6 7.6
## [34,] 7.7 7.7 7.7
## [35,] 7.9 7.9 7.9
提取“Y”元数据和数据:
y_val_refs <- xml_find_all(target, ".//c:val")
(lapply(y_val_refs, xml_find_all, ".//c:f") %>%
sapply(xml_text) -> y_val_specs)
## [1] "sheet1!$B$2:$B$36" "sheet1!$C$2:$C$36" "sheet1!$D$2:$D$36"
(lapply(y_val_refs, xml_find_all, ".//c:v") %>%
sapply(xml_double) -> y_vals)
## [[1]]
## [1] 3.0 3.2 2.3 3.2 3.2 3.0 3.6 3.3 3.8 4.1 3.7 3.4 3.5 3.8 4.0
##
## [[2]]
## [1] 2.4 2.3 2.5 2.7 3.0 2.6 2.7 2.8 2.6 3.2 3.4 3.0 2.9 2.3 2.9 2.8 3.0
## [18] 3.1 2.8 3.1 3.2
##
## [[3]]
## [1] 2.5 2.8 2.5 2.7 3.0 3.0 2.6 3.4 2.5 3.1 3.0 3.0 3.2 3.1 3.0 3.0 2.9
## [18] 2.8 3.0 3.0 3.8
# see if there are X & Y titles
title_nodes <- xml_find_all(first_embed, ".//c:title")
(lapply(title_nodes, xml_find_all, ".//a:t") %>%
sapply(xml_text) -> titles)
## [1] "Sepal.Length" "Sepal.Width"
docxtractr
软件包(用于从Word文档中获取表格)我还没有看到很多关于这个特殊需求的需求,所以我不确定在不久的将来是否会有一个用于上述习惯用法的软件包。