textplot文档名称取自
docnames
语料库的。在这种情况下,您希望创建按分组的新文档
author
docvar公司。这可以使用
texts()
提取器功能及其
groups
论点
为了创建一个可复制的示例,我将使用内置的数据对象
data_char_sampletext
,并将其分割成句子以形成新文档,然后模拟作者docvar。
library("quanteda")
# quanteda version 1.0.0
myCorpus <- corpus(data_char_sampletext) %>%
corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <-
sample(c("author1", "author2", "author3"),
size = ndoc(myCorpus), replace = TRUE)
这将产生:
summary(myCorpus)
# Corpus consisting of 15 documents:
#
# Text Types Tokens Sentences author
# text1.1 23 23 1 author1
# text1.2 40 53 1 author2
# text1.3 48 63 1 author2
# text1.4 30 39 1 author3
# text1.5 20 25 1 author1
# text1.6 43 57 1 author3
# text1.7 13 15 1 author3
# text1.8 25 26 1 author2
# text1.9 9 9 1 author2
# text1.10 37 53 1 author1
# text1.11 32 41 1 author1
# text1.12 30 30 1 author1
# text1.13 28 35 1 author3
# text1.14 16 18 1 author2
# text1.15 32 42 1 author3
#
# Source: /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes: corpus_reshape.corpus(., to = "sentences")
现在,我们将文本提取为字符向量,通过
著者
文档变量。这将生成一个长度为3的命名字符向量,其中名称是(唯一的)作者标识符。
groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
然后(举例说明):
textplot_xray(
kwic(groupedtexts, "and"),
kwic(groupedtexts, "for")
)