Quanteda textplot_xray: using a non-unique docvar as the document label

quanteda lexical corpus plot r

Colin · 7 years ago

I have a quanteda corpus of 10 documents, several of which were written by the same author. I store the author in a separate docvar column: myCorpus$documents[,"author"]

> docvars(myCorpus)

          author   
206035    author1   
269823    author2   
304225    author1   
422364    author2
<...snip..>

I am drawing a lexical dispersion plot with textplot_xray:

textplot_xray(
    kwic(myCorpus, "image"),
    kwic(myCorpus, "one"),
    kwic(myCorpus, "like"),
    kwic(myCorpus, "time"),
    kwic(myCorpus, "just"),
    scale = "absolute"
)

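How can I use myCorpus$documents[,"author"] as the document identifier instead of the document ID?

I am not trying to group the documents; I only want each document to be labelled by its author. I realize that document IDs must be unique, so I cannot simply rename them with docnames(myCorpus) <-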

1 answer

Ken Benoit · 7 years ago

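The textplot takes its document names from the docnames of the corpus. In this case, you want to create new documents grouped by the author docvar. This can be done with the texts() extractor function and its groups argument.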

To create a reproducible example, I will use the built-in data object data_char_sampletext, split it into sentences to form new documents, and then simulate an author docvar.

library("quanteda")
# quanteda version 1.0.0

myCorpus <- corpus(data_char_sampletext) %>% 
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <- 
    sample(c("author1", "author2", "author3"), 
           size = ndoc(myCorpus), replace = TRUE)

This yields:

summary(myCorpus)
# Corpus consisting of 15 documents:
#     
#     Text Types Tokens Sentences  author
#  text1.1    23     23         1 author1
#  text1.2    40     53         1 author2
#  text1.3    48     63         1 author2
#  text1.4    30     39         1 author3
#  text1.5    20     25         1 author1
#  text1.6    43     57         1 author3
#  text1.7    13     15         1 author3
#  text1.8    25     26         1 author2
#  text1.9     9      9         1 author2
# text1.10    37     53         1 author1
# text1.11    32     41         1 author1
# text1.12    30     30         1 author1
# text1.13    28     35         1 author3
# text1.14    16     18         1 author2
# text1.15    32     42         1 author3
# 
# Source:  /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes:   corpus_reshape.corpus(., to = "sentences")

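Now we extract the texts as a character vector, grouped by the author document variable. This produces a named character vector of length 3, whose names are the (unique) author identifiers.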

groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
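
To make the grouping step concrete, here is a minimal base-R sketch of the idea: texts that share an author are concatenated into a single string, and the author label becomes the name. The toy texts, the `txt` and `author` vectors, and the single-space separator are all assumptions for illustration; this is not quanteda's actual implementation.

```r
# Hypothetical toy data standing in for the corpus texts and the author docvar
txt <- c(d1 = "An image of one.",
         d2 = "Just like time.",
         d3 = "One more image.",
         d4 = "Time after time.")
author <- c("author1", "author2", "author1", "author2")

# Split the texts by author, then paste each group into one string.
# The single-space separator is an assumption made for this sketch.
grouped <- vapply(split(txt, author), paste, character(1), collapse = " ")

length(grouped)  # one element per distinct author
names(grouped)   # the author labels, now unique
```

Because the names are now unique, they can serve as document names, which is why the grouped vector works as input for the plot.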

Then, to plot (the keywords here are only illustrative):

textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
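
The code above targets quanteda 1.0.0. As far as I am aware, later releases (version 3 onward) deprecated texts() with groups, moved the plotting functions into the separate quanteda.textplots package, and expect kwic() to be given a tokens object. Under those assumptions, a roughly equivalent sketch uses corpus_group(); the author assignment here is a deterministic stand-in for the sampled docvar, not the original data.

```r
library("quanteda")
library("quanteda.textplots")  # assumption: textplot_xray() lives here in quanteda >= 3

corp <- corpus_reshape(corpus(data_char_sampletext), to = "sentences")

# Deterministic stand-in for the simulated author docvar
docvars(corp, "author") <- rep_len(c("author1", "author2", "author3"), ndoc(corp))

# Combine documents by author; the author labels become the docnames
grouped <- corpus_group(corp, groups = docvars(corp, "author"))
toks <- tokens(grouped)

textplot_xray(
    kwic(toks, pattern = "and"),
    kwic(toks, pattern = "for")
)
```

Check the documentation for your installed version before relying on this, since function locations and signatures have shifted between major releases.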