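The function you want is tokens_lookup(), used with a dictionary that defines the canonical party labels as keys and lists all the n-gram variants of each party name as values. Setting exclusive = FALSE keeps the tokens that are not matched, so the lookup in effect substitutes the canonical party label for every variant.
In the example below, I have modified your input text slightly to illustrate how the party names are combined so that they remain distinct from phrases that use "liberal" but not "liberal party".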
library("quanteda")

# Input text, lightly modified from the question
text <- c('a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people')
toks <- tokens(text)

# Keys are the canonical party labels; values are the multi-word variants
partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))
# exclusive = FALSE keeps the tokens that do not match any party name
(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
## [1] "a" "text" "about" "some" "political" "parties"
## [7] "called" "the" "NDP" "which" "is" "conservative"
## [13] "the" "NDP" "and" "the" "OLP" "and"
## [19] "OLP" "which" "are" "liberal" "helping" "poor"
## [25] "people"
So this has replaced the variants of the party names with the party keys.
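For contrast, here is a minimal sketch of what the exclusive argument changes, using a shortened input sentence that I made up for illustration. The names only_keys and in_place are mine, not part of quanteda.

```r
library("quanteda")

# Shortened, made-up input for illustration only
toks <- tokens("the new democrats and the liberal party")
partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))

# exclusive = TRUE (the default): only the matched keys are returned
only_keys <- tokens_lookup(toks, partydict)

# exclusive = FALSE: unmatched tokens are kept, matches are replaced by keys
in_place <- tokens_lookup(toks, partydict, exclusive = FALSE)

as.list(only_keys)
as.list(in_place)
```

The keys come back in lower case in the first call and in upper case in the second because capkeys defaults to !exclusive.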
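# Sentiment dictionary, applied after the party names have been replaced
sentdict <- dictionary(list(
    left = c("liberal", "left"),
    right = "conservative"
))
# Keep non-matching features alongside the LEFT and RIGHT counts
dfm_lookup(dfm(toks2), dictionary = sentdict, exclusive = FALSE)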
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
## features
## docs olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
## text1 2 2 1 1 1 1 1 1 1 3 2 1 1 2 1 1 1
## features
## docs poor people
## text1 1 1
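To see why the party names are replaced first, the sketch below compares the sentiment counts with and without that step. It reuses the text and dictionaries from this answer; the names direct, staged, left_direct and left_staged are mine, and I use the default exclusive = TRUE in dfm_lookup() so that the feature names stay in lower case.

```r
library("quanteda")

text <- "a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people"
toks <- tokens(text)

partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))
sentdict <- dictionary(list(
    left = c("liberal", "left"),
    right = "conservative"
))

# Without the party step, "liberal" inside "liberal party" is counted as left
direct <- dfm_lookup(dfm(toks), dictionary = sentdict)

# With the party step, only the free-standing "liberal" remains to be counted
toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE)
staged <- dfm_lookup(dfm(toks2), dictionary = sentdict)

left_direct <- as.matrix(direct)[1, "left"]
left_staged <- as.matrix(staged)[1, "left"]
c(direct = left_direct, staged = left_staged)
```

The left count drops from 2 to 1 once "liberal party" has been folded into the olp key, while the right count is 1 in both cases.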
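Two other notes:
- If you do not want the keys capitalised in the replaced tokens, set capkeys = FALSE.
- Different matching types are available through the valuetype argument, including valuetype = "regex". (Note that the regular expression in your example is probably not correctly formed, because the | operator in the ndp example will match either "new democratic" or "new", followed by "democrats". But with tokens_lookup()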