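Question 1: Why are the feature names in the dfm so long?
Answer: Because applying the dictionary in dfm() replaces every token that matches a dictionary value with that value's key, and the keys in lut_dict are long phrases. For example: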
lut_dict[70:72]
# Dictionary object with 3 key entries.
# - assault felony:
# - asf
# - assault misdemeanor:
# - asm
# - assault no weapon aggravated injury:
# - anai
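Question 2: In the reproducible example, why have almost all of the words disappeared?
Answer: Because the only dictionary value that matches a feature in the dfm belongs to the "etc." category, and with the dictionary argument every feature that matches no dictionary value is dropped.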
corpus_dfm2 <- dfm(tokens(example_text), # where corpus1M is already a corpus via quanteda::corpus()
                   remove = stopwords("english"),
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   dictionary = lut_dict,
                   ngrams = 1:2,
                   stem = TRUE, verbose = TRUE)
corpus_dfm2
# Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
# 3 x 1 sparse Matrix of class "dfmSparse"
# features
# docs etc.
# text1 0
# text2 0
# text3 1
lut_dict["etc."]
# Dictionary object with 1 key entry.
# - etc.:
# - etc
dfm(tokens(example_text), # the "tokens" is not necessary here
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    ngrams = 1:2,
    stem = TRUE)
# Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
# 3 x 18 sparse Matrix of class "dfmSparse"
# features
# docs quick brown fox the_quick quick_brown brown_fox like carrot i_like
# text1 1 1 1 1 1 1 0 0 0
# text2 0 0 0 0 0 0 1 1 1
# text3 0 0 0 0 0 0 0 0 0
# features
# docs like_carrot etc cat dog the_there there_that that_etc etc_cat cat_dog
# text1 0 0 0 0 0 0 0 0 0
# text2 1 0 0 0 0 0 0 0 0
# text3 0 1 1 1 1 1 1 1 1
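If you want to keep the unmatched features, replace the dictionary argument with thesaurus. Matched features are then replaced by the uppercased key (ETC. in the output below), and all other features are kept as they are: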
dfm(tokens(example_text),
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    thesaurus = lut_dict,
    ngrams = 1:2,
    stem = TRUE)
# Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
# 3 x 18 sparse Matrix of class "dfmSparse"
# features
# docs quick brown fox the_quick quick_brown brown_fox like carrot i_like
# text1 1 1 1 1 1 1 0 0 0
# text2 0 0 0 0 0 0 1 1 1
# text3 0 0 0 0 0 0 0 0 0
# features
# docs like_carrot cat dog the_there there_that that_etc etc_cat cat_dog ETC.
# text1 0 0 0 0 0 0 0 0 0
# text2 1 0 0 0 0 0 0 0 0
# text3 0 1 1 1 1 1 1 1 1