One way to embed the documents into the same space is to learn the vocabulary from both columns:
library(text2vec)
preproc_func = tolower
token_func = word_tokenizer
union_txt = c(Train_PRDHA_String.df$MAKTX_Keyword, Train_PRDHA_String.df$PH_Level_04_Keyword)
it_train = itoken(union_txt,
                  preprocessor = preproc_func,
                  tokenizer = token_func,
                  progressbar = TRUE)  # ids omitted: union_txt is twice as long as the ID column
vocab = create_vocabulary(it_train)  # one vocabulary shared by both columns
vectorizer = vocab_vectorizer(vocab)
it1 = itoken(Train_PRDHA_String.df$MAKTX_Keyword, preproc_func,
             token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_1 = create_dtm(it1, vectorizer)
it2 = itoken(Train_PRDHA_String.df$PH_Level_04_Keyword, preproc_func,
             token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_2 = create_dtm(it2, vectorizer)
Then you can combine them into a single matrix:
dtm_train = cbind(dtm_train_1, dtm_train_2)
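To see what the combined matrix looks like, here is a minimal sketch on made-up data. The data frame `toy`, its columns `col_1` and `col_2`, and the helper `make_dtm` are hypothetical stand-ins, not part of the original data. Because both document-term matrices are built with the same vectorizer, their columns line up, and `cbind` yields a matrix with twice as many columns as the vocabulary has terms.

```r
library(text2vec)
library(Matrix)

# Hypothetical toy data standing in for Train_PRDHA_String.df
toy = data.frame(
  ID    = c("a", "b", "c"),
  col_1 = c("Steel Bolt M8", "Copper Pipe", "Rubber Seal"),
  col_2 = c("bolts steel", "pipe fittings", "seal kits"),
  stringsAsFactors = FALSE
)

# Learn a single vocabulary from the union of both columns
it_all = itoken(c(toy$col_1, toy$col_2), preprocessor = tolower,
                tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer = vocab_vectorizer(create_vocabulary(it_all))

# Hypothetical helper: vectorize one column with the shared vectorizer
make_dtm = function(txt) {
  it = itoken(txt, preprocessor = tolower, tokenizer = word_tokenizer,
              ids = toy$ID, progressbar = FALSE)
  create_dtm(it, vectorizer)
}
dtm_1 = make_dtm(toy$col_1)
dtm_2 = make_dtm(toy$col_2)

# Both matrices share the same columns, in the same order
stopifnot(identical(colnames(dtm_1), colnames(dtm_2)))

# One row per document, one block of columns per source field
dtm = cbind(dtm_1, dtm_2)
dim(dtm)
```

Note that `cbind` keeps the two fields in separate column blocks, so the same word in `col_1` and `col_2` ends up as two different features.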
You could also try char_tokenizer with ngram > 1 (say, ngram = c(3, 3)). Also check out the great stringdist package.
I suppose you received Result
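As a rough illustration of the two suggestions above, the sketch below compares made-up strings pairwise, first with character trigrams in text2vec and then with stringdist. The vectors `a` and `b` and the helper `to_dtm` are hypothetical. In text2vec, `ngram` is an argument of `create_vocabulary`. The choice of cosine similarity and of the Jaro-Winkler method (`"jw"`) is an assumption made for the example, not something the answer prescribes.

```r
library(text2vec)
library(stringdist)

# Hypothetical strings: pair 1 is a near-duplicate, pair 2 is unrelated
a = c("steel bolt m8", "copper pipe 15mm")
b = c("steel bolts m8", "plastic tube")

# Character-level tokens, combined into trigrams by the vocabulary
it_all = itoken(c(a, b), preprocessor = tolower,
                tokenizer = char_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it_all, ngram = c(3L, 3L))
vectorizer = vocab_vectorizer(vocab)

# Hypothetical helper: vectorize strings with the shared trigram vocabulary
to_dtm = function(txt) {
  it = itoken(txt, preprocessor = tolower,
              tokenizer = char_tokenizer, progressbar = FALSE)
  create_dtm(it, vectorizer)
}

# Row-wise cosine similarity between a[i] and b[i]
sim_tri = psim2(to_dtm(a), to_dtm(b), method = "cosine", norm = "l2")

# Pairwise string similarity in [0, 1] from stringdist
sim_jw = stringsim(a, b, method = "jw")

print(sim_tri)
print(sim_jw)
```

Both scores should rank the near-duplicate pair above the unrelated one. Any threshold for calling two strings duplicates would need to be tuned on real data.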