How to build a corpus of hashtags (text mining)

topic-modeling corpus text-mining r

BARIK FATI · 7 years ago

I am trying to analyze Twitter data by mining all of its hashtags. I want to put every hashtag into a corpus and map that corpus to a list of words. Any idea how I can approach this?

This is the code I am using, but the resulting DTM (document-term matrix) is reported as 100% sparse:

step1 <- strsplit(newFile$Hashtag, "#")
step2 <- lapply(step1, tail, -1)
result <- lapply(step2, function(x){
sapply(strsplit(x, " "), head, 1)
})
result2<-do.call(c, unlist(result, recursive=FALSE))
myCorpus <- tm::Corpus(VectorSource(result2)) # create a corpus

Here is the information about my corpus:

myCorpus
  <<SimpleCorpus>>
 Metadata:  corpus specific: 1, document level (indexed): 0
 Content:  documents: 12635

And my DTM:

<<DocumentTermMatrix (documents: 12635, terms: 6280)>>
Non-/sparse entries: 12285/79335515
Sparsity           : 100%
Maximal term length: 36
Weighting          : term frequency (tf)
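
The 100% figure follows from the numbers in the summary above. Because result2 flattens every hashtag into its own document, each row of the matrix holds roughly one nonzero entry out of 6280 columns. A quick check in base R, using only the values printed by tm (the variable names are illustrative):

```r
# Values copied from the DocumentTermMatrix summary above
n_docs     <- 12635
n_terms    <- 6280
non_sparse <- 12285

cells    <- n_docs * n_terms      # total cells in the matrix
sparse   <- cells - non_sparse    # cells equal to zero
sparsity <- sparse / cells        # fraction of zero cells

sparse                  # 79335515, matching the printed sparse count
round(100 * sparsity)   # 100, which is what tm reports
non_sparse / n_docs     # average nonzero entries per document (just under 1)
```

So the matrix is not broken as such: with one hashtag per document, almost every cell has to be zero.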

1 reply | last activity 7 years ago

Tito Sanz · 7 years ago

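Your problem is that you are splitting the text with strsplit. Try str_extract_all from the stringr package instead:

library(stringr)  # provides str_extract_all
str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+")

This returns the following list: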
[[1]]
[1] "#hello"    "#I"        "#am"       "#a"        "#buch"     "#of"      
[7] "#hashtags"

If you want the result as a character matrix (which is easy to convert to a data frame), use simplify = TRUE:

str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+", simplify = TRUE)

Result:

     [,1]     [,2] [,3]  [,4] [,5]    [,6]  [,7]       
[1,] "#hello" "#I" "#am" "#a" "#buch" "#of" "#hashtags"
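
To carry this over to a whole column of tweets, one option is to keep one document per tweet rather than one per hashtag. The sketch below is only an illustration under some assumptions: the tweets vector is made-up sample data standing in for newFile$Hashtag, and it uses base R's regmatches/gregexpr in place of str_extract_all so that it runs without extra packages. The same "#\\S+" pattern is used.

```r
# Hypothetical sample data standing in for newFile$Hashtag
tweets <- c("Loving #rstats and #textmining",
            "no tags here",
            "#rstats again #tm")

# Extract every "#..." token per tweet (base-R counterpart of str_extract_all)
tags <- regmatches(tweets, gregexpr("#\\S+", tweets))

# One document per tweet: drop the leading "#" and join that tweet's tags
docs <- vapply(tags,
               function(x) paste(sub("^#", "", x), collapse = " "),
               character(1))

# Tweet-by-hashtag count matrix, keeping tweets that have no hashtags
tweet_id <- rep(seq_along(tags), lengths(tags))
all_tags <- factor(sub("^#", "", unlist(tags)))
counts   <- table(factor(tweet_id, levels = seq_along(tweets)), all_tags)

docs
counts
```

The docs vector is what could then be passed to tm::Corpus(VectorSource(docs)) in place of result2. The matrix will still be sparse, since a tweet only carries a few hashtags, but each row then describes a tweet instead of a single tag.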
