One option is to use tidytext:
library(dplyr)
library(tidytext)
First, we need to get the data into a tidy shape with one word per row, so we split both the texts and the keywords this way:
all_data_un <- all_data %>% unnest_tokens(word,Text)
> all_data_un
Title word
1 Title_1 very
1.1 Title_1 interesting
1.2 Title_1 word_1
1.3 Title_1 and
1.4 Title_1 also
1.5 Title_1 word_2
1.6 Title_1 word_1
2 Title_2 hello
2.1 Title_2 word_1
2.2 Title_2 and
2.3 Title_2 word_3
3 Title_3 difficult
3.1 Title_3 word_4b
3.2 Title_3 and
3.3 Title_3 word_4a
3.4 Title_3 also
....
all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'word'
all_keyword_un
word
1 word_1
2 word_2
3 word_3
4 word_4a
4.1 word_4b
4.2 word_4c
5 a
5.1 feyenoord
5.2 sense
6 feyenoord
7 feyenoord
7.1 feyenoord
8 feyenoord
8.1 skin
9 feyenoord
9.1 collection
10 skin
10.1 feyenoord
11 feyenoord
11.1 collector
12 feyenoord
12.1 bmw
13 collection
13.1 feyenoord
....
As you can see, unnest_tokens() also strips punctuation and converts the text to lowercase.
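To make that normalisation concrete, here is a minimal sketch on a made-up one-row data frame. The names demo and demo_tokens and the sample sentence are assumptions for illustration only; they are not part of the original data.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row input, not part of the original data
demo <- data.frame(Title = "Demo",
                   Text = "Hello, World! Hello again.",
                   stringsAsFactors = FALSE)

# One row per word; punctuation is dropped and text is lowercased
demo_tokens <- demo %>% unnest_tokens(word, Text)

demo_tokens$word
# "hello" "world" "hello" "again"
```

If the original casing matters, unnest_tokens() has a to_lower argument that can be set to FALSE.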
Now you can keep only the words that appear among the keywords:
all_data_un_fi <- all_data_un[all_data_un$word %in% all_keyword_un$word,]
> all_data_un_fi
Title word
1.2 Title_1 word_1
1.5 Title_1 word_2
1.6 Title_1 word_1
2.1 Title_2 word_1
2.3 Title_2 word_3
3.1 Title_3 word_4b
3.3 Title_3 word_4a
3.5 Title_3 word_4c
4 Title_4 a
4.3 Title_4 word_1
4.5 Title_4 word_4a
4.8 Title_4 word_3
6.2 Title_6 sense
....
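The same filtering step can probably also be written with dplyr's semi_join(), which keeps the rows of the first table that have a match in the second. The sketch below uses small made-up stand-ins (tokens, keys) for all_data_un and all_keyword_un; those names and values are assumptions.

```r
library(dplyr)

# Hypothetical miniature versions of all_data_un and all_keyword_un
tokens <- data.frame(Title = c("Title_1", "Title_1", "Title_2", "Title_2"),
                     word  = c("very", "word_1", "hello", "word_3"),
                     stringsAsFactors = FALSE)
keys <- data.frame(word = c("word_1", "word_3", "word_3"),
                   stringsAsFactors = FALSE)

# Base R subsetting, as in the answer
by_in <- tokens[tokens$word %in% keys$word, ]

# dplyr equivalent: keep the rows of tokens that have a match in keys
by_semi <- semi_join(tokens, keys, by = "word")
```

Both versions keep each matching row once, even though keys contains word_3 twice.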
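Finally, join the matched words back onto the original data and collapse them into one comma-separated string per title:
all_data %>%
  left_join(all_data_un_fi) %>%
  group_by(Title, Text) %>%
  summarise(keywords = paste(word, collapse = ','))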
Joining, by = "Title"
Title Text keywords
<chr> <chr> <chr>
1 Title_1 Very interesting word_1 and also word_2, word_1. word_1,word_2,word_1
2 Title_2 hello word_1, and word_3. word_1,word_3
3 Title_3 difficult! word_4b, and word_4a also word_4c word_4b,word_4a,word_4c
4 Title_4 A bit of word_1, some word_4a, and mostly word_3 a,word_1,word_4a,word_3
5 Title_5 nothing interesting here NA
6 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) use~ sense,feyenoord,feyenoord,f~
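In the result above, repeated matches are kept (word_1 appears twice for Title_1), and a title without any match ends up with NA, which as far as I can tell is the string "NA" that paste() produces from a missing value. If that is not wanted, one possible variation is sketched below on a made-up table (joined is a hypothetical stand-in for the output of the left_join()).

```r
library(dplyr)

# Hypothetical miniature of the joined table: one row per matched word,
# NA where a title had no matching keyword
joined <- data.frame(Title = c("Title_1", "Title_1", "Title_1", "Title_5"),
                     word  = c("word_1", "word_2", "word_1", NA),
                     stringsAsFactors = FALSE)

# Drop missing words and duplicates before collapsing
collapsed <- joined %>%
  group_by(Title) %>%
  summarise(keywords = paste(unique(word[!is.na(word)]), collapse = ","))

collapsed$keywords
# "word_1,word_2" ""
```

Titles without a match then get an empty string instead of "NA".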
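If instead a keyword can consist of one or more words, so that the keyword "old grandma" should only match the phrase "old grandma", you can do it this way: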
library(stringr)
library(dplyr)
First, create an empty list:
mylist <- list()
Then fill it with a loop that, for each keyword, finds the texts containing that keyword:
for (i in keywords$keywords) {
  keyworded <- all_data %>% filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded
}
Put the result into a data.frame:
df <- do.call("rbind",mylist)%>%data.frame()
df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))
Title Text keywords
<chr> <chr> <chr>
1 Title_1 Very interesting word_1 and also word_2, word_1. word_1,word_2
2 Title_2 hello word_1, and word_3. word_1,word_3
3 Title_4 A bit of word_1, some word_4a, and mostly word_3 word_1,word_3
4 Title_6 Hey that sense feyenoord and are capable of pro~ feyenoord,bmw,sense feye~
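The loop can probably also be written without growing a list by hand, using lapply() and dplyr's bind_rows(). This is only a sketch on made-up inputs (texts and keys are hypothetical), and it swaps in fixed() so that each keyword is matched literally rather than as a regular expression, which differs slightly from the loop above.

```r
library(dplyr)
library(stringr)

# Hypothetical miniature inputs
texts <- data.frame(Title = c("T1", "T2"),
                    Text  = c("old grandma sat down", "a grandma and an old dog"),
                    stringsAsFactors = FALSE)
keys <- c("old grandma", "grandma", "old")

# One filtered data frame per keyword, stacked into a single table
matches <- bind_rows(lapply(unique(keys), function(k) {
  texts %>% filter(str_detect(Text, fixed(k))) %>% mutate(keyword = k)
}))

result <- matches %>%
  group_by(Title, Text) %>%
  summarise(keywords = paste(keyword, collapse = ","), .groups = "drop")

result$keywords
# "old grandma,grandma,old" "grandma,old"
```

The phrase "old grandma" is only reported for T1, where the two words are adjacent.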
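Note: in the result of the loop approach, repeated matches such as the two occurrences of word_1 in the first sentence are reported only once, whereas word_4a is not matched at all, because it only exists as part of the multi-word keyword "word_4a word_4b word_4c".
The data used (note that I modified keywords, adding "sense feyenoord" as a test):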
all_data <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"),
Text=c("Very interesting word_1 and also word_2, word_1.",
"hello word_1, and word_3.",
"difficult! word_4b, and word_4a also word_4c",
"A bit of word_1, some word_4a, and mostly word_3",
"nothing interesting here",
"Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."),
stringsAsFactors=F)
keywords<-data.frame(keywords = c("word_1","word_2","word_3","word_4a word_4b word_4c",
"a feyenoord sense",
"feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection",
"skin feyenoord", "feyenoord collector", "feyenoord bmw",
"collection feyenoord", "concentration feyenoord", "feyenoord sample",
"feyenoord stimulation", "analyte feyenoord", "collect feyenoord",
"feyenoord collect", "pathway feyenoord feyenoord sandboxs",
"feyenoord bmw mouses", "sandbox", "bmw",
"pulse bmw three levels","sense feyenoord"), stringsAsFactors=F)
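You can also mix the two approaches: get both results, then collapse them together or build combinations of them.
EDIT:
There are many ways to merge them; a simple one is the following, which also yields unique keywords: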
# first we create all the "single" keywords, i.e. "old grandma" -> "old" and "grandma"
all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'keywords' # rename the column
# then you bind them to the full keywords, i.e. "old" "grandma" and "old grandma" together
keywords <- rbind(keywords, all_keyword_un)
# lastly the second way for each keyword
mylist <- list()
for (i in keywords$keywords) {
  keyworded <- all_data %>% filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded
}
df <- do.call("rbind",mylist)%>%data.frame()
df <- df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))
# A tibble: 5 x 3
# Groups: Title [?]
Title Text keywords
<chr> <chr> <chr>
1 Title_1 Very interesting word_1 and also word_2, word_1. word_1,word_2~
2 Title_2 hello word_1, and word_3. word_1,word_3~
3 Title_3 difficult! word_4b, and word_4a also word_4c word_4a,word_~
4 Title_4 A bit of word_1, some word_4a, and mostly word_3 word_1,word_3~
5 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one~ feyenoord,bmw~
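One caveat with the combined approach: str_detect() treats each keyword as a regular expression and matches it anywhere in the text, so a short single-word keyword such as "a" also matches inside longer words. A possible workaround is to wrap the keyword in word boundaries. The helper detect_whole_word() below is hypothetical, and the sketch assumes the keywords contain no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: match a keyword only as a whole word or phrase.
# Assumes the keyword contains no regex metacharacters.
detect_whole_word <- function(text, keyword) {
  str_detect(text, paste0("\\b", keyword, "\\b"))
}

text <- "hello word_1, and word_3."

plain_a  <- str_detect(text, "a")               # TRUE: "a" occurs inside "and"
whole_a  <- detect_whole_word(text, "a")        # FALSE: no standalone "a"
whole_w1 <- detect_whole_word(text, "word_1")   # TRUE: "word_1" is a whole word
```

To use it in the loop, str_detect(Text, i) could be replaced by detect_whole_word(Text, i).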