
Extract multiple keywords from text and print them in a data frame

dplyr text regex r

R overflow · Tech community · 6 years ago

I have a data frame (called all_data) that looks like this:

Title         Text 
Title_1       Very interesting word_1 and also keyword_2
Title_2       hello keyword_1, and keyword_3.

I also have a second data frame (called keywords):

keywords
word_1
word_2
word_3
word_4a word_4b word_4c

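I want to create a new column, Keywords, in all_data. In this column, if one of the keywords (from the keywords data frame) occurs in all_data$Text or all_data$Title, I want the matching keyword printed. For example: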

Title         Text                                               Keywords
Title_1       Very interesting word_1 and also word_2, word_1.   word_1, word_2
Title_2       hello word_1, and word_3.                          word_1, word_3
Title_3       difficult! word_4b, and word_4a also word_4c       word_4a word_4b word_4c

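Note: each keyword should be printed only once in the new column, not multiple times. The hardest part for me is a multi-word keyword such as "keyword_A keyword_A1 keyword_A3": it should appear only when all of its parts occur in the relevant text.

Following a related question (Recognize patterns in column, and add them to column in Data frame), I used the solution from there: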

ls <- strsplit(tolower(paste(all_data$Title, all_data$Text)),"(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)    

all_data$Keywords <- do.call("rbind",lapply(ls,function(x) paste(unique(x[x %in% tolower(keywords)]), collapse = ", ")))

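Update

@Nicolas2 helped me with a solution (thank you). Unfortunately, it fails. Does anyone know how to fix this? As you can see in the example below, the keyword "feyenoord skin" should not appear (because "skin" does not occur in the text). I only want a keyword to appear if it occurs in the text. For a multi-word keyword such as "hello world", it would be great if it appeared only when all of its words occur in the text (so both hello and world). Thanks a lot!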

df <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 


keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels"),stringsAsFactors=F) 

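library(dplyr)    # %>%, mutate, group_by, filter, ...
library(tidyr)    # nest, unnest
library(stringr)  # str_split

# split the keywords into words, but remember keyword length 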
k <- keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest %>% 
  group_by(Keyword) %>% mutate(n=n()) %>% ungroup 
# split the title into words 
# compare with words from keywords 
# keep only possibly multiple, but full matches 
# collate all results and merge back to the original data 
test <- df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>% 
  inner_join(k,by="l") %>% 
  group_by(Title,Keyword) %>% filter(n()%%n==0) %>% 
  distinct(Keyword) %>% ungroup %>% nest(Keyword) %>% 
  rowwise %>% mutate(keywords=paste(data[[1]],collapse=", ")) %>% select(-data) %>% 
  inner_join(df,.,by="Title") 

View(test)

4 answers | up to 6 years ago

eddi 6 years ago

I haven't optimized anything; this just does the most straightforward thing:

library(data.table)

setDT(df)
setDT(keywords)

keywords[, strsplit(Keyword, ' '), by = Keyword
       ][, c(.SD[, .(row = seq_len(nrow(df)), found = grepl(V1, df$Text)), by = V1],
             N = .N), by = Keyword
       ][, sum(found) == N[1], by = .(Keyword, row)
       ][, paste(Keyword[V1], collapse = ","), by = row]
#   row                                            V1
#1:   1                                 word_1,word_2
#2:   2                                 word_1,word_3
#3:   3                       word_4a word_4b word_4c
#4:   4                                 word_1,word_3
#5:   5                                              
#6:   6 a feyenoord sense,feyenoord,feyenoord bmw,bmw
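
One thing to keep in mind with grepl() as used above: the pattern matches substrings, so a short part such as "a" is also found inside "that". If whole-word matching is wanted, a word boundary (\\b) can be added. The sketch below is only an illustration on a made-up sentence; has_all_parts() is a hypothetical helper, not part of the answer, and it assumes the keyword parts contain no regex metacharacters.

```r
text <- "Hey that sense feyenoord"

# Plain pattern: "a" is found inside "that".
substring_hit <- grepl("a", text)

# \\b marks a word boundary, so "a" must stand alone as a word.
whole_word_hit <- grepl("\\ba\\b", text, perl = TRUE)

# Hypothetical helper: TRUE only when every space-separated part of
# `keyword` occurs in `text` as a whole word.
has_all_parts <- function(keyword, text) {
  parts <- strsplit(keyword, " ", fixed = TRUE)[[1]]
  all(vapply(parts,
             function(p) grepl(paste0("\\b", p, "\\b"), text, perl = TRUE),
             logical(1)))
}

both_present <- has_all_parts("sense feyenoord", text)
skin_missing <- has_all_parts("feyenoord skin", text)
```

Here "feyenoord skin" is rejected because "skin" never occurs, which is the behaviour asked for in the question.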

s__ 6 years ago

You can try the tidytext package:

library(dplyr)     
library(tidytext)  # text manipulation

First we need to reshape the data so that each word is one row, so we split both all_data and the keywords this way:

all_data_un <- all_data %>% unnest_tokens(word,Text)
    > all_data_un
       Title        word
1    Title_1        very
1.1  Title_1 interesting
1.2  Title_1      word_1
1.3  Title_1         and
1.4  Title_1        also
1.5  Title_1      word_2
1.6  Title_1      word_1
2    Title_2       hello
2.1  Title_2      word_1
2.2  Title_2         and
2.3  Title_2      word_3
3    Title_3   difficult
3.1  Title_3     word_4b
3.2  Title_3         and
3.3  Title_3     word_4a
3.4  Title_3        also
....

all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'word'                   # rename the column
 all_keyword_un
              word
1           word_1
2           word_2
3           word_3
4          word_4a
4.1        word_4b
4.2        word_4c
5                a
5.1      feyenoord
5.2          sense
6        feyenoord
7        feyenoord
7.1      feyenoord
8        feyenoord
8.1           skin
9        feyenoord
9.1     collection
10            skin
10.1     feyenoord
11       feyenoord
11.1     collector
12       feyenoord
12.1           bmw
13      collection
13.1     feyenoord
....

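As you can see, unnest_tokens() strips punctuation and converts the words to lowercase by default.

Now you can keep only the words that appear among the keywords: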
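
To see that behaviour in isolation, here is a minimal sketch on a made-up one-row data frame (the demo data is an assumption for illustration, not part of the original data). It relies on the default word tokenizer.

```r
library(dplyr)
library(tidytext)

# Made-up input with mixed case and punctuation.
demo <- data.frame(Title = "T1",
                   Text  = "Hello, WORLD! word_1.",
                   stringsAsFactors = FALSE)

# One row per word; punctuation is dropped and words are lower-cased.
tokens <- demo %>% unnest_tokens(word, Text)
tokens$word
```

If the original case matters, unnest_tokens() has a to_lower argument that can be set to FALSE.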

all_data_un_fi <- all_data_un[all_data_un$word %in% all_keyword_un$word,]
      > all_data_un_fi
       Title      word
1.2  Title_1    word_1
1.5  Title_1    word_2
1.6  Title_1    word_1
2.1  Title_2    word_1
2.3  Title_2    word_3
3.1  Title_3   word_4b
3.3  Title_3   word_4a
3.5  Title_3   word_4c
4    Title_4         a
4.3  Title_4    word_1
4.5  Title_4   word_4a
4.8  Title_4    word_3
6.2  Title_6     sense 
....

all_data %>%                                      # starting data
left_join(all_data_un_fi) %>%                     # left join so that no sentence is lost
group_by(Title,Text) %>%                          # group by title and text
summarise(keywords = paste(word, collapse =','))  # put all the keywords found into one cell


   Joining, by = "Title"
# A tibble: 6 x 3
# Groups:   Title [?]
  Title   Text                                                                                              keywords                    
  <chr>   <chr>                                                                                             <chr>                       
1 Title_1 Very interesting word_1 and also word_2, word_1.                                                  word_1,word_2,word_1        
2 Title_2 hello word_1, and word_3.                                                                         word_1,word_3               
3 Title_3 difficult! word_4b, and word_4a also word_4c                                                      word_4b,word_4a,word_4c     
4 Title_4 A bit of word_1, some word_4a, and mostly word_3                                                  a,word_1,word_4a,word_3     
5 Title_5 nothing interesting here                                                                          NA                          
6 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) use~ sense,feyenoord,feyenoord,f~
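
The summary above repeats word_1 for Title_1. If each keyword should be listed only once, one option is to wrap the words in unique() before collapsing. A minimal sketch on made-up rows shaped like all_data_un_fi (the `matched` data frame is an assumption for illustration):

```r
library(dplyr)

# Made-up rows: one matched word per row, with a repeat for Title_1.
matched <- data.frame(
  Title = c("Title_1", "Title_1", "Title_1", "Title_2", "Title_2"),
  word  = c("word_1", "word_2", "word_1", "word_1", "word_3"),
  stringsAsFactors = FALSE
)

# unique() drops the repeated word before the words are pasted together.
once <- matched %>%
  group_by(Title) %>%
  summarise(keywords = paste(unique(word), collapse = ","))

once$keywords
```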

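If instead a keyword made of one or more words has to match as a whole, so that the keyword "old grandma" only matches the phrase "old grandma", you can do this: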

library(stringr)
library(dplyr)

First, an empty list:

mylist <- list()

Then fill it with a loop that, for each keyword, finds the sentences containing that keyword:

for (i in keywords$keywords) {
keyworded <- all_data %>%filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded}

Put it into a data.frame:

 df <- do.call("rbind",mylist)%>%data.frame()

 df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))

# A tibble: 4 x 3
# Groups:   Title [?]
  Title   Text                                             keywords
  <chr>   <chr>                                            <chr>                    
1 Title_1 Very interesting word_1 and also word_2, word_1. word_1,word_2            
2 Title_2 hello word_1, and word_3.                        word_1,word_3            
3 Title_4 A bit of word_1, some word_4a, and mostly word_3 word_1,word_3            
4 Title_6 Hey that sense feyenoord and are capable of pro~ feyenoord,bmw,sense feye~

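Note: repeated matches, as in the first sentence, are removed, while word_4a on its own is not reported because it is only part of a longer keyword.

This uses the following data (note that I modified the keywords, adding "sense feyenoord" as a test, and that the column is named keywords):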
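
One caveat with the loop above: str_detect() treats each keyword as a regular expression. The keywords in this example contain no special characters, but if yours include characters such as ( or ., wrapping the pattern in fixed() makes the comparison literal. A small sketch on a made-up sentence:

```r
library(stringr)

text <- "The text (800) uses at least one help(430)."

# As a regex, the parentheses form a group, so this looks for "help430".
as_regex <- str_detect(text, "help(430)")

# fixed() compares the characters literally, parentheses included.
as_fixed <- str_detect(text, fixed("help(430)"))
```

In the loop this would mean writing str_detect(Text, fixed(i)) instead of str_detect(Text, i).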

   all_data <-  data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 

keywords<-data.frame(keywords = c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels","sense feyenoord"), stringsAsFactors=F)

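You can also mix the two approaches: get both results, then collapse them together or build a combination of them.

EDIT:
There are many ways to merge them; a simple one is the following, which also makes the keywords unique: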

# first we create all the "single" keywords, i e "old grandma" -> "old" and "grandma"
all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'keywords'                   # rename the column

# then you bind them to the full keywords, i.e. "old" "grandma" and "old grandma" together
keywords <- rbind(keywords, all_keyword_un)

# lastly the second way for each keyword
mylist <- list()
for (i in keywords$keywords) {
  keyworded <- all_data %>%filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded}

df <- do.call("rbind",mylist)%>%data.frame()
df <- df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))

# A tibble: 5 x 3
# Groups:   Title [?]
  Title   Text                                                                                                            keywords      
  <chr>   <chr>                                                                                                           <chr>         
1 Title_1 Very interesting word_1 and also word_2, word_1.                                                                word_1,word_2~
2 Title_2 hello word_1, and word_3.                                                                                       word_1,word_3~
3 Title_3 difficult! word_4b, and word_4a also word_4c                                                                    word_4a,word_~
4 Title_4 A bit of word_1, some word_4a, and mostly word_3                                                                word_1,word_3~
5 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one~ feyenoord,bmw~

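Nicolas2 6 years ago

library(dplyr)    # %>%, mutate, inner_join, distinct, ...
library(tidyr)    # nest, unnest
library(stringr)  # str_split

df <- data.frame(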
   Title=c("Title_1","Title_2","Title_3","Title_4"),
   Text=c("Very interesting word_1 and also word_2, word_1.",
          "hello word_1, and word_3.",                     
          "difficult! word_4b, and word_4a also word_4c",
          "nothing interesting here"),stringsAsFactors=FALSE)

keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c"),stringsAsFactors=F)

df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
  inner_join(keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest, by="l") %>%
  select(-Keyword) %>% distinct %>% nest(l)
#    Title                                             Text                      data
#1 Title_1 Very interesting word_1 and also word_2, word_1.            word_1, word_2
#2 Title_2                        hello word_1, and word_3.            word_1, word_3
#3 Title_3     difficult! word_4b, and word_4a also word_4c word_4b, word_4a, word_4c

The result is stored in a list column. To turn it into a string:

df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
  inner_join(keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest,by="l") %>%
  select(-Keyword) %>% distinct %>% arrange(l) %>% nest(l) %>%
  rowwise %>% mutate(keywords=paste(data[[1]],collapse=" ")) %>% select(-data)
## A tibble: 3 x 3
#  Title   Text                                             keywords               
#  <chr>   <chr>                                            <chr>                  
#1 Title_1 Very interesting word_1 and also word_2, word_1. word_1 word_2          
#2 Title_2 hello word_1, and word_3.                        word_1 word_3          
#3 Title_3 difficult! word_4b, and word_4a also word_4c     word_4a word_4b word_4c

df <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5"),
Text=c("Very interesting word_1 and also word_2, word_1.",
       "hello word_1, and word_3.",                     
       "difficult! word_4b, and word_4a also word_4c",
       "A bit of word_1, some word_4a, and mostly word_3",
       "nothing interesting here"),
  stringsAsFactors=F)
  keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c"),stringsAsFactors=F)

# split the keywords into words, but remember keyword length
k <- keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest %>%
   group_by(Keyword) %>% mutate(n=n()) %>% ungroup
# split the title into words
# compare with words from keywords
# keep only possibly multiple, but full matches
# collate all results and merge back to the original data
df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
   inner_join(k,by="l") %>%
   group_by(Title,Keyword) %>% filter(n()%%n==0) %>%
   distinct(Keyword) %>% ungroup %>% nest(Keyword) %>%
   rowwise %>% mutate(keywords=paste(data[[1]],collapse=", ")) %>% select(-data) %>%
   inner_join(df,.,by="Title")
#    Title                                             Text                keywords
#1 Title_1 Very interesting word_1 and also word_2, word_1.          word_1, word_2
#2 Title_2                        hello word_1, and word_3.          word_1, word_3
#3 Title_3     difficult! word_4b, and word_4a also word_4c word_4a word_4b word_4c
#4 Title_4    A bit word_1, some word_4a, and mostly word_3          word_1, word_3
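
A possible reason this fails on the longer text in the question's update: filter(n() %% n == 0) compares counts, so a word that repeats often enough (feyenoord occurs 8 times there) can make a two-word keyword like "feyenoord skin" pass even though "skin" never occurs. One way around it is to test set membership instead of counts. The following base R sketch is an alternative I am suggesting, not the code above; match_keywords() is a hypothetical helper and the sample text is shortened from the question.

```r
# Hypothetical helper: returns the keywords whose parts ALL occur in `text`.
match_keywords <- function(text, keywords) {
  # Split on whitespace and common punctuation, lower-case, de-duplicate.
  words <- unique(tolower(strsplit(text, "[[:space:].,!?()/]+")[[1]]))
  hit <- vapply(keywords, function(k) {
    parts <- strsplit(tolower(k), " ", fixed = TRUE)[[1]]
    all(parts %in% words)   # every part must be present at least once
  }, logical(1))
  paste(keywords[hit], collapse = ", ")
}

text <- "Hey that sense feyenoord, and a feyenoord sampling bmw."
kw   <- c("feyenoord", "feyenoord skin", "feyenoord bmw",
          "a feyenoord sense", "sandbox")

found <- match_keywords(text, kw)
found
```

To fill a whole column, the helper could be applied per row, for example with vapply(df$Text, match_keywords, character(1), keywords = keywords$Keyword).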

RLave 6 years ago

Title <- c("A","B","C","A","A","B","A","A","B","C")
Text <- c("A",11,12,13,14,15,14,13,12,"hi")
df <- data.frame(Title,Text, stringsAsFactors=FALSE)

keywords <- c("A","B","hi")
keys <- data.frame(keywords,stringsAsFactors=FALSE)

require(dplyr)
require(stringr)
df %>% mutate(Keywords = paste(str_c(keys$keywords[which(keys$keywords %in% 
df$Title)],collapse = ","),str_c(keys$keywords[which(!keywords %in% 
df$Title)] 
[which(keywords[which(!keywords %in% df$Title)] %in% df$Text)], 
collapse=","), 
sep=",")) -> df

str_c(keys$keywords[which(keys$keywords %in% df$Title)],collapse = ",")

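This term finds the keywords that occur in the $Title column. It needs str_c to concatenate the keywords found into a single string, which avoids messy repetition caused by the unconcatenated result being a vector rather than a single string. The next term is:

str_c(keys$keywords[which(!keywords %in% df$Title)][which(keywords[which(!keywords 
%in% df$Title)] %in% df$Text)], collapse=",")

This looks ugly, but it is needed to pick the keywords that are not in $Title but are in $Text. For the same reason as before, we use str_c to make the output a string. Pasting the two strings together then gives the desired output. Tinker with collapse=" ," and sep = " ," if you want to add spaces.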
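
To make the roles of collapse and sep concrete, here is a tiny sketch with made-up vectors. It uses base paste(); str_c() takes the same collapse argument.

```r
in_title <- c("A", "B")   # keywords found in $Title (made-up values)
in_text  <- "hi"          # keyword found only in $Text (made-up value)

# collapse joins the elements of ONE vector into a single string.
title_part <- paste(in_title, collapse = ",")

# sep joins SEPARATE arguments, here the two strings.
combined <- paste(title_part, in_text, sep = ",")

# The same with a space after each comma.
spaced <- paste(paste(in_title, collapse = ", "), in_text, sep = ", ")
```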