
R: filter rows that contain a combination of words

filtering dplyr text r

Varun · Tech Community · 6 years ago

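I am working with text data and looking for a way to solve a filtering problem.

I have managed to find a solution that filters for rows containing 'Word 1' OR 'Word 2':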

df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
                                 "long live the king",
                                 "I love my dog a lot",
                                 "Tomorrow will be a rainy day",
                                 "Tomorrow will be a sunny day"))


#Filter for rows that contain "brown" OR "dog"
filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))

However, when I try to filter for rows that contain both 'Word 1' AND 'Word 2', it does not work:

#Filter for rows that contain "brown" AND "dog"
filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))

5 replies | latest 6 years ago

akrun · 6 years ago

We can use two grepl calls combined with &:

dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))

Or use a single condition that checks for the word "brown" followed by the word "dog", or the other way around (note the word boundaries, \\b):

dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
#   UID                                         Text
#1   1 the quick brown fox jumped over the lazy dog

Note: this checks, using word boundaries, that both of the whole words "brown" and "dog" are present in the string.

With base R:

subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
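
The two-grepl idea extends to any number of words. Below is a minimal base R sketch of that; the helper name contains_all is hypothetical (it is not part of any package), and it assumes the words are plain alphanumeric strings with no regex metacharacters.

```r
# Sample data from the question
df <- data.frame(
  UID = 1:5,
  Text = c("the quick brown fox jumped over the lazy dog",
           "long live the king",
           "I love my dog a lot",
           "Tomorrow will be a rainy day",
           "Tomorrow will be a sunny day"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where `text` contains every element of `words`
# as a whole word. Assumes `words` holds no regex metacharacters.
contains_all <- function(text, words) {
  # One logical vector per word, each built with word boundaries
  hits <- lapply(words, function(w) grepl(paste0("\\b", w, "\\b"), text))
  # Combine the vectors element-wise with AND
  Reduce(`&`, hits)
}

result <- df[contains_all(df$Text, c("brown", "dog")), ]
```

Here result should hold only the row with UID 1, the same row the two-grepl filter returns.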

moodymudskipper · 6 years ago

You can use stringr::str_count:

dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
#   UID                                         Text test
# 1   1 the quick brown fox jumped over the lazy dog    2
# 2   2                           long live the king    0
# 3   3                          I love my dog a lot    1
# 4   4                 Tomorrow will be a rainy day    0
# 5   5                 Tomorrow will be a sunny day    0

dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog

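Note that the count-based filter above also keeps rows in which one of the words occurs twice and the other is missing. The following counts dog and brown at most once each, however many times they occur: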

dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
               ~sum(unique(.) %in% c("brown","dog"))) == 2)

#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog
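
To see why the two approaches can differ, here is a small base R sketch. It assumes a made-up second sentence that repeats "dog", and it uses regmatches() with gregexpr() as a stand-in for stringr::str_count so that it runs without extra packages.

```r
texts <- c("the quick brown fox jumped over the lazy dog",
           "my dog chased another dog",   # made-up example: "dog" twice, no "brown"
           "I love my dog a lot")

# Base R stand-in for stringr::str_count(): number of matches per string
n_matches <- lengths(regmatches(texts, gregexpr("brown|dog", texts)))

# Counting matches flags the second string although "brown" is absent
count_based <- n_matches == 2

# Counting distinct target words per string avoids that false positive
tokens <- strsplit(texts, "[[:punct:] ]")
distinct_based <- vapply(
  tokens,
  function(x) sum(unique(x) %in% c("brown", "dog")) == 2,
  logical(1)
)
```

In this sketch count_based is TRUE for the first two strings, while distinct_based is TRUE only for the first.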

Terru_theTerror · 6 years ago

Try this solution:

filtered_results_2=dplyr::filter(df, grepl('brown.*dog|dog.*brown', Text))
filtered_results_2
  UID                                         Text
1   1 the quick brown fox jumped over the lazy dog
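
One thing to keep in mind with this pattern: without word boundaries it also matches substrings inside longer words. A minimal sketch, assuming a made-up sentence that contains "brownie" and "dogma":

```r
texts <- c("the quick brown fox jumped over the lazy dog",
           "a brownie recipe full of dogma")  # made-up example sentence

# Without word boundaries, substrings inside longer words also match
loose <- grepl("brown.*dog|dog.*brown", texts)

# With \\b on both sides, only the standalone words match
strict <- grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", texts)
```

Here loose is TRUE for both strings, while strict is TRUE only for the first. Which behaviour is wanted depends on the data.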

Saurabh Chauhan · 6 years ago

Using sqldf:

library(sqldf)
sqldf("select * from df where Text like '%dog%' AND Text like '%brown%'")

Output:

  UID                                         Text
1   1 the quick brown fox jumped over the lazy dog

Chriss Paul · 6 years ago

In base R, using lookaheads:

df[grepl("(?=.*dog)(?=.*brown)", df$Text, perl = TRUE),]
  UID                                         Text
1   1 the quick brown fox jumped over the lazy dog
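
The lookahead pattern can also be assembled from a vector of words, which may help when the list is longer than two. A minimal sketch, assuming the words contain no regex metacharacters; the variable names are only illustrative.

```r
texts <- c("the quick brown fox jumped over the lazy dog",
           "long live the king",
           "I love my dog a lot",
           "Tomorrow will be a rainy day",
           "Tomorrow will be a sunny day")

words <- c("dog", "brown")

# One lookahead per word, concatenated: "(?=.*dog)(?=.*brown)"
pattern <- paste0("(?=.*", words, ")", collapse = "")

# Lookaheads need the PCRE engine, hence perl = TRUE
matches <- grepl(pattern, texts, perl = TRUE)
```

Each lookahead asserts that its word appears somewhere in the string without consuming characters, so the order of the words in the text does not matter.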