
R: filter rows that contain a combination of words

  •  4
  • Varun  · 6 years ago

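    I am working with text data and looking for a way to solve a filtering problem.

    I have managed to find a solution that filters rows containing "Word 1" OR "Word 2":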

    df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
                                     "long live the king",
                                     "I love my dog a lot",
                                     "Tomorrow will be a rainy day",
                                     "Tomorrow will be a sunny day"))
    
    
    #Filter for rows that contain "brown" OR "dog"
    filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))
    

    However, when I try to filter for rows that contain both "Word 1" AND "Word 2", it does not work:

    #Filter for rows that contain "brown" AND "dog"
    filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))
    

    5 answers  |  last reply 6 years ago
        1
  •  3
  •   akrun    6 years ago

    We can use two grepl calls and combine them with &:

    dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))
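
    The attempt in the question fails because & has no special meaning inside a regular expression, so 'brown & dog' only matches the literal text "brown & dog". As a minimal sketch, the two-grepl idea can be extended to any number of words. The helper name contains_all is hypothetical (not part of base R or dplyr), and the words are assumed to contain no regex metacharacters:

```r
df <- data.frame(
  UID = 1:5,
  Text = c("the quick brown fox jumped over the lazy dog",
           "long live the king",
           "I love my dog a lot",
           "Tomorrow will be a rainy day",
           "Tomorrow will be a sunny day"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where `text` contains every word in `words`
# as a whole word. Runs one grepl() per word and combines them with `&`.
# Assumes `words` is non-empty and free of regex metacharacters.
contains_all <- function(text, words) {
  hits <- lapply(words, function(w) grepl(paste0("\\b", w, "\\b"), text))
  Reduce(`&`, hits)
}

# '&' is a literal character in a regex, so no row matches this pattern
literal_amp <- grepl("brown & dog", df$Text)

# Only the rows where every word is present are kept
result <- df[contains_all(df$Text, c("brown", "dog")), ]
```

    Here result holds only the row with UID 1, and literal_amp is FALSE for every row.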
    

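    Or use a single condition that checks for the word "brown" followed by the word "dog", or "dog" followed by "brown". The word boundary (\\b) keeps the pattern from matching inside longer words: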

    dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
    #   UID                                         Text
    #1   1 the quick brown fox jumped over the lazy dog
    

    Note: this checks the word boundaries around "brown" and "dog" and that both words are present in the string, in either order.


    With base R:

    subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
    
        2
  •  5
  •   moodymudskipper    6 years ago

    You can use stringr::str_count:

    dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
    #   UID                                         Text test
    # 1   1 the quick brown fox jumped over the lazy dog    2
    # 2   2                           long live the king    0
    # 3   3                          I love my dog a lot    1
    # 4   4                 Tomorrow will be a rainy day    0
    # 5   5                 Tomorrow will be a sunny day    0
    
    dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
    #   UID                                         Text
    # 1   1 the quick brown fox jumped over the lazy dog
    

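    The count above would also reach 2 for a text that contains "dog" twice and no "brown". The version below counts "dog" and "brown" only once each, however many times they occur: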

    dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
                   ~sum(unique(.) %in% c("brown","dog"))) == 2)
    
    #   UID                                         Text
    # 1   1 the quick brown fox jumped over the lazy dog
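
    For comparison, here is a sketch of the same "each word counts once" idea in base R, without purrr. The sample strings and the variable names (wanted, found, keep) are assumptions made for illustration and are not data from the question:

```r
# Made-up strings for illustration only
x <- c("the quick brown fox jumped over the lazy dog",
       "dog eat dog",
       "I love my dog a lot")
wanted <- c("brown", "dog")

# One logical column per wanted word, so each word adds at most 1 per row
found <- do.call(cbind, lapply(wanted, function(w) {
  grepl(paste0("\\b", w, "\\b"), x)
}))
n_distinct_found <- rowSums(found)

# Total occurrences of either word, for contrast: "dog eat dog" scores 2
# here even though it never mentions "brown"
n_total <- lengths(regmatches(x, gregexpr("brown|dog", x)))

# Keep only the strings where every wanted word was found
keep <- n_distinct_found == length(wanted)
```

    Only the first string passes keep, while filtering on n_total == 2 would also let "dog eat dog" through.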
    
        3
  •  1
  •   Terru_theTerror    6 years ago

    Try this solution:

    filtered_results_2=dplyr::filter(df, grepl('brown.*dog|dog.*brown', Text))
    filtered_results_2
    #   UID                                         Text
    # 1   1 the quick brown fox jumped over the lazy dog
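
    One caveat with this pattern: it has no word boundaries, so it also matches "brown" and "dog" inside longer words. A small check on made-up strings (the vector x is an assumption for illustration, not data from the question) shows the difference from a version with \\b:

```r
# Made-up strings for illustration only
x <- c("the brown dog", "a brownie and a hotdog", "dog then brown")

# Without word boundaries: substrings such as "brownie" and "hotdog" match
loose <- grepl("brown.*dog|dog.*brown", x)

# With word boundaries: only the whole words "brown" and "dog" match
strict <- grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", x)
```

    Here loose is TRUE for all three strings, while strict is FALSE for "a brownie and a hotdog". Which behaviour is wanted depends on the data.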
    
        4
  •  1
  •   Saurabh Chauhan    6 years ago

    Using sqldf:

    library(sqldf)
    sqldf("select * from df where Text like '%dog%' AND Text like '%brown%'")
    

    Output:

      UID                                         Text
    1   1 the quick brown fox jumped over the lazy dog
    
        5
  •  1
  •   Chriss Paul    6 years ago

    In base R, with lookaheads:

    df[grepl("(?=.*dog)(?=.*brown)", df$Text, perl = TRUE),]
    #   UID                                         Text
    # 1   1 the quick brown fox jumped over the lazy dog
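
    The lookahead pattern extends to more than two words. A sketch, assuming the words contain no regex metacharacters; all_of_pattern is a hypothetical helper name, and the leading ^ and the \\b boundaries are additions that are not in the pattern above:

```r
# Hypothetical helper: builds one lookahead per word, anchored at the start
# of the string, so the match succeeds only if every word occurs somewhere.
all_of_pattern <- function(words) {
  paste0("^", paste0("(?=.*\\b", words, "\\b)", collapse = ""))
}

pat <- all_of_pattern(c("dog", "brown"))

x <- c("the quick brown fox jumped over the lazy dog",
       "I love my dog a lot")

# Lookaheads need the PCRE engine, hence perl = TRUE
hits <- grepl(pat, x, perl = TRUE)
```

    With these inputs pat is the R string "^(?=.*\\bdog\\b)(?=.*\\bbrown\\b)" and hits is TRUE for the first string only. Word order does not matter, because each lookahead scans the whole string independently.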