
In R, how do I extract multiple matched terms as a string when a regex or grepl match is TRUE?

  • Lyndon L.  · Tech Community  · 5 years ago

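    I'm still a beginner in R. I need help with some code that searches a vector for terms from a list and returns TRUE on a match. If TRUE, it should also return a string of the matched terms.

    I have it set up to tell me whether any term matched and to return the first matched term, but I can't figure out how to get the rest of the matched terms.

    In the attached code, I have my desired output and my imperfect final output.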

    #create dataset of 2 columns/vectors. 1st column is "Job Title", 2nd column is "Work Experience"
    'Work Experience' <- c("cooked food; cleaned house; made beds", "analyzed data; identified gaps; used sql, python, and r", "used tableau to make dashboards for clients; applied advanced macro excel functions", "financial planning and strategy; consulted with leaders and clients")
    'Job Title' <- c("dad", "research analyst", "business intelligence consultant", "finance consultant")
    Job_Hist   <- data.frame(`Job Title`, `Work Experience`)
    
    #create list of terms to search for in Job_Hist
    Term_List <- c("python", " r", "sql", "tableau", "excel")
    
    #use grepl to search the Work Experience vector for terms in Term_List THEN return TRUE or FALSE
    Term_TF<- grepl(paste(Term_List, collapse = '|'),Job_Hist$Work.Experience)
    
    #add a new column to our final output dataframe that shows if the job experience matched our terms  
    Final_Output<-Job_Hist
    Final_Output$Term_Test <- Term_TF
    
    
    #Let's see what terms caused the TRUE flag in the Final_Output
    m<-regexpr(paste(Term_List, collapse = '|'),
           Job_Hist$Work.Experience, perl=TRUE)
    T_Match <- regmatches(Job_Hist$Work.Experience,m)
    
    
    
    #Compare Final_Output to my Desired_Output and please help me :)
    Desired_T_Match <- c("NA", "sql, python, r", "tableau, excel", "NA")
    Desired_Output <- data.frame(`Job Title`, `Work Experience`, Term_TF, Desired_T_Match)
    
    #I need 2 things. 
     #1) a way to tie T_Match back to Final_Output... something like if, TRUE then match
     #2) a way to return every term matched in a comma-delimited string. Example: research analyst   analyzed data...    TRUE    sql, python
    
    1 answer  |  as of 5 years ago
  •   Mako212    5 years ago

    You can use stringr::str_extract_all to get a list of the matches from each row:

    library(stringr)
    library(tidyverse)
    
    Job_Hist$matches <- str_extract_all(Job_Hist$Work.Experience, 
      paste(Term_List, collapse = '|'), simplify = TRUE)
    
                                                                          Work.Experience  Term matches.1 matches.2
    1                                               cooked food; cleaned house; made beds FALSE                    
    2                             analyzed data; identified gaps; used sql, python, and r  TRUE       sql    python
    3 used tableau to make dashboards for clients; applied advanced macro excel functions  TRUE   tableau     excel
    4                 financial planning and strategy; consulted with leaders and clients FALSE                    
      matches.3
    1          
    2         r
    3          
    4       
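
    For comparison, the regexpr call in the question stops at the first match in each string, which is why only one term came back. A minimal base R sketch of the same idea, assuming the question's data and terms (the lower-case names work_experience, term_list, all_pos and all_matches are illustrative, not from the original post), swaps in gregexpr so every match is returned:

```r
# Sample data copied from the question
work_experience <- c(
  "cooked food; cleaned house; made beds",
  "analyzed data; identified gaps; used sql, python, and r",
  "used tableau to make dashboards for clients; applied advanced macro excel functions",
  "financial planning and strategy; consulted with leaders and clients"
)
term_list <- c("python", " r", "sql", "tableau", "excel")
pattern <- paste(term_list, collapse = "|")

# regexpr() reports only the first match per string;
# gregexpr() reports every match position
all_pos <- gregexpr(pattern, work_experience, perl = TRUE)

# regmatches() then returns a list with one character vector per row
# (an empty vector where nothing matched)
all_matches <- regmatches(work_experience, all_pos)
print(all_matches)
```

    The result is a list rather than a matrix, so it has the same shape as str_extract_all with its default simplify = FALSE.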
    

    Edit: If you want the matches as a comma-separated string in a single column, you can use:

    str_extract_all(Job_Hist$Work.Experience, paste(Term_List, collapse = '|')) %>% 
      sapply(., paste, collapse = ", ")
    
               matches
    1                
    2 sql, python,  r
    3  tableau, excel
    4                
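
    To tie this back to the two things asked for in the question, here is one possible sketch. It assumes the column names Job.Title and Work.Experience that data.frame() produces from the question's code. The names job_hist, found, Term_Test and T_Match are illustrative. It stores a real NA rather than the string "NA" for rows without a match, and uses trimws() to drop the leading space that the " r" term carries into the result:

```r
library(stringr)

# Data frame rebuilt from the question
job_hist <- data.frame(
  Job.Title = c("dad", "research analyst",
                "business intelligence consultant", "finance consultant"),
  Work.Experience = c(
    "cooked food; cleaned house; made beds",
    "analyzed data; identified gaps; used sql, python, and r",
    "used tableau to make dashboards for clients; applied advanced macro excel functions",
    "financial planning and strategy; consulted with leaders and clients"
  ),
  stringsAsFactors = FALSE
)
term_list <- c("python", " r", "sql", "tableau", "excel")
pattern <- paste(term_list, collapse = "|")

# One character vector of matches per row
found <- str_extract_all(job_hist$Work.Experience, pattern)

# 1) TRUE/FALSE flag: did the row match at least one term?
job_hist$Term_Test <- lengths(found) > 0

# 2) Comma-delimited string of every matched term, NA when nothing matched
job_hist$T_Match <- vapply(
  found,
  function(x) if (length(x) == 0) NA_character_ else paste(trimws(x), collapse = ", "),
  character(1)
)

print(job_hist[, c("Job.Title", "Term_Test", "T_Match")])
```

    Because the flag and the string both come from the same found list, they cannot disagree with each other, which avoids having to line up a separate regmatches result with the TRUE rows.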
    

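    Note that if you use the default argument simplify = FALSE in str_extract_all, your matches column will look correct, just like the result we get from sapply above. However, if you inspect it with str(), you'll see that each element is actually its own list, which causes problems for certain types of analysis.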
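
    A small sketch of that difference, using a shortened two-element input made up for illustration (x, as_list and flat are hypothetical names):

```r
library(stringr)

x <- c("used sql, python, and r", "made beds")
pattern <- "python| r|sql"

# simplify = FALSE is the default: the result is a list of character vectors
as_list <- str_extract_all(x, pattern)
str(as_list)
print(is.list(as_list))

# Collapsing each element gives an ordinary character vector,
# one string per input element (an empty string where nothing matched)
flat <- vapply(as_list, paste, character(1), collapse = ", ")
str(flat)
print(is.character(flat))
```

    vapply is used here in place of sapply only because it checks that every element collapses to exactly one string. The collapsing step itself is the same as in the answer above.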