
Extract multiple strings from a column with comma-separated values

  •  2
  • Haakonkas  · 6 years ago

    I have a data frame like this:

    df <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
    

    This is what I want:

        mut                    nt
    1   Q184H                  CAA-CAT
    2   I219V                  ATC-GTC
    3   A314T, P373Q, A653E    GCG-ACG, CCG-CAG, GCG-GAA
    4   0                      0
    

    library(dplyr)
    df %>%
        mutate(nt = gsub(".+/(.*?)", "\\1", mut))
    

    How can I make the pattern match for every entry? Do I have to split the entries apart and then pair them up again?

    2 answers  |  last activity 6 years ago
        1
  •  3
  •   duckmayr    6 years ago

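    Change the `.`s in your pattern to `[^,]`. Inside a bracket expression, a leading `^` means "match anything except these characters". So `[^,]+` means: match as many consecutive non-comma characters as possible.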
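    To see the difference on a single string, here is a minimal base R sketch. The variable names and the simplified patterns (`.+/` and `[^,]+/`) are illustrative assumptions, not the exact patterns from the question:

```r
x <- "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA"

# Greedy `.+` runs straight across the commas, so a single match swallows
# everything up to the last slash and only the final entry survives.
greedy <- gsub(".+/", "", x)

# `[^,]+` cannot cross a comma, so each comma-separated entry is matched
# (and its "name/" prefix removed) on its own.
per_entry <- gsub("[^,]+/", "", x)

print(greedy)     # "GCG-GAA"
print(per_entry)  # "GCG-ACG,CCG-CAG,GCG-GAA"
```

    Note that the space after each comma is itself a non-comma character, so it is consumed together with the following name.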

    df = structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC",
                                "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")),
                   row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    df %>%
        mutate(nt = gsub("[^,]+?/([^,]+?)", "\\1", mut),
               mut = gsub("([^/]+)/[^,]+", "\\1", mut))
    #> # A tibble: 4 x 2
    #>   mut                 nt                     
    #>   <chr>               <chr>                  
    #> 1 Q184H               CAA-CAT                
    #> 2 I219V               ATC-GTC                
    #> 3 A314T, P373Q, A653E GCG-ACG,CCG-CAG,GCG-GAA
    #> 4 0                   0
    

    Created on 2018-10-10 by the reprex package (v0.2.1)
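    The `nt` column in the output above loses the space after each comma, whereas the desired output in the question keeps it. One possible variation, assuming mutation names never contain spaces, is to exclude the space from the character class as well. This is a base R sketch on a plain character vector (the name `muts` is introduced here for illustration), not part of the original answer:

```r
muts <- c("Q184H/CAA-CAT", "I219V/ATC-GTC",
          "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# Excluding the space as well as the comma means the match starts at the
# mutation name itself, so the ", " separator between entries is kept.
nt <- gsub("[^, ]+/", "", muts)

# Removing each "/..." suffix up to the next comma leaves the mutation names.
mut <- gsub("/[^,]+", "", muts)

print(data.frame(mut = mut, nt = nt))
```

    Rows without a slash, such as "0", contain no match and pass through unchanged in both columns.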

        2
  •  2
  •   hrbrmstr    6 years ago

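    Do not accept this answer (@duckmayr did the regex debugging). I am posting it to show that with stringi we can write self-documenting regular expressions, so that our future selves do not end up hating our past selves: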

    library(stringi) # it's what stringr uses
    library(tidyverse)
    
    xdf <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
    
    mutate(
      xdf, 
      nt = stri_replace_all_regex(
        str = mut,
        pattern = "
    [^,]+?  # one or more non-comma characters, matched lazily (as few as possible)
    /       # followed by a forward slash
    (       # start of match group
     [^,]+? # same as above
    )       # end of match group
    ",
        replacement = "$1", # take the match group value as the value
        opts_regex = stri_opts_regex(comments=TRUE)
      ),
      mut = stri_replace_all_regex(
        str = mut,
        pattern = "
    (      # start of match group
     [^/]+ # match anything but a forward slash
    )      # end of match group
    /      # followed by a forward slash
    [^,]+  # match anything but a comma
    ",
        replacement = "$1", # take the match group value as the value
        opts_regex = stri_opts_regex(comments=TRUE)
      )
    )
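    If stringi is not available, a similar free-spacing style should also be possible in base R: with `perl = TRUE`, the inline flag `(?x)` puts PCRE into extended mode, where whitespace in the pattern is ignored and `#` starts a comment. The sketch below is an assumption-labelled alternative rather than part of the answer above; the names `x`, `mut_pattern`, and `mut_only` are illustrative, and the pattern is a simplified one that only strips the nucleotide part:

```r
x <- c("Q184H/CAA-CAT", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# (?x) enables PCRE extended mode: unescaped whitespace outside character
# classes is ignored, and `#` begins a comment that runs to the end of the line.
mut_pattern <- "(?x)
  /       # a forward slash
  [^,]+   # then everything up to the next comma (or the end of the string)
"

# Deleting every "/..." chunk leaves only the mutation names.
mut_only <- gsub(mut_pattern, "", x, perl = TRUE)
print(mut_only)
```

    The trade-off is the same as with `stri_opts_regex(comments=TRUE)`: a literal space or `#` in the pattern then has to be escaped or placed inside a character class.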