Extract multiple strings from a column with comma-separated values

regex r

Haakonkas · 6 years ago

I have a data frame like this:

structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

What I want is:

    mut                    nt
1   Q184H                  CAA-CAT
2   I219V                  ATC-GTC
3   A314T, P373Q, A653E    GCG-ACG, CCG-CAG, GCG-GAA
4   0                      0

What I have tried so far:

library(dplyr)
df %>%
    mutate(nt = gsub(".+/(.*?)", "\\1", mut))

How can I make the pattern match every entry? Do I have to split the entries apart and then pair them up again?

2 Answers

duckmayr · 6 years ago

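You need to change the . characters in your pattern to [^,]. Inside brackets, a leading ^ means "match anything but these characters", so [^,]+ matches as many consecutive non-comma characters as possible. That keeps each match inside a single comma-separated entry.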
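To see the difference on the third row in isolation, here is a minimal base-R sketch. The variable names are illustrative and not part of the original answer.

```r
x <- "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA"

# A greedy `.+` may cross commas: it swallows everything up to the
# last "/", so only the tail of the final entry is left over.
greedy_tail <- sub(".+/", "", x)

# `[^,]+` stops at every comma, so the string falls apart into one
# chunk per entry (the leading spaces belong to the chunks).
chunks <- regmatches(x, gregexpr("[^,]+", x))[[1]]

print(greedy_tail)
print(chunks)
```

Here greedy_tail holds only the codons of the last entry, while chunks has three elements, one per mutation.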

df = structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC",
                            "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")),
               row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
    mutate(nt = gsub("[^,]+?/([^,]+?)", "\\1", mut),
           mut = gsub("([^/]+)/[^,]+", "\\1", mut))
#> # A tibble: 4 x 2
#>   mut                 nt                     
#>   <chr>               <chr>                  
#> 1 Q184H               CAA-CAT                
#> 2 I219V               ATC-GTC                
#> 3 A314T, P373Q, A653E GCG-ACG,CCG-CAG,GCG-GAA
#> 4 0                   0

Created on 2018-10-10 by the reprex package (v0.2.1)
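Note that in the output above the nt column has no space after each comma, because the lazy [^,]+? also consumes the space in front of each later entry. If the ", " separator from the desired output matters, one possible variant is to delete the unwanted half of each entry instead of capturing the wanted half. This is a sketch in base R under that assumption, not the answer author's code:

```r
df <- data.frame(
  mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC",
          "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0"),
  stringsAsFactors = FALSE
)

# Build nt from the original column first, then overwrite mut.
# "[^,/ ]+/" is a run without commas, slashes or spaces, followed by "/",
# i.e. each "<mutation>/" prefix; separators are left untouched.
df$nt  <- gsub("[^,/ ]+/", "", df$mut)
# "/[^,]+" is a slash plus everything up to the next comma,
# i.e. each "/<codons>" suffix.
df$mut <- gsub("/[^,]+", "", df$mut)

print(df)
```

Rows without a slash, such as "0", pass through unchanged in both columns.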

hrbrmstr · 6 years ago

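Don't accept this answer (@duckmayr did the regex debugging). I'm posting it to show that with stringi we can write self-documenting regular expressions, so that our future selves don't end up hating our past selves: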

library(stringi) # it's what stringr uses
library(tidyverse)

xdf <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

mutate(
  xdf, 
  nt = stri_replace_all_regex(
    str = mut,
    pattern = "
[^,]+?  # one or more non-comma characters, as few as possible (lazy)
/       # followed by a forward slash
(       # start of match group
 [^,]+? # same as above
)       # end of match group
",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  ),
  mut = stri_replace_all_regex(
    str = mut,
    pattern = "
(      # start of match group
 [^/]+ # match anything but a forward slash
)      # end of match group
/      # followed by a forward slash
[^,]+  # match anything but a comma
",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  )
)
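As for the second half of the question: splitting and re-pairing is not required, as the regex answers show, but it is a workable alternative. A rough base-R sketch follows; the helper name pick_side is hypothetical and this is not code from either answer.

```r
mut_raw <- c("Q184H/CAA-CAT", "I219V/ATC-GTC",
             "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# Step 1: split every row into its comma-separated entries.
entries <- strsplit(mut_raw, ", *")

# Step 2 (hypothetical helper): split each entry at "/", keep the
# requested side, and paste the pieces of one row back together.
pick_side <- function(parts, side) {
  pieces <- strsplit(parts, "/", fixed = TRUE)
  # An entry without "/" (such as "0") has one piece; use it for both sides.
  picked <- vapply(pieces, function(p) p[min(side, length(p))], character(1))
  paste(picked, collapse = ", ")
}

result <- data.frame(
  mut = vapply(entries, pick_side, character(1), side = 1),
  nt  = vapply(entries, pick_side, character(1), side = 2),
  stringsAsFactors = FALSE
)

print(result)
```

This is longer than a single gsub call, but every step can be inspected on its own, which may help when the entry format is less regular than in the example data.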
