
Extract the single word following the character ":"

regex-lookarounds regex

Madisonel · 3 years ago

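I want to extract the drug names, where "Drug:", "Other:", etc. precede each name. Take the first word after each ":", including characters such as "-". If there are 2 instances of ":", then "and" should join the 2 words into one string. The output should be a single-column data frame whose column is named Drug.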

Here is my reproducible example:

my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))

The output should look like this:

output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))

This is what I tried, but it did not work. Attempt 1:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)')

Attempt 2:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')


The fourth bird 3 years ago

I am not very familiar with R, but a pattern that gives you the matches from the example data could be:

(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*

You can then join the matches with " and " between them.

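The pattern matches:

(?<=:\s) Positive lookbehind, asserts that : followed by a whitespace character is directly to the left
\w+(?:-\w+)* Match 1+ word characters, optionally repeated by - and 1+ word characters
(?: Non-capture group
 and \w+(?:-\w+)* Match " and " followed by 1+ word characters, optionally repeated by - and 1+ word characters
)* Close the non-capture group and optionally repeat it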
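To see the breakdown above in action, here is a minimal sketch that applies the pattern to one row of the sample data. It uses base R (gregexpr and regmatches with perl = TRUE, which enables the lookbehind) rather than stringr, purely so that it runs without extra packages; the variable names are illustrative.

```r
# Pattern from the answer, with backslashes doubled for an R string literal.
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# One row taken from the sample data in the question.
x <- "Drug: N-803 and BCG|Drug: N-803"

# gregexpr() locates every match; regmatches() extracts the matched text.
# perl = TRUE is required because lookbehind is a PCRE feature.
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(hits)
```

The first hit includes " and BCG" because of the optional trailing group, which is why this row later becomes "N-803 and BCG and N-803" once the hits are joined.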

Regex demo

To get all the matches, you can use str_extract_all

str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')

For example

library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
lapply(
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
, paste, collapse=" and ")

Output

[[1]]
[1] "TLD-1433"

[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"

[[3]]
[1] "Atezolizumab"

[[4]]
[1] "N-803 and BCG and N-803"

[[5]]
[1] "Everolimus and Intravesical"

[[6]]
[1] "Association and Association"
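The result above is a list, while the question asks for a single-column data frame with a column named Drug. A possible way to get there is sketched below, assuming one joined string per row is wanted. It swaps str_extract_all for base R's gregexpr/regmatches so it is self-contained, and drug.df is a hypothetical name.

```r
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# Three rows from the sample data in the question.
col1 <- c("Product: TLD-1433 infusion Therapy",
          "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
          "Drug: N-803 and BCG|Drug: N-803")

# One character vector of matches per input row.
matches <- regmatches(col1, gregexpr(pattern, col1, perl = TRUE))

# vapply() checks that each row collapses to exactly one string.
Drug <- vapply(matches, paste, character(1), collapse = " and ")

drug.df <- data.frame(Drug = Drug, stringsAsFactors = FALSE)
print(drug.df)
```

A row with no match at all would end up as an empty string here, since pasting an empty vector with collapse gives "".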

Ryszard Czech 3 years ago

Use

:\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b

See the regex proof.

Explanation

--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [\w-]+                   any character of: word characters (a-z,
                             A-Z, 0-9, _), '-' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
--------------------------------------------------------------------------------
      [\w-]+                   any character of: word characters (a-
                               z, A-Z, 0-9, _), '-' (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char

R code:

my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
library(stringr)
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs) 
output.df

Results:

                               Drugs
1                           TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3                       Atezolizumab
4            N-803 and BCG and N-803
5        Everolimus and Intravesical
6        Association and Association
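One practical difference between the two patterns in this thread is how they treat the spacing after the colon: the lookbehind (?<=:\s) needs exactly one whitespace character directly before the word, while :\s* accepts zero or more. The sketch below illustrates this on hypothetical inputs that are not part of the sample data, using shortened versions of both patterns and base R only.

```r
# Shortened versions of the two patterns (the " and ..." part is omitted).
lookbehind <- "(?<=:\\s)\\w+(?:-\\w+)*"
capture    <- ":\\s*\\b([\\w-]+)"

# Hypothetical inputs: no space, one space, two spaces after the colon.
x <- c("Drug:Atezolizumab", "Drug: Atezolizumab", "Drug:  Atezolizumab")

# Does the lookbehind version find anything in each string?
lb_found <- grepl(lookbehind, x, perl = TRUE)

# regexec() returns the full match followed by capture group 1;
# keep only the captured drug name.
groups <- regmatches(x, regexec(capture, x, perl = TRUE))
cap <- vapply(groups, function(m) m[2], character(1))

print(lb_found)
print(cap)
```

If the real data always has a single space after the colon, as in the sample, both approaches behave the same; the capture-group version is the more forgiving choice when the spacing varies.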