Extract the single word following a character:

  • Madisonel  · 3 years ago

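    I want to extract the names of drugs, where "Drug:", "Other:", etc. precede the drug name. Take the first word after each ":", including characters such as "-". If there are 2 instances of ":", the 2 words should be joined into one string with "and". The output should be in a single-column data frame with the column name Drug.

    Here is my reproducible example: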

    my.df <- data.frame(col1 = as.character(c(
      "Product: TLD-1433 infusion Therapy",
      "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
      "Drug: Atezolizumab",
      "Drug: N-803 and BCG|Drug: N-803",
      "Drug: Everolimus and Intravesical Gemcitabine",
      "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
    )))
    

    The output should look like this:

    output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
    

    This is what I tried, but it did not work. Attempt 1:

    str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
           
    

    Attempt 2:

    str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')
    
  • The fourth bird  · 3 years ago

    I'm not very familiar with R, but a pattern that gives you the matches from the example data could be:

    (?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
    

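    You can then join the matches with " and " in between.

    The pattern matches:

    • (?<=:\s) Positive lookbehind: asserts that a : followed by a whitespace character is directly to the left
    • \w+(?:-\w+)* Matches 1+ word characters, optionally followed by repetitions of - and 1+ word characters
    • (?: Non-capturing group
      • and \w+(?:-\w+)* Matches " and " (with a space on each side) followed by 1+ word characters, optionally followed by repetitions of - and 1+ word characters
    • )* Closes the non-capturing group and optionally repeats it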


    To get all the matches, you can use str_extract_all:

    str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
    

    For example:

    library(stringr)
    my.df <- data.frame(col1 = as.character(c(
      "Product: TLD-1433 infusion Therapy",
      "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
      "Drug: Atezolizumab",
      "Drug: N-803 and BCG|Drug: N-803",
      "Drug: Everolimus and Intravesical Gemcitabine",
      "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
    )))
    # Extract every match per row, then join each row's matches with " and "
    lapply(
      str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*'),
      paste, collapse = " and "
    )
    

    Output:

    [[1]]
    [1] "TLD-1433"
    
    [[2]]
    [1] "CG0070 and n-dodecyl-B-D-maltoside"
    
    [[3]]
    [1] "Atezolizumab"
    
    [[4]]
    [1] "N-803 and BCG and N-803"
    
    [[5]]
    [1] "Everolimus and Intravesical"
    
    [[6]]
    [1] "Association and Association"
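
    The lapply() call above returns a list, while the question asks for a single-column data frame. A minimal sketch of that last step follows. It swaps in base R's regmatches()/gregexpr() with perl = TRUE in place of str_extract_all(), which is an assumption made here so the snippet runs without stringr; it is not part of the original answer.

```r
# Same sample data as in the question
my.df <- data.frame(col1 = c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
), stringsAsFactors = FALSE)

# Pattern from the answer above
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# gregexpr() locates every match in each string (perl = TRUE enables the
# lookbehind); regmatches() extracts them as a list of character vectors
matches <- regmatches(my.df$col1, gregexpr(pattern, my.df$col1, perl = TRUE))

# Collapse each row's matches into a single string
Drugs <- vapply(matches, paste, character(1), collapse = " and ")

# Single-column data frame, as requested in the question
output.df <- data.frame(Drugs = Drugs, stringsAsFactors = FALSE)
output.df
```

    vapply() is used rather than sapply() so the result is guaranteed to be a character vector with one string per row, even when a row has no match (it then yields an empty string).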
    
  • Ryszard Czech  · 3 years ago

    Use

    :\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b
    


    Explanation

    --------------------------------------------------------------------------------
      :                        ':'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        [\w-]+                   any character of: word characters (a-z,
                                 A-Z, 0-9, _), '-' (1 or more times
                                 (matching the most amount possible))
    --------------------------------------------------------------------------------
        \b                       the boundary between a word char (\w)
                                 and something that is not a word char
    --------------------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
    --------------------------------------------------------------------------------
          \s+                      whitespace (\n, \r, \t, \f, and " ")
                                   (1 or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          and                      'and'
    --------------------------------------------------------------------------------
          \s+                      whitespace (\n, \r, \t, \f, and " ")
                                   (1 or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          \b                       the boundary between a word char (\w)
                                   and something that is not a word char
    --------------------------------------------------------------------------------
          [\w-]+                   any character of: word characters (a-
                                   z, A-Z, 0-9, _), '-' (1 or more times
                                   (matching the most amount possible))
    --------------------------------------------------------------------------------
        )*                       end of grouping
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    

    R code:

    my.df <- data.frame(col1 = as.character(c(
      "Product: TLD-1433 infusion Therapy",
      "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
      "Drug: Atezolizumab",
      "Drug: N-803 and BCG|Drug: N-803",
      "Drug: Everolimus and Intravesical Gemcitabine",
      "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
    )))
    library(stringr)
    # Each list element is a matrix: column 1 is the full match, column 2 is capture group 1
    matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
    # Drop the full-match column and join the captured names of each row with " and "
    Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
    output.df <- data.frame(Drugs)
    output.df
    

    Results:

                                   Drugs
    1                           TLD-1433
    2 CG0070 and n-dodecyl-B-D-maltoside
    3                       Atezolizumab
    4            N-803 and BCG and N-803
    5        Everolimus and Intravesical
    6        Association and Association
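
    Both answers return "N-803 and BCG and N-803" for row 4, whereas the expected output in the question lists "N-803 and N-803". If only the first word after each ":" is wanted, as the question's wording suggests, a simpler pattern may be enough. The sketch below is an assumption about that reading, not something either answer proposed, and it uses base R so it runs without stringr. Note the trade-off: row 5 then becomes "Everolimus" rather than the "Everolimus and Intravesical" shown in the question, so the question's expected output cannot be reproduced by either rule alone.

```r
# Same sample data as in the question
my.df <- data.frame(col1 = c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
), stringsAsFactors = FALSE)

# Hypothetical simpler pattern: one run of word characters or hyphens
# directly after a colon and a whitespace character
first.word.pattern <- "(?<=:\\s)[\\w-]+"

# Extract the first word after every ":" in each row
first.words <- regmatches(
  my.df$col1,
  gregexpr(first.word.pattern, my.df$col1, perl = TRUE)
)

# Join the words of each row with " and "
first.only.df <- data.frame(
  Drugs = vapply(first.words, paste, character(1), collapse = " and "),
  stringsAsFactors = FALSE
)
first.only.df
```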