
Complex regex with varied pattern matching

  •  1
  • Haakonkas  · Tech community  · 6 years ago

    I have a data frame with a column that contains the following information:

        c("GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
    "GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation", 
    "MARR.CARD_pvgb_U00096_1619119_1619554_ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation", 
    "PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", 
    "SOXS.CARD_pvgb_U00096_4277468_4277933_ARO_3003381_Escherichia_Multi_drug_resistance_MDR_regulator_SOXS_RequiresSNPConfirmation", 
    "TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC", 
    "parE.CARD_pvgb_NC_007779_3172159_3174052_ARO_3003316_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation", 
    "GYRA.Flq_CP001918.1_gene3562_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
    "PARC.Flq_NC_003197.1.1254697_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", 
    "GYRA.Flq_NC_003197.1.1253794_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
    "parE.CARD_pvgb_NC_003197_3343961_3345854_ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation", 
    "ACRR.CARD_pvgb_NC_014121_1270697_1271351_ARO_3003374_Enterobacter_Multi_drug_resistance_MDR_regulator_ACRR_RequiresSNPConfirmation"
    )
    

    This is what I have tried so far:

    library(dplyr)
    df %>% mutate(ref_name2 = sub("[A-z]+.[A-z]+.[A-z]+.([A-z][A-z].[0-9]+.[0-9].[0-9]+)", "\\1", ref_name),
             ref_name2 = sub("\\_ARO.*", "", ref_name2),
             ref_name2 = sub("\\_Fluoro.*", "", ref_name2),
             ref_name2 = sub("\\_gene.*", "", ref_name2))
    

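    But this only partially matches the strings above, and it also removes a few letters that I want to keep. Is there a simpler way than calling sub/gsub several times?

    What I want is: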

    c(NC_002695.1.916822, AP009048_3760295_3762710, U00096_1619119_1619554, M58408, U00096_4277468_4277933, FJ768952_0_1488, NC_007779_3172159_3174052, CP001918.1, NC_003197.1.1254697, NC_003197.1.1253794, NC_003197_3343961_3345854, NC_014121_1270697_1271351)
    

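    I tried working out the match visually at https://regexr.com/30u4a and have also read a lot about complex matching, but I can't seem to find the right code.

    2 answers  |  as of 6 years ago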
        1
  •  5
  •   Wiktor Stribiżew    6 years ago

    You can use

    > sub("^.*?_([A-Z]+[0-9_.]*[0-9]).*", "\\1", x)
     [1] "NC_002695.1.916822"        "AP009048_3760295_3762710"  "U00096_1619119_1619554"    "M58408"                    "U00096_4277468_4277933"    "FJ768952_0_1488"          
     [7] "NC_007779_3172159_3174052" "CP001918.1"                "NC_003197.1.1254697"       "NC_003197.1.1253794"       "NC_003197_3343961_3345854" "NC_014121_1270697_1271351"
    

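    See the regex demo.

    Pattern details

    • ^ - start of the string
    • .*? - zero or more characters, as few as possible (note that [^_]* cannot be used here, because the pattern we need may appear after 0 or more underscores)
    • _ - a literal _
    • ([A-Z]+[0-9_.]*[0-9]) - capturing group 1:
      • [A-Z]+ - one or more uppercase ASCII letters
      • [0-9_.]* - 0 or more digits, _ or . characters
      • [0-9] - a single digit, so the captured value never ends in _ or .
    • .* - the rest of the string, so that sub replaces the whole string with group 1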
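Since the question stores the strings in a data frame column, here is a minimal sketch of applying the pattern above to such a column in a single `sub` call. The data frame `df` and the column names `ref_name` and `ref_name2` follow the question; the shortened three-row sample and the variables `pattern` and `strict` are only illustrative assumptions. The second call shows the `[^_]*` variant mentioned in the list, to illustrate why it is not enough.

```r
# Three of the strings from the question, stored in a data frame column.
df <- data.frame(
  ref_name = c(
    "GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
    "GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation",
    "PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation"
  ),
  stringsAsFactors = FALSE
)

# Pattern from the answer: one sub() call replaces the whole string with group 1.
pattern <- "^.*?_([A-Z]+[0-9_.]*[0-9]).*"
df$ref_name2 <- sub(pattern, "\\1", df$ref_name)

# Contrast: with [^_]* the prefix cannot cross an underscore, so the CARD
# string (where "pvgb_" precedes the ID) does not match and is returned unchanged.
strict <- sub("^[^_]*_([A-Z]+[0-9_.]*[0-9]).*", "\\1", df$ref_name)

print(df$ref_name2)
print(strict == df$ref_name)  # TRUE marks strings the stricter pattern left untouched
```

The same `sub(pattern, "\\1", ref_name)` expression should also drop into the `mutate()` call from the question in place of the four chained substitutions.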
        2
  •  0
  •   Luis Colorado    6 years ago

    The best I could come up with is this example:

    "[A-Za-z]*\.([A-Za-z]*_)*([A-Z]+_?\d+(_\d+(_\d+)*|\.\d+(\.\d+)*)?)[^"]*"
    

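    That is, it matches an opening double quote " , then a run of letters, then a dot . , then a variable number (possibly zero) of letter sequences (any case) each followed by an underscore _ , and then the group we are interested in (group \2 ):

    • a sequence of uppercase letters, followed by an (optional) underscore, followed by a group of digits, optionally followed by either
      • a sequence of groups of digits separated by underscores _ , or
      • a sequence of groups of digits separated by dots . .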

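    Then comes everything up to the next double quote, which ends the string. If you replace each match with

    \\2
    

    you get the desired result you posted, as shown in the demo above.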
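The pattern above was written against the quoted source text, so using it on an R character vector needs a small adaptation. The sketch below is one possible way to do that, not the answerer's own code: the surrounding double quotes are replaced by `^` and `.*$` anchors, backslashes are doubled for R string literals, and `perl = TRUE` is an assumption made here so that `\d` and the nested groups follow PCRE rules. The names `x`, `pattern2` and `ids` are illustrative.

```r
# Four of the strings from the question.
x <- c(
  "MARR.CARD_pvgb_U00096_1619119_1619554_ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation",
  "TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC",
  "GYRA.Flq_CP001918.1_gene3562_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
  "parE.CARD_pvgb_NC_003197_3343961_3345854_ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation"
)

# Adapted pattern: letters, a dot, any number of "letters_" prefixes, then
# group 2 (the identifier), then the rest of the string.
pattern2 <- "^[A-Za-z]*\\.([A-Za-z]*_)*([A-Z]+_?\\d+(_\\d+(_\\d+)*|\\.\\d+(\\.\\d+)*)?).*$"

# Replace each whole string with the contents of group 2.
ids <- sub(pattern2, "\\2", x, perl = TRUE)
print(ids)
```

Because group 2 spells out the two allowed shapes (digit groups joined by underscores, or digit groups joined by dots), this version is stricter about what counts as an identifier than a single character class, at the cost of a longer pattern.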