I have a data frame with a column that contains the following information:
c("GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
"GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation",
"MARR.CARD_pvgb_U00096_1619119_1619554_ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation",
"PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation",
"SOXS.CARD_pvgb_U00096_4277468_4277933_ARO_3003381_Escherichia_Multi_drug_resistance_MDR_regulator_SOXS_RequiresSNPConfirmation",
"TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC",
"parE.CARD_pvgb_NC_007779_3172159_3174052_ARO_3003316_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation",
"GYRA.Flq_CP001918.1_gene3562_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
"PARC.Flq_NC_003197.1.1254697_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation",
"GYRA.Flq_NC_003197.1.1253794_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
"parE.CARD_pvgb_NC_003197_3343961_3345854_ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation",
"ACRR.CARD_pvgb_NC_014121_1270697_1271351_ARO_3003374_Enterobacter_Multi_drug_resistance_MDR_regulator_ACRR_RequiresSNPConfirmation"
)
I want to extract the accession identifier from each string, for example NC_002695.1.916822 from the first one. So far I have tried:
library(dplyr)
df %>% mutate(ref_name2 = sub("[A-z]+.[A-z]+.[A-z]+.([A-z][A-z].[0-9]+.[0-9].[0-9]+)", "\\1", ref_name),
ref_name2 = sub("\\_ARO.*", "", ref_name2),
ref_name2 = sub("\\_Fluoro.*", "", ref_name2),
ref_name2 = sub("\\_gene.*", "", ref_name2))
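But this only partially matches the strings above, and it also removes a few letters that I want to keep. Is there a simpler way than multiple sub/gsub calls?
What I want is: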
c("NC_002695.1.916822", "AP009048_3760295_3762710", "U00096_1619119_1619554", "M58408", "U00096_4277468_4277933", "FJ768952_0_1488", "NC_007779_3172159_3174052", "CP001918.1", "NC_003197.1.1254697", "NC_003197.1.1253794", "NC_003197_3343961_3345854", "NC_014121_1270697_1271351")
I tried to match it visually with https://regexr.com/30u4a, and I have also read a lot about complex matching, but I can't seem to find the right code.
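For reference, here is a minimal sketch of one possible single-call approach. It is not a definitive answer. Two things stand out in the attempt above: `[A-z]` also matches the characters between `Z` and `a` in ASCII (including `_`), and an unescaped `.` matches any character. The sketch anchors on the markers visible in the sample instead: a prefix of `Flq_` or `CARD_` plus a lowercase tag, and an end marker of `_ARO`, `_Fluoroquinolones` or `_gene`. The helper name `extract_accession` is made up for illustration, and the pattern assumes every value follows one of these layouts.

```r
# Sample values copied from the ref_name column shown in the question.
ref_name <- c(
  "GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
  "GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation",
  "MARR.CARD_pvgb_U00096_1619119_1619554_ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation",
  "PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation",
  "SOXS.CARD_pvgb_U00096_4277468_4277933_ARO_3003381_Escherichia_Multi_drug_resistance_MDR_regulator_SOXS_RequiresSNPConfirmation",
  "TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC",
  "parE.CARD_pvgb_NC_007779_3172159_3174052_ARO_3003316_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation",
  "GYRA.Flq_CP001918.1_gene3562_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
  "PARC.Flq_NC_003197.1.1254697_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation",
  "GYRA.Flq_NC_003197.1.1253794_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
  "parE.CARD_pvgb_NC_003197_3343961_3345854_ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation",
  "ACRR.CARD_pvgb_NC_014121_1270697_1271351_ARO_3003374_Enterobacter_Multi_drug_resistance_MDR_regulator_ACRR_RequiresSNPConfirmation"
)

# Hypothetical helper (not from the question): capture the text between the
# database prefix and the first end marker in a single sub() call.
extract_accession <- function(x) {
  pattern <- paste0(
    "^[^.]+\\.",                         # gene label up to the first literal dot
    "(?:Flq|CARD_[a-z]+)_",              # prefix: "Flq_" or "CARD_" + lowercase tag + "_"
    "(.*?)",                             # accession: shortest run before an end marker
    "_(?:ARO|Fluoroquinolones|gene).*$"  # end marker (assumed from the sample) and the rest
  )
  # perl = TRUE so that the lazy quantifier and non-capturing groups are available
  sub(pattern, "\\1", x, perl = TRUE)
}

ref_name2 <- extract_accession(ref_name)
print(ref_name2)
```

With dplyr this would be used as `df %>% mutate(ref_name2 = extract_accession(ref_name))`. Because `sub()` returns non-matching strings unchanged, any value that does not follow the assumed layout comes back as-is, which makes such rows easy to spot.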