How do I manipulate two parts of one column?

  •  2
  • neuron  · Tech Community  · 6 years ago

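    I'm working with some genetic data, and one of my columns isn't in the format I want. I don't know how much biology gets discussed here, but I'm trying to fix how the amino acids are displayed in my data.

    Amino acids have a full name, but they also have a 3-letter abbreviation and a 1-letter abbreviation. My data has the amino acids in 3-letter form, and I want to change them to the 1-letter abbreviation. Here is a sample of my data.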

     chr location           effect   impact AA_change
       1    12543 missense_variant MODERATE  Ala12Val
       1    52367 missense_variant MODERATE  Leu54Pro
       1   752347 missense_variant MODERATE  Met99Ser
       1   984645 missense_variant MODERATE  Lys34Ile
       1   989845 missense_variant MODERATE   Arg4Cys
       1   999854 missense_variant MODERATE  His43Gly
       1   999855 missense_variant MODERATE  Glu14Phe
    
    dat <- structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), location = c(12543L, 
    52367L, 752347L, 984645L, 989845L, 999854L, 999855L), effect = c("missense_variant", 
    "missense_variant", "missense_variant", "missense_variant", "missense_variant", 
    "missense_variant", "missense_variant"), impact = c("MODERATE", 
    "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE"
    ), AA_change = c("Ala12Val", "Leu54Pro", "Met99Ser", "Lys34Ile", 
    "Arg4Cys", "His43Gly", "Glu14Phe")), .Names = c("chr", "location", 
    "effect", "impact", "AA_change"), row.names = c(NA, -7L), class = "data.frame")
    

    Here is a list of the 3-letter amino acid codes and the 1-letter abbreviation for each.

      Ala == A
      Arg == R
      Asn == N
      Asp == D
      Cys == C
      Glu == E
      Gln == Q
      Gly == G
      His == H
      Ile == I
      Leu == L
      Lys == K
      Met == M
      Phe == F
      Pro == P
      Ser == S
      Thr == T
      Trp == W
      Tyr == Y
      Val == V
    

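    I feel like there should be a simple function for this, but I'm struggling to work out how to do it. I'm used to changing only one part of a column, not two things at once. So what I'm asking is how to change this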

    Ala12Val
    Leu54Pro
    Met99Ser
    Lys34Ile
    Arg4Cys
    His43Gly
    Glu14Phe
    

    to this

    A12V
    L54P
    M99S
    K34I
    R4C
    H43G
    E14F
    

    Is this possible?

    3 answers  |  latest 6 years ago
        1
  •  2
  •   zx8754 John Colby    6 years ago

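    Build a lookup for the amino acids. Then take the first 3 letters as a substring and map them, extract the digits, and take the last 3 letters as a substring and map them. Finally, paste the three pieces together.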

    # lookup map
    AAmap <- setNames(c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
                      c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile","Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))
    
    # get first 3 map to AA, get digits, get last 3 map to AA
    dat$AA_change_short <-
      paste0(AAmap[ substr(dat$AA_change, 1, 3) ],
             gsub("[^\\d]+", "", dat$AA_change, perl = TRUE),
             AAmap[ substr(dat$AA_change, nchar(dat$AA_change) - 2, nchar(dat$AA_change)) ])
    
    dat
    #   chr location           effect   impact AA_change AA_change_short
    # 1   1    12543 missense_variant MODERATE  Ala12Val            A12V
    # 2   1    52367 missense_variant MODERATE  Leu54Pro            L54P
    # 3   1   752347 missense_variant MODERATE  Met99Ser            M99S
    # 4   1   984645 missense_variant MODERATE  Lys34Ile            K34I
    # 5   1   989845 missense_variant MODERATE   Arg4Cys             R4C
    # 6   1   999854 missense_variant MODERATE  His43Gly            H43G
    # 7   1   999855 missense_variant MODERATE  Glu14Phe            E14F
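
    The lookup above yields NA for any code that is not in AAmap, and paste0() then turns that into the literal text "NA". A minimal sketch of a stricter variant follows, assuming every valid value has the form three letters, digits, three letters. `shorten_aa` is a hypothetical helper name introduced here for illustration; it is not part of the original answer.

```r
# Same lookup vector as above: names are 3-letter codes, values are 1-letter codes
AAmap <- setNames(c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
                  c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile","Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))

# shorten_aa() is a hypothetical helper: it returns NA for any value that is
# not <3 letters><digits><3 letters>, or that uses a code missing from the map
shorten_aa <- function(x, map = AAmap) {
  # regexec() yields the full match followed by the three capture groups
  parts <- regmatches(x, regexec("^([A-Za-z]{3})([0-9]+)([A-Za-z]{3})$", x))
  vapply(parts, function(p) {
    if (length(p) != 4) return(NA_character_)            # pattern did not match
    from <- map[p[2]]
    to   <- map[p[4]]
    if (is.na(from) || is.na(to)) return(NA_character_)  # unknown code
    paste0(from, p[3], to)
  }, character(1), USE.NAMES = FALSE)
}

shorten_aa(c("Ala12Val", "Lys34Ile", "Xyz1Ala", "p.Ala12Val"))
# [1] "A12V" "K34I" NA     NA
```

    Returning NA rather than a string such as "NA1A" makes malformed rows easy to find afterwards with is.na().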
    
        2
  •  2
  •   Onyambu    6 years ago
    # mapping table: V1 = three-letter code, V3 = one-letter code (the "==" column is dropped)
    dat2 = read.table(text="Ala == A
      Arg == R
      Asn == N
      Asp == D
      Cys == C
      Glu == E
      Gln == Q
      Gly == G
      His == H
      Ile == I
      Leu == L
      Lys == K
      Met == M
      Phe == F
      Pro == P
      Ser == S
      Thr == T
      Trp == W
      Tyr == Y
      Val == V", stringsAsFactors = FALSE)[-2]
    
    # find the runs of non-digits (the two three-letter codes) in each value
    m = gregexpr("\\D+", dat$AA_change)
    
    # replace each code with its one-letter form, looked up in the order
    # the codes occur in the string, so Lys34Ile becomes K34I and not I34K
    dat$AA_change1 = dat$AA_change
    regmatches(dat$AA_change1, m) = lapply(regmatches(dat$AA_change, m),
                                           function(x) dat2$V3[match(x, dat2$V1)])
    
    dat
      chr location           effect   impact AA_change AA_change1
    1   1    12543 missense_variant MODERATE  Ala12Val       A12V
    2   1    52367 missense_variant MODERATE  Leu54Pro       L54P
    3   1   752347 missense_variant MODERATE  Met99Ser       M99S
    4   1   984645 missense_variant MODERATE  Lys34Ile       K34I
    5   1   989845 missense_variant MODERATE   Arg4Cys        R4C
    6   1   999854 missense_variant MODERATE  His43Gly       H43G
    7   1   999855 missense_variant MODERATE  Glu14Phe       E14F
    
        3
  •  2
  •   Frank    6 years ago

    If it is always in the form acid, number, acid, you can split it into three columns and use match or a join. With data.table, this looks like...

    library(data.table)
    setDT(dat)
    
    # put your mapping into a nicer format
    abbrDT = fread(header = FALSE,"
      Ala == A
      Arg == R
      Asn == N
      Asp == D
      Cys == C
      Glu == E
      Gln == Q
      Gly == G
      His == H
      Ile == I
      Leu == L
      Lys == K
      Met == M
      Phe == F
      Pro == P
      Ser == S
      Thr == T
      Trp == W
      Tyr == Y
      Val == V")[, .(abbr3 = V1, abbr1 = V3)] 
    
    # split the column
    patt = "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"
    dat[, c("AA1", "num", "AA2") := tstrsplit(AA_change, patt, perl=TRUE)]
    
    # substitute for each part
    dat[abbrDT, on=.(AA1 = abbr3), AA1 := abbr1]
    dat[abbrDT, on=.(AA2 = abbr3), AA2 := abbr1]
    

    which gives

       chr location           effect   impact AA_change AA1 num AA2
    1:   1    12543 missense_variant MODERATE  Ala12Val   A  12   V
    2:   1    52367 missense_variant MODERATE  Leu54Pro   L  54   P
    3:   1   752347 missense_variant MODERATE  Met99Ser   M  99   S
    4:   1   984645 missense_variant MODERATE  Lys34Ile   K  34   I
    5:   1   989845 missense_variant MODERATE   Arg4Cys   R   4   C
    6:   1   999854 missense_variant MODERATE  His43Gly   H  43   G
    7:   1   999855 missense_variant MODERATE  Glu14Phe   E  14   F
    

    Or, to combine the columns again and remove the ones you no longer need:

    dat[, AA_change := paste0(AA1, num, AA2)]
    
    dat[, c("AA1", "num", "AA2") := NULL]
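
    The match route mentioned at the start of this answer can also be sketched in base R, without data.table. This is an assumption about what that route might look like, not code from the answer. The names `abbr`, `patt3`, `parts`, and `short` are illustrative, and the sketch assumes every value is acid, number, acid.

```r
# Illustrative two-column mapping table (3-letter code, 1-letter code)
abbr <- data.frame(
  abbr3 = c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile",
            "Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"),
  abbr1 = c("A","R","N","D","C","E","Q","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V"),
  stringsAsFactors = FALSE
)

# The AA_change values from the question
AA_change <- c("Ala12Val", "Leu54Pro", "Met99Ser", "Lys34Ile",
               "Arg4Cys", "His43Gly", "Glu14Phe")

# Split each value into its three parts with capture groups
patt3 <- "^(\\D+)(\\d+)(\\D+)$"
parts <- data.frame(
  AA1 = sub(patt3, "\\1", AA_change),
  num = sub(patt3, "\\2", AA_change),
  AA2 = sub(patt3, "\\3", AA_change),
  stringsAsFactors = FALSE
)

# match() gives the row of abbr for each code; use it to index abbr1
short <- paste0(abbr$abbr1[match(parts$AA1, abbr$abbr3)],
                parts$num,
                abbr$abbr1[match(parts$AA2, abbr$abbr3)])
short
# [1] "A12V" "L54P" "M99S" "K34I" "R4C"  "H43G" "E14F"
```

    Because match() returns positions in the order of its first argument, the one-letter codes line up with the original rows, which is the same effect the two update joins have above.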