代码之家  ›  专栏  ›  技术社区  ›  user9740934

R-在定义的阈值上选择矩阵列表中存在的相同行的时间高效方法

  •  1
  • user9740934  · 技术社区  · 6 年前

    我有一个包含68个矩阵的列表。每个矩阵基本上是一个边列表,由三列和数千行组成。前两列分别命名为Node1和Node2,包含基因名称。每一行代表图形中的一条边,即基因之间的相互作用。第三列包含每条边的权重。

    目标是获得一个最终表格,其中75%或更多矩阵中存在的具有不同权重的边折叠成一行。每个最终边的权重将对应于相同边权重的平均值。

    我想知道一种更省时的代码,用于比较大型矩阵和百万行。

    实例

    1. 矩阵

      edgelist1<-matrix(data = c("ABCD1","EFGH1","DFEC","JEKC4",0.1314,1.1231),nrow = 2,ncol = 3,dimnames = list(c(),c("Node1","Node2","Weight")))
      edgelist1
      
      edgelist2<-matrix(data = c("ABCD1","DEIR3","CGESL","DFEC","KMN3","PME2",1.7564,0.6573,0.5478),nrow = 3,ncol = 3,dimnames = list(c(),c("Node1","Node2","Weight")))
      edgelist2
      
      edgelist3<-matrix(data = c("ACCD1","DEIR3","GUESL","DFEC","KMN3","PMKE2",1.264,0.8573,0.7458),nrow = 3,ncol = 3,dimnames = list(c(),c("Node1","Node2","Weight")))
      edgelist3
      
      edgelist4<-matrix(data = c("KPF2","NDM1","GUESL","ABCD1","KMN3","PMKE2","LTRC5","DFEC",1.142,0.9273,0.1358,0.3456),nrow = 4,ncol = 3,dimnames = list(c(),c("Node1","Node2","Weight")))
      edgelist4
      
    2. 列表

      list<-list(edgelist1,edgelist2,edgelist3,edgelist4)
      
    3. 所需输出

      finaledgelist<-matrix(c("ABCD1","DFEC","0.7445"),nrow=1,ncol = 3,dimnames = list(c(),c("Node1","Node2","Weight")))
      finaledgelist
      

    我的代码

    #Combining all edgelists into one
    alledges<-do.call(rbind,list)
    
    #Merging column 1 and column 2
    alledges<-data.frame(list(Edges=paste(alledges[,1],alledges[,2]),Weights=alledges[,3]))
    
    #Table to see the frequencies of appearance of each edge
    as.data.frame(table(alledges$Edge))->frequencies
    
    # Selection of the edges present in 75% or more of the original edgelists
    frequencies[frequencies$Freq>=3,]->selection
    
    #Selection of each edge that appears three or more times
    alledges[alledges$Edge %in% selection$Var1,]->repeated
    
    #Collapse by edge name and compute mean of the weights
    finaledgelist<-repeated %>%
      group_by(Edges) %>%
      dplyr::summarize(Weights=mean(as.numeric(as.character(Weights)), na.rm = TRUE))
    
    #Final edge list as data frame
    finaledgelist<-as.data.frame(cbind(Node1=unlist(strsplit(as.vector(finaledgelist$Edges),split=" "))[2*(1:nrow(finaledgelist))-1],Node2=unlist(strsplit(as.vector(finaledgelist$Edges),split=" "))[2*(1:nrow(finaledgelist))],Weights=finaledgelist$Weights))
    finaledgelist$Weights<-as.numeric(as.character(finaledgelist$Weights))
    
    2 回复  |  直到 6 年前
        1
  •  1
  •   missuse    6 年前

    下面是一种使用tidyverse的方法

    library(tidyverse)
    
    do.call(rbind, list1) %>% #bind all matrices together
      as.data.frame %>% #convert to data frame
      group_by(Node1, Node2) %>% #group by nodes
      mutate(n1 = n()) %>% #count members of each group
      filter(n1 >= (0.75 * length(list1))) %>% #filter those that are present in less than 75% of list elements
      summarise(weight = mean(as.numeric(as.character(Weight)))) #get mean weight for those that are left
    
    #output#
    A tibble: 1 x 3
    # Groups: Node1 [?]
      Node1 Node2 weight
      <fct> <fct>  <dbl>
    1 ABCD1 DFEC   0.744
    
        2
  •  0
  •   IceCreamToucan    6 年前

    与missuse的方法相同,但由于您特别询问了效率,下面是 data.table 版本

    list1 <- list(edgelist1, edgelist2, edgelist3, edgelist4) %>% lapply(as.data.frame, stringsAsFactors = F)
    dt <- rbindlist(list1)
    
    dt[dt[, pct := .N/length(list1), by = .(Node1, Node2)]$pct >= 0.75
       , .(Weight = mean(as.numeric(Weight)))
       , by = .(Node1, Node2)]
    
    #    Node1 Node2    Weight
    # 1: ABCD1  DFEC 0.7444667
    

    基准

    f1 <- function(){
    list1 <- list(edgelist1, edgelist2, edgelist3, edgelist4) %>% lapply(as.data.frame, stringsAsFactors = F)
    dt <- rbindlist(list1)
    
    dt[dt[, pct := .N/length(list1), by = .(Node1, Node2)]$pct >= 0.75
       , .(Weight = mean(as.numeric(Weight)))
       , by = .(Node1, Node2)]
    }
    
    f2 <- function(){
      do.call(rbind, list1) %>% #bind all metrics together
      as.data.frame %>% #convert to data frame
      group_by(Node1, Node2) %>% #group by nodes
      mutate(n1 = n()) %>% #count members of each group
      filter(n1 >= (0.75 * length(list1))) %>% #filter those that are present in less than 75% of list elements
      summarise(weight = mean(as.numeric(as.character(Weight)))) #get mean weight for those that are left
    }
    
    library(microbenchmark)
    
    microbenchmark(f1(), f2())
    
    # Unit: milliseconds
    # expr      min       lq      mean    median        uq       max neval
    # f1() 1.817024 2.207588  3.715193  2.718768  3.631382  33.88879   100
    # f2() 7.789532 9.990557 16.287901 12.058657 15.876705 347.46884   100