
Fastest way to convert data.table factor columns to character columns

  • Matt Summersgill  · Tech Community  · 6 years ago

    TL;DR: What is the fastest way to convert factor columns in a data.table to character columns?

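    As a preface, I suspect this question has been answered elsewhere, but I am struggling to find anything definitive. I would be very grateful if someone could point me in the right direction, and I would also welcome a thorough answer here.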

    I often need to convert all factor columns of a data.table to character columns. I have seen Matt Dowle suggest using set() for this task in a comment on this answer.

    However, when I ran some tests, I was surprised by the results. Here is what I did:

    Test data generation

    library(data.table)
    set.seed(1234)
    
    ## Create a vector of 1 million 4 character strings
    ## with 456,976 possible unique values 
    DiverseSize <- 1e6
    Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
                      sample(letters,DiverseSize,replace = TRUE),
                      sample(letters,DiverseSize,replace = TRUE),
                      sample(letters,DiverseSize,replace = TRUE))
    
    ## Create a vector of 10 million single character strings
    ## with 26 possible unique values
    CommonSize  <- 1e7
    Common <-  sample(LETTERS,CommonSize,replace = TRUE)
    
    ## Mix them into a data.table columns, "x0" through "x9"
    DT_Original<- data.table(x0 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x1 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x2 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x3 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x4 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x5 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x6 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x7 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x8 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)),
                             x9 = as.factor(sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE)))
    
    DT1 <- copy(DT_Original)
    DT2 <- copy(DT_Original)
    DT3 <- copy(DT_Original)
    

    Functions

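    ## Base R version: replaces each factor column with `[[<-`
    ## and returns the modified object
    unfactorize <- function(df){
      for(i in which(sapply(df, class) == "factor")) df[[i]] = as.character(df[[i]])
      return(df)
    }
    
    ## data.table::set version: assigns each converted column by reference,
    ## so the caller's table is modified in place
    set_unfactorize <- function(df){
      for(col in names(df)[which(sapply(df, class) == "factor")]) set(df, j = col, value = as.character(df[[col]]))
    }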
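
For comparison, another commonly used data.table idiom converts the columns with `:=` and `.SDcols`. The sketch below is not part of my original test; the helper name `sd_unfactorize` and the tiny `demo` table are hypothetical, and I make no claim about how this variant performs relative to the two functions above.

```r
library(data.table)

# Hypothetical helper (not from the original test): convert all factor
# columns to character by reference using `:=` and `.SDcols`.
sd_unfactorize <- function(dt) {
  # is.factor() also catches ordered factors, whose class has two elements
  factor_cols <- names(dt)[vapply(dt, is.factor, logical(1))]
  if (length(factor_cols) > 0L) {
    dt[, (factor_cols) := lapply(.SD, as.character), .SDcols = factor_cols]
  }
  invisible(dt)
}

# Small illustrative table: one factor column, one integer column
demo <- data.table(f = factor(c("a", "b", "a")), n = 1:3)
sd_unfactorize(demo)
```

After the call, `demo$f` is a character column and `demo$n` is left untouched.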
    

    Execution

    ## Original
    DT1 <- unfactorize(DT1)
    ## data.table::set version
    set_unfactorize(DT2)
    ## Outside of function
    for(col in names(DT3)[which(sapply(DT3, class) == "factor")]) set(DT3, j = col, value = as.character(DT3[[col]]))
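
Before comparing speed, it may be worth confirming that the approaches produce the same result. The following is a minimal sketch on a much smaller, hypothetical table (`small`, with 100,000 rows) so that it runs quickly. The timings from `system.time()` are only a rough indication and are not a substitute for the full profiling run.

```r
library(data.table)

# Hypothetical small-scale data (not the 11-million-row test tables)
set.seed(1234)
n <- 1e5
small <- data.table(a = as.factor(sample(LETTERS, n, replace = TRUE)),
                    b = as.factor(sample(letters, n, replace = TRUE)),
                    c = runif(n))

# Same logic as the two functions defined earlier
unfactorize <- function(df) {
  for (i in which(sapply(df, class) == "factor")) df[[i]] <- as.character(df[[i]])
  df
}
set_unfactorize <- function(df) {
  for (col in names(df)[which(sapply(df, class) == "factor")]) {
    set(df, j = col, value = as.character(df[[col]]))
  }
}

s1 <- copy(small)
s2 <- copy(small)

# Rough timings; values depend on machine and R/data.table versions
t_base <- system.time(s1 <- unfactorize(s1))
t_set  <- system.time(set_unfactorize(s2))

# Check that both approaches yield identical character columns
same_result <- identical(s1$a, s2$a) && identical(s1$b, s2$b)
print(rbind(base = t_base[1:3], set = t_set[1:3]))
print(same_result)
```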
    

    Profiling results

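    I was very surprised by the results: the base R version appears to run fastest and use the least memory, even though I expected it to require a copy. Is that really the case, or am I missing something?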

    [Figure: Profiling Comparison]
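
One way to probe whether a copy is involved is to compare the memory address of the table before and after each conversion with `data.table::address()`. This is a minimal sketch on tiny, hypothetical tables (`demo_base`, `demo_set`), not a full answer. Note that `address()` here identifies the table object itself, not its column vectors, and that in both approaches `as.character()` has to allocate a new character vector for every converted column. My understanding is that when base R does copy the table in this situation, the copy is shallow (the list of column pointers, not the column data), which would keep its cost small, but I have not verified that this explains the profile.

```r
library(data.table)

demo_base <- data.table(f = factor(c("a", "b", "a")), n = 1:3)
demo_set  <- copy(demo_base)

# Base R style: `[[<-` inside a function, then reassign the result
unfactorize <- function(df) {
  for (i in which(sapply(df, class) == "factor")) df[[i]] <- as.character(df[[i]])
  df
}
addr_base_before <- address(demo_base)
demo_base <- unfactorize(demo_base)
addr_base_after <- address(demo_base)

# data.table::set: replaces the column inside the existing table
addr_set_before <- address(demo_set)
for (col in names(demo_set)[which(sapply(demo_set, class) == "factor")]) {
  set(demo_set, j = col, value = as.character(demo_set[[col]]))
}
addr_set_after <- address(demo_set)

# TRUE means the table object itself was not replaced by a new one
print(c(base = addr_base_before == addr_base_after,
        set  = addr_set_before == addr_set_after))
```

For a closer look at where memory is allocated, base R's `tracemem()` can be applied to the table in the same way.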
