
Calculate one variable as a percentage of another for each row in data.table

  •  3
  • Juan Antonio Roldán Díaz  · Tech Community  · 6 years ago

    I'm looking for a way to optimize an aggregation on a data.table. I have several million rows and my current implementation is slow.

    library(data.table)
    df <- data.table(Factor = as.factor(rep(LETTERS[1:3], 3)),
                     Variable = 1:9)
    

    Current implementation:

    aux <- df[, .(sumVar = sum(Variable)/sum(df$Variable)), by = .(Factor)]
    df[aux, sumVar := sumVar, on = .(Factor = Factor)]
    

    Expected output:

    > df
       Factor Variable    sumVar
    1:      A        1 0.2666667
    2:      B        2 0.3333333
    3:      C        3 0.4000000
    4:      A        4 0.2666667
    5:      B        5 0.3333333
    6:      C        6 0.4000000
    7:      A        7 0.2666667
    8:      B        8 0.3333333
    9:      C        9 0.4000000
    

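    I think my problem is the merge, but I don't know how to improve it. I'm not familiar with dplyr, and I haven't found a way to do the whole operation in a single step with data.table.

    Thanks for your help!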

    3 answers  |  latest 6 years ago
        1
  •  2
  •   fidelin    6 years ago

    Something like this:

    df[ , ':='(sumVar = sum(Variable)/sum(df$Variable)), by = .(Factor)] 
    
        2
  •  4
  •   Hugh    6 years ago

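    There is a lot of repetition in your example, so I'm not sure I've interpreted it correctly. Still, it's best to compute the denominator only once and to use gsum: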

    BigTotal <- df[, sum(Variable)]
    df[, sumVar1 := sum(Variable), by = .(Factor)][, propVar := sumVar1 / BigTotal]
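
    As a quick sanity check, here is a minimal sketch that runs the same idea on the 9-row example from the question and compares it with the join-based version. The column name propVar is taken from the snippet above; folding the division into a single `:=` call and the comparison at the end are additions for illustration only.

```r
library(data.table)

# The 9-row example from the question
df <- data.table(Factor = as.factor(rep(LETTERS[1:3], 3)),
                 Variable = 1:9)

# Compute the denominator once, outside the grouped call
BigTotal <- df[, sum(Variable)]

# Grouped sum divided by the precomputed total, added by reference
df[, propVar := sum(Variable) / BigTotal, by = Factor]

# Reference: the aggregate-then-join approach from the question
aux <- df[, .(sumVar = sum(Variable) / BigTotal), by = Factor]
df[aux, sumVar := i.sumVar, on = "Factor"]

# Both columns should agree row for row
stopifnot(isTRUE(all.equal(df$propVar, df$sumVar)))
print(df)
```

    The point of the sketch is that `sum(df$Variable)` no longer appears inside the `by` expression, so it is not re-evaluated for every group.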
    

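    # Note: Variable is integer here. With 10^8 rows its total (about 5e10)
    # exceeds .Machine$integer.max, so sum() over the whole column returns NA
    # with a warning; wrap sample() in as.numeric() if the values matter.
    df <- data.table(
      Factor = as.factor(sample(LETTERS, size = 10^8, replace = TRUE)),
      Variable = sample(10^3, size = 10^8, replace = TRUE)
    )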
    
    microbenchmark::microbenchmark(dt1 = {
      aux <- df[, .(sumVar = sum(Variable)/sum(df$Variable)), keyby = .(Factor)]
      df[aux, sumVar := sumVar, on = .(Factor = Factor)]
    },
    dt2 = {
    BigTotal <- df[, sum(Variable)]
    df[, sumVar1 := sum(Variable), by = .(Factor)][, propVar := sumVar1 / BigTotal]
    }, 
    times = 2)
    
    
    Unit: seconds
     expr      min       lq     mean   median       uq      max neval cld
      dt1 9.523696 9.523696 9.567555 9.567555 9.611414 9.611414     2   b
      dt2 3.996581 3.996581 4.521274 4.521274 5.045967 5.045967     2  a 
    
        3
  •  2
  •   Ben    6 years ago

    What kind of data and timings do you have? With the 100M-row example below, I get the following timings:

    library(data.table)
    library(magrittr)  # provides %>% for the dplyr examples below
    
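    # Note: Variable is integer here, so sum(df$Variable) over 10^8 rows
    # (about 5e10) overflows and returns NA with a warning.
    df <- data.table(
      Factor = as.factor(sample(LETTERS, size = 10^8, replace = TRUE)),
      Variable = sample(10^3, size = 10^8, replace = TRUE)
    )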
    
    # data.table solution 1
    system.time({
      aux <- df[, .(sumVar = sum(Variable)/sum(df$Variable)), by = .(Factor)]
      df[aux, sumVar := sumVar, on = .(Factor = Factor)]
    })  # ~10.5 seconds
    
    # data.table solution 2
    system.time({
      df[, sumVar := sum(Variable)/sum(df$Variable), by = Factor]
    })  # ~8.3 seconds
    
    # dplyr solution 1
    system.time({
      df %>% dplyr::group_by(Factor) %>% dplyr::mutate(A=sum(Variable)/sum(df$Variable))
    })  # ~10.0 seconds
    

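    # Note: as.character(10^6) is the single string "1e+06", so every row gets
    # the same Factor level (one group). as.character(sample(10^6, ...)) would
    # give up to 10^6 groups instead; the timings below are for the code as written.
    df <- data.table(
      Factor = as.factor(sample(as.character(10^6), size = 10^8, replace = TRUE)),
      Variable = as.numeric(sample(10^3, size = 10^8, replace = TRUE))
    )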
    
    # data.table solution 1
    system.time({
      aux <- df[, .(sumVar = sum(Variable)/sum(df$Variable)), by = .(Factor)]
      df[aux, sumVar := sumVar, on = .(Factor = Factor)]
    })  # ~5.0 seconds
    
    # data.table solution 2
    system.time({
      df[, sumVar := sum(Variable)/sum(df$Variable), by = Factor]
    })  # ~3.1 seconds
    
    # dplyr solution 1
    system.time({
      df %>% dplyr::group_by(Factor) %>% dplyr::mutate(A=sum(Variable)/sum(df$Variable))
    })  # ~6.9 seconds
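
    Before comparing timings, it can help to confirm on a small table that the two data.table solutions produce the same column. The following is only a sketch under assumptions: the seed, the row count of 1e5, and the column names sumVar_join and sumVar_by are made up for illustration and are not part of the benchmarks above.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 1e5      # far smaller than the 10^8 rows used in the timings

df <- data.table(
  Factor = as.factor(sample(LETTERS, size = n, replace = TRUE)),
  Variable = sample(10^3, size = n, replace = TRUE)
)

# Solution 1: aggregate per group, then join the result back onto df
aux <- df[, .(p = sum(Variable) / sum(df$Variable)), by = Factor]
df[aux, sumVar_join := i.p, on = "Factor"]

# Solution 2: grouped := in a single call
df[, sumVar_by := sum(Variable) / sum(df$Variable), by = Factor]

# Both columns should match row for row
same <- isTRUE(all.equal(df$sumVar_join, df$sumVar_by))
print(same)
```

    At this size the integer total stays well below `.Machine$integer.max`, so no conversion to numeric is needed.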