
Filter an R data.table by groups created from the cumulative sum of a column

cumsum data.table filter r

Neal Barsch · 7 years ago

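I need an efficient data.table solution that keeps the first and last row of each group, where a new group starts every time the cumulative sum of a column passes 300. My real dataset has millions of rows, so I am not looking for a loop solution.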

#Example data:
  library(data.table)
  dt <- data.table(idcolref=c(1:1000),y=rep(10,1000))

Below is an example loop that does what I want, but it is far too slow to be useful on a large data.table.

###example of a loop that produces the result I want but is too slow
  library(foreach)
  dt[,grp:=1,]
  dt[,cumsum:=0,]
  grp <- 1
  foreach(a=2:nrow(dt))%do%{
    dt[a,"cumsum"]<-dt[a,"y"]+dt[a-1,"cumsum"]
    if(dt[a,"cumsum"]>300){
      dt[a,"grp"] <- grp
      grp <- grp+1
      dt[a,"cumsum"]<-0
    }else{
      dt[a,"grp"]<-dt[a-1,"grp"]
    }
  }
  dt.desired <- foreach(a=2:nrow(dt),.combine=rbind)%do%{
    if(dt[a,"grp"]!=dt[a-1,"grp"]){
      dt[c(a-1,a),]
    }
  }
  dt.desired <- rbind(dt[1,],dt.desired)
  dt.desired <- rbind(dt.desired,dt[nrow(dt),])

How can I get the same result with fast, vectorized data.table functions? Thanks!

2 answers

SymbolixAU Adam Erickson · 7 years ago

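I think I have interpreted your requirements correctly:

You want to compute the cumulative sum of a vector (column).
When the cumulative sum exceeds 300, you want to reset it to 0.
Each time it resets to 0, you want the following values to start a new group.
You want to select the first and last row of each group.

If that is the case, you can write the grouping function in Rcpp: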

library(data.table)

dt <- data.table(x=rep(5,1e7),y=rep(10,1e7))
## adding a row index to keep track of which rows are returned
dt[, id := .I]

library(Rcpp)

cppFunction('Rcpp::NumericVector findGroupRows(Rcpp::NumericVector x) {

  double cumsum = 0;  // double, so non-integer values in x are not truncated
  int grpCounter = 0;
  size_t n = x.length();
  Rcpp::NumericVector groupedCumSum(n);

  for ( size_t i = 0; i < n; i++) {
    cumsum += x[i];
    if (cumsum > 300) {
      cumsum = 0;
      grpCounter++;
    }
    groupedCumSum[i] = grpCounter;
  }
  return groupedCumSum;
}')

dt[, grp := findGroupRows(y)]

dt[ dt[, .I[c(1, .N)], by = grp]$V1]
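
For checking the grouping on a handful of rows without compiling anything, a plain-R version of the same reset rule can be handy. This is only a sketch and not part of the answer: `reset_groups`, its `limit` argument, and the `small` table are hypothetical names made up for illustration, and the rule is assumed to be the strict `> 300` test used in the Rcpp function.

```r
library(data.table)

# Hypothetical helper: plain-R version of the reset rule, meant for small checks
reset_groups <- function(x, limit = 300) {
  grp <- integer(length(x))
  running <- 0
  counter <- 0L
  for (i in seq_along(x)) {
    running <- running + x[i]
    if (running > limit) {   # same strict ">" test as the Rcpp function
      running <- 0
      counter <- counter + 1L
    }
    grp[i] <- counter
  }
  grp
}

# Small illustrative table (assumed data, same shape as the question's example)
small <- data.table(id = 1:100, y = rep(10, 100))
small[, grp := reset_groups(y)]

# First and last row of every group, selected by row index
res <- small[small[, .I[c(1, .N)], by = grp]$V1]
res
```

With these 100 rows the first group covers ids 1 to 30, and each later full group covers 31 rows, because the row that pushes the sum past 300 is assigned to the new group.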

Stefan F · 7 years ago

Using data.table and base R functions:

dt[, grp := (cumsum(y) - 1) %/% 300]

# straight forward solution:
dt[, .SD[c(1, .N)], by = "grp"]

# more efficient for large datasets, as suggested by SymbolixAU
dt[ dt[, .I[c(1, .N)], by = "grp"]$V1]

# check if your groups are of the correct size
table(dt[, .N[[1]], by = "grp"]$V1)

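%/% is integer division.
.SD is the subset of the data.table for the current group.
.N is the number of rows in the current group (the same as nrow(.SD)).
The -1 ensures that the first group has the correct size.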
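
To see what the -1 does, it may help to compare the group sizes with and without it on a small table. This is a sketch under assumptions: the `small` table and the column names `with_offset` and `no_offset` are made up for illustration, and the data mimic the question's example (every `y` equal to 10).

```r
library(data.table)

# Small illustrative table (assumed data): 90 rows, cumulative sum runs 10..900
small <- data.table(id = 1:90, y = rep(10, 90))

# Integer division of the running total, with and without the -1 offset
small[, with_offset := (cumsum(y) - 1) %/% 300]
small[, no_offset := cumsum(y) %/% 300]

# Number of rows per group under each variant
sizes_with <- small[, .N, by = with_offset]$N
sizes_without <- small[, .N, by = no_offset]$N

sizes_with
sizes_without
```

Without the offset, the row where the running total is exactly 300 already falls into the next group, so the first group is one row short. Note also that this approach bins the overall running total rather than resetting it, so on this data every group has 30 rows, whereas the strict `> 300` reset in the Rcpp answer gives 31 rows per group after the first. Which behaviour is wanted depends on the intended rule.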
