Looping over consecutive records to check a condition and replace values in R

replace loops r

Cina · 6 years ago

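My dataset consists of user, time and condition. For every sequence that starts with a FALSE followed by two or more consecutive TRUEs, I want to replace the time of each row in the sequence with the time of the last consecutive TRUE.

For example, given df: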

df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)

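My expected result: the time of row 6 is copied to rows 3 through 6, because the consecutive TRUEs run from row 4 to row 6. The same applies to the last three records.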

user time   condition
11  1:05    FALSE
11  1:10    TRUE
11  1:25    FALSE
11  1:25    TRUE
11  1:25    TRUE
11  1:25    TRUE
11  1:40    FALSE
22  2:20    FALSE
22  2:40    FALSE
22  2:40    TRUE
22  2:40    TRUE

How can I do this in R?

3 answers | up to 6 years ago

LachlanO · 6 years ago

This solution should do it; see the comments in the code for more detail.

false_positions <- which(!c(df$condition, FALSE)) #Flag the position of each of the FALSE occurrences
                                                  #A dummy FALSE is put on the end to check for end of dataframe

false_differences <- diff(false_positions, 1)     #Calculate how far each FALSE occurrence is from the next

false_starts <- which(false_differences > 2)      #Grab which of these FALSE differences are more than 2 apart
                                                  #Greater than 2 indicates 2 or more TRUEs as the first FALSE
                                                  #counts as one position

#false_starts stores the beginning of each chain we want to update

#Go through each of the FALSE starts which have more than one consecutive TRUE
for(false_start in false_starts){

  false_first <- false_positions[false_start]     #Gets the position of the start of our chain

  true_last <- false_positions[false_start+1]-1   #Gets the position of the end of our chain, which is
                                                  #the item before (thus the -1) the FALSE after our
                                                  #initial FALSE (thus the +1)

  time_override <- df$time[true_last]             #Now we know the position of the end of our chain (the last TRUE)
                                                  #We can get the time we want to use

  df$time[false_first:true_last] <- time_override #Update all the times from the start to end of our chain with
                                                  #the time we just determined

}

> df
   user time condition
1    11 1:05     FALSE
2    11 1:10      TRUE
3    11 1:25     FALSE
4    11 1:25      TRUE
5    11 1:25      TRUE
6    11 1:25      TRUE
7    11 1:40     FALSE
8    22 2:20     FALSE
9    22 2:40     FALSE
10   22 2:40      TRUE
11   22 2:40      TRUE

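I would have liked to parallelize the loop at the bottom if possible, but off the top of my head I'm struggling to see how.

The gist is to find where all our FALSEs are, then work out where all our chains start. Since we only have TRUEs and FALSEs, we can do this by looking at how far apart our FALSEs are!

Once we know where our chains start (each one is a FALSE that is far enough from the next FALSE), we can get the end of the chain by looking at the element just before the next FALSE in the list of all FALSEs we already built.

Now that we have the start and end of each chain, we just look at the end of the chain to get the time we want, then fill in the time values!

I hope this is a relatively quick way to do what you want, though :)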
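One possible way to drop the explicit loop, sketched here under the assumption that time is read in as character, is to compute every chain's start and end from the same FALSE positions and then assign all rows in one step. The names gaps, starts, ends, rows and src are illustrative and not part of the original answer.

```r
df <- read.csv(text = "user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header = TRUE, stringsAsFactors = FALSE)

# Positions of every FALSE, plus a dummy FALSE one past the last row
false_positions <- which(!c(df$condition, FALSE))   # 1 3 7 8 9 12

# Distance from each FALSE to the next one
gaps <- diff(false_positions)                        # 2 4 1 1 3

# A gap greater than 2 means the FALSE is followed by two or more TRUEs
is_chain <- gaps > 2
starts <- false_positions[-length(false_positions)][is_chain]  # 3 9
ends   <- false_positions[-1][is_chain] - 1                    # 6 11

# Expand each chain into its row indices, and pair every row with
# the index of the last TRUE of its chain
rows <- unlist(Map(seq, starts, ends))
src  <- rep(ends, ends - starts + 1)

# Single vectorized assignment instead of the for loop
df$time[rows] <- df$time[src]
```

The work inside Map is only index generation, so the actual update of df$time happens once rather than once per chain.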

Gabe · 6 years ago

Here is an option using rle

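## Run-length encoding of df$condition
df_rle <- rle(df$condition)
## Runs of 2 or more consecutive TRUEs in the RLE
seq_changes <- which(df_rle$lengths >= 2 & df_rle$values)
## Skip a TRUE run at the very start of the data: it has no preceding FALSE
seq_changes <- seq_changes[seq_changes > 1]
## End-point index of each run in the original data frame
df_ind <- cumsum(df_rle$lengths)

## Loop over the runs to change
for (i in seq_changes){
  i1 <- df_ind[i-1]   # row of the FALSE just before the TRUE run
  i2 <- df_ind[i]     # row of the last TRUE in the run
  df$time[i1:i2] <- df$time[i2]
}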
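To make the indexing above easier to follow, here is a small sketch of what rle produces for the example condition column and how the cumulative run lengths map back to row numbers. The variable names runs, run_end, idx and chains are illustrative only.

```r
condition <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, TRUE, TRUE)

runs <- rle(condition)
runs$lengths   # 1 1 1 3 3 2
runs$values    # FALSE TRUE FALSE TRUE FALSE TRUE

# Row number at which each run ends
run_end <- cumsum(runs$lengths)   # 1 2 3 6 9 11

# TRUE runs of length 2 or more that have a preceding run
idx <- which(runs$values & runs$lengths >= 2)
idx <- idx[idx > 1]

# The preceding run always ends on a FALSE, so its end row is the
# start of the chain; the TRUE run's end row is the end of the chain
chains <- data.frame(first = run_end[idx - 1], last = run_end[idx])
chains
```

For this data the chains cover rows 3 to 6 and rows 9 to 11, which are the ranges the loop overwrites.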

chinsoon12 · 6 years ago

Here is a data.table solution that should be faster at run time.

library(data.table)
setDT(df)
df[, time := if (.N > 2) time[.N] else time, 
    by=cumsum(!shift(c(condition, FALSE))[-1L])]

#    user time condition
# 1:   11 1:05     FALSE
# 2:   11 1:10      TRUE
# 3:   11 1:25     FALSE
# 4:   11 1:25      TRUE
# 5:   11 1:25      TRUE
# 6:   11 1:25      TRUE
# 7:   11 1:40     FALSE
# 8:   22 2:20     FALSE
# 9:   22 2:40     FALSE
#10:   22 2:40      TRUE
#11:   22 2:40      TRUE

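The idea is to split the data into sequences that each start with an F.

The [-1L] drops the leading NA that shift introduces, before cumsum is applied.

I suggest you run some of the by code in j to take a look.

Data: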
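For readers without data.table installed, the grouping idea can be inspected with base R. This is only a sketch: it assumes that, on this data, the by expression reduces to cumsum(!condition), since dropping the first element of the shifted vector gives back the original condition column. The names grp and time_new are illustrative.

```r
condition <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, TRUE, TRUE)
time <- c("1:05", "1:10", "1:10", "1:15", "1:20", "1:25",
          "1:40", "2:20", "2:30", "2:35", "2:40")

# Every FALSE opens a new group, so each group is one FALSE
# followed by the TRUEs that come after it
grp <- cumsum(!condition)   # 1 1 2 2 2 2 3 4 5 5 5

# Within each group of more than 2 rows (a FALSE plus two or more TRUEs),
# overwrite every time with the last time of the group
time_new <- ave(time, grp, FUN = function(x) {
  if (length(x) > 2) rep(x[length(x)], length(x)) else x
})
time_new
```

Printing grp next to condition shows why the test is on a group size greater than 2: the leading FALSE counts as one row of its group.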

df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)