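I have a dataset covering multiple 'case_id' values. For each 'case_id', I need to create three new columns: 'var1_below', 'var1_at', and 'var1_above'. Each column should hold the average 'session_length' of the sessions that occur after 'average_code_3' changes to the corresponding status, up until the next change.
Here is the algorithm for computing 'var1_at' (the process for 'var1_below' and 'var1_above' is analogous):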
For each 'case_id':
If 'average_code_3' is "at":
Calculate the average of 'session_length' for all subsequent sessions until 'average_code_3' changes status. The first session after the change should also be included in the average calculation.
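For example, in the dataset below, 'var1_at' is the average of 4009, 79118, and 833151 (from the 'session_length' column), which equals 305426. 'var1_below' is NA because there are no "below" records in the 'average_code_3' column. 'var1_above' is the average of 1820655 and 886950, which equals 1353802.5. (In the attachment I color-coded these records in green and blue; essentially, I want the average of those colored records in the 'session_length' column.)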
The example data frame 'df' looks like this:
df <- data.frame(
  case_id = rep("188_161_2", 21),
  view_end_timestamp = c(1424065707, 1424112235, 1424154932, 1424155003, 1424156518, 1424156557, 1424288125, 1424288141, 1424331365, 1424413370, 1424413389, 1424413492, 1424454887, 1424496497, 1424496518, 1424496611, 1425329649, 1425329674, 1427150341, 1428037250, 1428037421),
  view_length = c(6732, 20, 42681, 55, 1515, 39, 65, 6, 43224, 7, 10, 103, 37515, 7, 21, 74, 8, 25, 45, 4, 171),
  average_code_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "at", NA, NA, "at", NA, NA, "above", NA, NA, "above", NA, NA, "above"),
  session_num = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 8, 9, 10, 10),
  session_length = c(53240, 42733, 42733, 133112, 133112, 133112, 125303, 125303, 125303, 4009, 4009, 4009, 79118, 833151, 833151, 833151, 1820655, 1820655, 886950, NA, NA)
)
How can I do this in R?
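One possible reading of the algorithm above, sketched in base R. This is not a definitive solution: the helper name 'add_status_means' is hypothetical, and it rests on three assumptions that the description does not fully settle. First, a session counts toward a status if that status was in effect when the session started, which is what makes the session containing the next change count toward the old status. Second, if the same status occurs in several separate stretches within a case, all of those sessions are pooled into one average. Third, sessions with a missing 'session_length' are ignored.

```r
# Hypothetical helper (not from the original post): adds var1_below, var1_at
# and var1_above to df, computed separately for each case_id.
add_status_means <- function(df, levels = c("below", "at", "above")) {
  pieces <- lapply(split(df, df$case_id), function(d) {
    # Assumption: sorting by timestamp puts the rows in session order.
    d <- d[order(d$view_end_timestamp), ]
    sess <- unique(d$session_num)

    # Collapse to one value per session: the last recorded code (if any)
    # and the session length, which repeats on every row of a session.
    last_code <- vapply(sess, function(s) {
      codes <- as.character(d$average_code_3[d$session_num == s])
      codes <- codes[!is.na(codes)]
      if (length(codes) > 0) tail(codes, 1) else NA_character_
    }, character(1))
    len <- vapply(sess, function(s) {
      as.numeric(d$session_length[d$session_num == s][1])
    }, numeric(1))

    # Carry the most recent known status forward across sessions.
    idx <- cummax(ifelse(is.na(last_code), 0L, seq_along(last_code)))
    filled <- c(NA_character_, last_code)[idx + 1]

    # Status in effect when each session starts, i.e. the status at the
    # end of the previous session (assumption described in the lead-in).
    entering <- c(NA_character_, head(filled, -1))

    # Average the session lengths per status, skipping missing lengths.
    for (l in levels) {
      x <- len[!is.na(entering) & entering == l]
      x <- x[!is.na(x)]
      d[[paste0("var1_", l)]] <- if (length(x) > 0) mean(x) else NA_real_
    }
    d
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}
```

Calling add_status_means(df) on the example 'df' should give 305426 for 'var1_at', 1353802.5 for 'var1_above', and NA for 'var1_below' on every row, matching the values described above. Note that the result is sorted by 'case_id' and timestamp, so the row order may differ from the input.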