代码之家 › 专栏 › 技术社区 › SeGa

如何基于时间戳和值查找同步id

data.table dplyr r sql

SeGa · 技术社区 · 6 年前

我试图找到同步数据条目,它们在一定的时间戳上共享一定的值(“ref”)。

虚拟数据:

library(data.table)

dft <- data.table(
  id = rep(1:5, each=5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft

在该示例中,ID的1和2将在第1、2、6、7、11、12、16、17行的4个时间戳上同步,因为它们共享相同的时间戳 ref 值(行标记为 ! ). 注意:它们在一个时间戳内共享相同的ref值,并且可能在下一个时间戳中共享另一个ref值。

我怎样才能解决这个问题?我还想定义时间戳的数量,其中的值必须是相同的。如果我定义至少5个时间戳必须是同步的,那么在该示例中不应该出现任何ID。如果为4或更低,则ID的1&2应显示为同步数据项。

我必须在数百万行中进行计算,所以我希望 data.table 或 dplyr 解决方案或任何其他性能解决方案(SQL也可以)。

    id                time ref
 1:  1 2018-10-14 02:00:01  10    !
 2:  2 2018-10-14 02:00:01  10    !
 3:  3 2018-10-14 02:00:01  20
 4:  4 2018-10-14 02:00:01  20
 5:  5 2018-10-14 02:00:01  20
 6:  1 2018-10-14 02:00:02  11    !
 7:  2 2018-10-14 02:00:02  11    !
 8:  3 2018-10-14 02:00:02  31
 9:  4 2018-10-14 02:00:02  41
10:  5 2018-10-14 02:00:02  51
11:  1 2018-10-14 02:00:03  11    !
12:  2 2018-10-14 02:00:03  11    !
13:  3 2018-10-14 02:00:03  31
14:  4 2018-10-14 02:00:03  41
15:  5 2018-10-14 02:00:03  51
16:  1 2018-10-14 02:00:04  11    !
17:  2 2018-10-14 02:00:04  11    !
18:  3 2018-10-14 02:00:04  31
19:  4 2018-10-14 02:00:04  41
20:  5 2018-10-14 02:00:04  51
21:  1 2018-10-14 02:00:05  11
22:  2 2018-10-14 02:00:05  21
23:  3 2018-10-14 02:00:05  31
24:  4 2018-10-14 02:00:05  41
25:  5 2018-10-14 02:00:05  51

对来自@daviderenburg的两个例子进行基准测试:

library(microbenchmark)

mc = microbenchmark(times = 100,
  res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
  res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)

mc

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987   100  a 
 res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397   100   b

2 回复 | 直到 6 年前

David Arenburg Ulrik 6 年前

可能的数据表解决方案

dft[dft, .(id, id2 = x.id), # get the desired columns
         on = .(id > id, time, ref), # the join condition
         nomatch = 0L, # remove unmatched records (NAs)
         allow.cartesian = TRUE # In case of a big join, allow Cartesian join 
     ][, .N, by = .(id, id2)] # Count obs. per ids combinations

#    id id2 N
# 1:  1   2 4
# 2:  3   4 1
# 3:  3   5 1
# 4:  4   5 1

解释

我们自己加入 time 和 ref ,同时指定 id > id 所以我们不会加入 id 提取连接的id( 身份证件 和 x.id 哪些是来自 二者都 数据集),同时删除所有不匹配的行( nomatch = 0L ) . 最后,我们计算匹配的组合( .N 是data.table中存储obs数的特殊符号。每种组合)。

旧的(以及更复杂的解决方案)

dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref)
    ][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]

SeGa 6 年前

将@David Arenburgs代码转换为 SQL语言 给我:

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id 
GROUP BY a.id, b.id
ORDER BY count(*) DESC;

只选那些 计数为1 :

SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id 
GROUP BY a.id, b.id HAVING count(*) > 1
ORDER BY count(*) DESC;

生成包含结果数据帧的SQL表的代码( dft )从这个问题:

fwrite(x = dft, file = "C:/testdata.csv", row.names = F)

SQL语言:

CREATE TABLE testdata (
  id serial NOT NULL,
  timest timestamp,
  ref integer
  );

COPY testdata(id, timest, ref) 
FROM 'C:/testdata.csv' DELIMITER ',' CSV;