代码之家  ›  专栏  ›  技术社区  ›  kee

BigQuery:如何进行半左连接?

  •  0
  • kee  · 技术社区  · 6 年前

    我想不出这个问题的好题目。很抱歉。

    Table A:
    ========
    a_id int,
    common_id int,
    ts timestamp
    ...
    
    Table B:
    ========
    b_id int,
    common_id int,
    ts timestamp,
    temperature int
    

    表A更像是每次更改其状态时的设备数据。表B是更多的物联网数据,其中包含设备每分钟左右的温度。

    我想做的是从这两个表创建一个表C。表C实质上是表A+表B中最近时间内的温度。

    如何在BigQuery SQL中做到这一点?温度信息不需要精确。

    2 回复  |  直到 6 年前
        1
  •  2
  •   Mikhail Berlyant    6 年前

    下面的选项(对于BigQuery标准SQL)假设除了 temperature

    #standardSQL
    SELECT 
      ARRAY_AGG(
        STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
        ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
        LIMIT 1
      )[SAFE_OFFSET(0)].*
    FROM `project.dataset.table_a` a 
    LEFT JOIN `project.dataset.table_b` b
    ON a.common_id = b.common_id 
    AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
    GROUP BY TO_JSON_STRING(a)
    

    我用下面生成的虚拟数据测试了它

    #standardSQL
    WITH `project.dataset.table_a` AS ( 
      SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
      FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
      CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
    ), `project.dataset.table_b` AS ( 
      SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature 
      FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
      CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
    ) 
    SELECT 
      ARRAY_AGG(
        STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature) 
        ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND)) 
        LIMIT 1
      )[SAFE_OFFSET(0)].*
    FROM `project.dataset.table_a` a 
    LEFT JOIN `project.dataset.table_b` b
    ON a.common_id = b.common_id 
    AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
    GROUP BY TO_JSON_STRING(a)  
    

    以输出的几行为例:

    Row a_id    common_id ts                        b_id    b_ts                    temperature  
    1   276623  1         2018-01-01 00:00:00 UTC   166995  2018-01-01 00:00:00 UTC     74   
    2   218354  1         2018-01-01 00:45:27 UTC   464901  2018-01-01 00:45:00 UTC     87   
    3   265634  1         2018-01-01 01:30:54 UTC   565385  2018-01-01 01:31:00 UTC     87   
    4   758075  1         2018-01-01 02:16:21 UTC   55894   2018-01-01 02:16:00 UTC     84   
    5   306355  1         2018-01-01 03:01:48 UTC   844429  2018-01-01 03:02:00 UTC     92   
    6   348502  1         2018-01-01 03:47:15 UTC   375859  2018-01-01 03:47:00 UTC     90   
    7   774920  1         2018-01-01 04:32:42 UTC   438164  2018-01-01 04:33:00 UTC     61   
    

    这里-我将table_b设置为“2018-01-01”全天10台设备每分钟的温度,在table_a中,我将table_b设置为同一天10台设备每45分钟27秒的状态变化。a_id和b_id-只是0到999999之间的随机数

    ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30 从句 JOIN 控制时段,您可以认为可以查找最近的ts(如果表中缺少一些IoT条目的话

        2
  •  1
  •   Felipe Hoffa    6 年前

    TIMESTAMP_DIFF(a.ts,b.ts, SECOND) -通过其绝对值在任何方向上获得最接近的值:

    WITH a AS ( 
      SELECT 1 id, TIMESTAMP('2018-01-01 11:01:00') ts
      UNION ALL SELECT 1, ('2018-01-02 10:00:00')
      UNION ALL SELECT 2, ('2018-01-02 10:00:00')
    )
    , b AS ( 
      SELECT 1 id, TIMESTAMP('2018-01-01 12:01:00') ts, 43 temp 
      UNION ALL SELECT 1, TIMESTAMP('2018-01-01 12:06:00'), 47 
    )
    
    SELECT *, 
      (SELECT temp 
       FROM b 
       WHERE a.id=b.id 
       ORDER BY ABS(TIMESTAMP_DIFF(a.ts,b.ts, SECOND)) 
       LIMIT 1) temp
    FROM a
    

    enter image description here