
Grouping a sequence of rows by the first value of each sequence in SQL

  • homam · 4 years ago

    How can I group a dataset of sequences by the first value of each sequence in SQL?

    For example, I have the following dataset:

    id  name  key  metric
    1   alice a    0   <- key = 'a', start of a sequence
    2   alice b    1
    3   alice b    1
    -----------------
    4   alice a    1   <- key = 'a', start of a sequence
    5   alice b    0
    6   alice b    0
    7   alice b    0
    -----------------
    8   bob   a    1   <- key = 'a', start of a sequence
    9   bob   b    1
    -----------------
    10  bob   a    0   <- key = 'a', start of a sequence
    

    Rows with key = 'a' start a new group. I want to sum the metric of all subsequent rows until I reach another row with key = 'a' or another name.

    The dataset is sorted by id.

    The final result should look like this:

    id  name   metric
    1   alice  2
    4   alice  1
    8   bob    2
    10  bob    0
    

    This is the equivalent operation in JavaScript, but I would like to get the same result with a SQL query:

    data.reduce((acc, a) => {
        if(a.key === 'a'){
          // key = 'a' starts a new group
          return [{id: a.id, name: a.name, metric: a.metric}].concat(acc)
        } else {
          // because the data is sorted, 
          // all the subsequent rows with key = 'b' belong to the latest group
          const [head, ...tail] = acc
          const head_updated = {...head, metric: head.metric + a.metric}
          return [head_updated, ...tail]
        }
      }, [])
      .reverse()
    
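    As a sanity check for the expected output, the same fold can be written in plain Python (a sketch; the function and variable names are illustrative, not from the original post):

    ```python
    def group_by_head(rows):
        """rows: iterable of (id, name, key, metric) tuples, sorted by id.
        Each key = 'a' row starts a new group; the metrics of the
        following key = 'b' rows are added to that group."""
        groups = []
        for id_, name, key, metric in rows:
            if key == 'a':
                # key = 'a' starts a new group
                groups.append({'id': id_, 'name': name, 'metric': metric})
            else:
                # data is sorted, so this row belongs to the latest group
                groups[-1]['metric'] += metric
        return groups

    data = [
        (1, 'alice', 'a', 0), (2, 'alice', 'b', 1), (3, 'alice', 'b', 1),
        (4, 'alice', 'a', 1), (5, 'alice', 'b', 0), (6, 'alice', 'b', 0),
        (7, 'alice', 'b', 0), (8, 'bob', 'a', 1), (9, 'bob', 'b', 1),
        (10, 'bob', 'a', 0),
    ]
    print(group_by_head(data))
    # [{'id': 1, 'name': 'alice', 'metric': 2},
    #  {'id': 4, 'name': 'alice', 'metric': 1},
    #  {'id': 8, 'name': 'bob', 'metric': 2},
    #  {'id': 10, 'name': 'bob', 'metric': 0}]
    ```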

    Sample SQL dataset:

    with dataset as (
      select 
        1       as id
      , 'alice' as name
      , 'a'     as key
      , 0       as metric
      union select
        2       as id
      , 'alice' as name
      , 'b'     as key
      , 1       as metric
      union select
        3       as id
      , 'alice' as name
      , 'b'     as key
      , 1       as metric
      union select 
        4       as id
      , 'alice' as name
      , 'a'     as key
      , 1       as metric
      union select
        5       as id
      , 'alice' as name
      , 'b'     as key
      , 0       as metric
      union select
        6       as id
      , 'alice' as name
      , 'b'     as key
      , 0       as metric
      union select
        7       as id
      , 'alice' as name
      , 'b'     as key
      , 0       as metric
      union select
        8       as id
      , 'bob'   as name
      , 'a'     as key
      , 1       as metric
      union select
        9       as id
      , 'bob'   as name
      , 'b'     as key
      , 1       as metric
      union select
        10      as id
      , 'bob'   as name
      , 'a'     as key
      , 0       as metric
    )
    
    select * from dataset
    order by name, id
    
  • forpas · 4 years ago

    You can use the window function sum() to create the groups and then aggregate:

    select min(id) id, name, sum(metric) metric
    from (
      select *, sum((key = 'a')::int) over (partition by name order by id) grp 
      from dataset
    ) t
    group by name, grp
    order by id
    

    See the demo.
    Result:

    > id | name  | metric
    > -: | :---- | -----:
    >  1 | alice |      2
    >  4 | alice |      1
    >  8 | bob   |      2
    > 10 | bob   |      0
    
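    The window-sum approach above can be sanity-checked with SQLite's built-in window functions (available since SQLite 3.25) — a sketch, not the answer's original Postgres environment; the Postgres-only `(key = 'a')::int` cast is rewritten as a CASE expression for portability:

    ```python
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("create table dataset (id int, name text, key text, metric int)")
    conn.executemany(
        "insert into dataset values (?, ?, ?, ?)",
        [(1, 'alice', 'a', 0), (2, 'alice', 'b', 1), (3, 'alice', 'b', 1),
         (4, 'alice', 'a', 1), (5, 'alice', 'b', 0), (6, 'alice', 'b', 0),
         (7, 'alice', 'b', 0), (8, 'bob', 'a', 1), (9, 'bob', 'b', 1),
         (10, 'bob', 'a', 0)])

    rows = conn.execute("""
        select min(id) id, name, sum(metric) metric
        from (
          -- running count of key = 'a' rows per name labels each group
          select *, sum(case when key = 'a' then 1 else 0 end)
                      over (partition by name order by id) grp
          from dataset
        ) t
        group by name, grp
        order by id
    """).fetchall()
    print(rows)
    # [(1, 'alice', 2), (4, 'alice', 1), (8, 'bob', 2), (10, 'bob', 0)]
    ```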
  • MkWTF · 4 years ago

    Based on what the OP wrote in the comments, the query actually has to be the following:

    SELECT MAX(t.head_id) AS id,
           t.head_name AS name,
           SUM(t.metric) AS metric
    FROM (
        SELECT SUM(CASE WHEN key = 'a' THEN 1 END) OVER (PARTITION BY name ORDER BY id) AS group_id,
               CASE WHEN key = 'a' THEN id END AS head_id,
               name AS head_name,
               metric
        FROM dataset
    ) t
    GROUP BY t.head_name, t.group_id
    

    However, if you can add an index on name and id, it does improve the performance of the query, because no sort operation is needed before the aggregation.

    Testing with a table of one million rows, this is the output of EXPLAIN ANALYZE without the index:

    HashAggregate  (cost=177154.34..177158.34 rows=400 width=25) (actual time=3374.878..3489.755 rows=400000 loops=1)
      Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
      ->  WindowAgg  (cost=132154.34..157154.34 rows=1000000 width=25) (actual time=1920.338..3000.218 rows=1000000 loops=1)
            ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=15) (actual time=1920.323..2232.936 rows=1000000 loops=1)
                  Sort Key: dataset.name, dataset.id
                  Sort Method: external merge  Disk: 28192kB
                  ->  Seq Scan on dataset  (cost=0.00..15406.00 rows=1000000 width=15) (actual time=0.020..172.746 rows=1000000 loops=1)
    
    Planning Time: 0.870 ms
    Execution Time: 3516.726 ms
    

    After creating the index, the query plan changes to the following:

    Index:

    CREATE INDEX dataset__name_id__idx ON dataset(name, id);
    

    Query plan:

    HashAggregate  (cost=90169.90..90173.90 rows=400 width=25) (actual time=1464.759..1567.778 rows=400000 loops=1)
      Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
      ->  WindowAgg  (cost=0.42..70169.90 rows=1000000 width=25) (actual time=0.033..1077.362 rows=1000000 loops=1)
            ->  Index Scan using dataset__name_id__idx on dataset  (cost=0.42..47669.90 rows=1000000 width=15) (actual time=0.022..225.445 rows=1000000 loops=1)
    
    Planning Time: 0.131 ms
    Execution Time: 1590.040 ms
    

    Old answer

    Based on your JavaScript code, you don't want to partition the window by name, nor group by name in the outer query. Without those, you actually get a better query that can use just the primary key index, assuming the id column is indexed.

    SELECT t.head_id AS id,
           MAX(t.head_name) AS name,
           SUM(t.metric) AS metric
    FROM (
            SELECT MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id,
                   CASE WHEN key = 'a' THEN name END AS head_name,
                   metric
            FROM dataset
        ) t
    GROUP BY t.head_id
    
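    This variant can also be checked in SQLite (again a sketch, assuming SQLite ≥ 3.25): because the sample ids are globally ordered and each name's rows are contiguous, grouping by the running MAX of the key = 'a' ids alone yields the same result. An ORDER BY is added here only to make the output order deterministic:

    ```python
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("create table dataset (id int, name text, key text, metric int)")
    conn.executemany(
        "insert into dataset values (?, ?, ?, ?)",
        [(1, 'alice', 'a', 0), (2, 'alice', 'b', 1), (3, 'alice', 'b', 1),
         (4, 'alice', 'a', 1), (5, 'alice', 'b', 0), (6, 'alice', 'b', 0),
         (7, 'alice', 'b', 0), (8, 'bob', 'a', 1), (9, 'bob', 'b', 1),
         (10, 'bob', 'a', 0)])

    rows = conn.execute("""
        select t.head_id as id, max(t.head_name) as name, sum(t.metric) as metric
        from (
          -- running max of the key = 'a' ids tags every row with its group head
          select max(case when key = 'a' then id end) over (order by id) as head_id,
                 case when key = 'a' then name end as head_name,
                 metric
          from dataset
        ) t
        group by t.head_id
        order by t.head_id
    """).fetchall()
    print(rows)
    # [(1, 'alice', 2), (4, 'alice', 1), (8, 'bob', 2), (10, 'bob', 0)]
    ```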

    This is the query plan when dataset has 1 million rows:

    HashAggregate  (cost=68889.43..68891.43 rows=200 width=44) (actual time=1277.469..1393.709 rows=400000 loops=1)
      Group Key: max(CASE WHEN (dataset.key = 'a'::text) THEN dataset.id ELSE NULL::integer END) OVER (?)
      ->  WindowAgg  (cost=0.42..51389.43 rows=1000000 width=44) (actual time=0.025..927.595 rows=1000000 loops=1)
            ->  Index Scan using dataset_pkey on dataset  (cost=0.42..31389.42 rows=1000000 width=15) (actual time=0.017..209.657 rows=1000000 loops=1)
    
    Planning Time: 0.127 ms
    Execution Time: 1411.975 ms