代码之家  ›  专栏  ›  技术社区  ›  MatBailie

窗口平均数,计算缺口

  •  1
  • MatBailie  · 技术社区  · 6 年前

    我需要计算过去4周的平均值。。。

    SELECT
      *,
      AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
                         ORDER BY date_column
                      ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                    )
                       AS preceding_4_week_average
    FROM
      myTable
    

    然而,数据是“稀疏的”

    在这种情况下,我的窗口函数应该返回“4周”,而不是“4行”。
    -错过的日期不是 0 它隐含着 NULL

     thing | date       | dow | val | avg
       1   | 2018-01-01 |  1  |  1  | NULL  <= AVG({})
       1   | 2018-01-08 |  1  |  2  | 1     <= AVG({1})
       1   | 2018-01-15 |  1  |  3  | 1.5   <= AVG({1,2})
       1   | 2018-01-22 |  1  |  4  | 2     <= AVG({1,2,3})
       1   | 2018-01-29 |  1  |  5  | 2.5   <= AVG({1,2,3,4})
    
       1   | 2018-02-12 |  1  |  7  | 4     <= AVG({3,4,5})
       1   | 2018-02-19 |  1  |  8  | 5.33  <= AVG({4,5,7})
       1   | 2018-02-26 |  1  |  9  | 6.66  <= AVG({5,7,8})
       1   | 2018-03-05 |  1  |  10 | 8     <= AVG({7,8,9})
       1   | 2018-03-12 |  1  |  11 | 11.25 <= AVG({7,8,9,10})
       1   | 2018-03-19 |  1  |  12 | 9.5   <= AVG({8,9,10,11})
    

    注:2018-02-05无数值

    我通常会用两种方法之一。。。


    1. 左键加入模板以“强制”所有日期存在,并依赖于 AVG() 实际上“忽略”了空值。

    这不太理想,因为“东西”的数量很大,而且构建这个模板非常昂贵。

    SELECT
      *,
      AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
                                 ORDER BY dates.date
                              ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                            )
                               AS preceding_4_week_average
    FROM
      things
    CROSS JOIN
      dates
    LEFT JOIN
      myTable
        ON  myTable.date = dates.date
        AND myTable.id   = things.id
    


    1. 不要使用窗口函数,而是使用自连接

    这不太理想,因为myTable中有数百列,而BigQuery在这方面的性能不是很好。

    SELECT
      myTable.*,
      AVG(hist.val)   AS preceding_4_week_average
    FROM
      myTable
    LEFT JOIN
      myTable   AS hist
        ON  hist.id    = myTable.id
        AND hist.date >= myTable.date - INTERVAL 28 DAYS
        AND hist.date <  myTable.date
    GROUP BY
      myTable.column1,
      myTable.column2,
      etc, etc
    


    实际问题

    其他人是否有其他选择,最好使用窗口/分析函数来“回顾4周”而不是“回顾4行”?

    2 回复  |  直到 6 年前
        1
  •  3
  •   Mikhail Berlyant    6 年前

    下面是BigQuery标准SQL

    如你所见-诀窍在于使用 RANGE 而不是 ROW

    #standardSQL
    SELECT *,
      AVG(val) OVER(
        PARTITION BY id, dow 
        ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
        RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
      ) AVG
    FROM `project.dataset.table`   
    

    你可以使用下面问题中的虚拟数据来测试和玩上面的游戏

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
      SELECT 1, '2018-01-08', 1, 2  UNION ALL
      SELECT 1, '2018-01-15', 1, 3  UNION ALL
      SELECT 1, '2018-01-22', 1, 4  UNION ALL
      SELECT 1, '2018-01-29', 1, 5  UNION ALL
      SELECT 1, '2018-02-12', 1, 7  UNION ALL
      SELECT 1, '2018-02-19', 1, 8  UNION ALL
      SELECT 1, '2018-02-26', 1, 9  UNION ALL
      SELECT 1, '2018-03-05', 1, 10 UNION ALL
      SELECT 1, '2018-03-12', 1, 11 UNION ALL
      SELECT 1, '2018-03-19', 1, 12  
    )
    SELECT *,
      AVG(val) OVER(
        PARTITION BY id, dow 
        ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
        RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
      ) avg
    FROM `project.dataset.table`
    -- ORDER BY date
    

    结果是

    Row id      date    dow val avg 
    1   1   2018-01-01  1   1   null     
    2   1   2018-01-08  1   2   1.0  
    3   1   2018-01-15  1   3   1.5  
    4   1   2018-01-22  1   4   2.0  
    5   1   2018-01-29  1   5   2.5  
    6   1   2018-02-12  1   7   4.0  
    7   1   2018-02-19  1   8   5.333333333333333    
    8   1   2018-02-26  1   9   6.666666666666667    
    9   1   2018-03-05  1   10  8.0  
    10  1   2018-03-12  1   11  8.5  
    11  1   2018-03-19  1   12  9.5  
    
        2
  •  0
  •   Gordon Linoff    6 年前

    这是蛮力,但应该更快:

    select t.*,
           ((case when date_1 >= date_add(date, interval -4 week)
                  then val_1 else 0
             end) +
            (case when date_2 >= date_add(date, interval -4 week)
                  then val_2 else 0
             end) +
            (case when date_3 >= date_add(date, interval -4 week)
                  then val_3 else 0
             end) +
            (case when date_4 >= date_add(date, interval -4 week)
                  then val_4 else 0
             end)
            ) /
           ((case when date_1 >= date_add(date, interval -4 week)
                  then 1 else 0
             end) +
            (case when date_2 >= date_add(date, interval -4 week)
                  then 1 else 0
             end) +
            (case when date_3 >= date_add(date, interval -4 week)
                  then 1 else 0
             end) +
            (case when date_4 >= date_add(date, interval -4 week)
                  then 1 else 0
             end)
            )        
    from (select t.*,
                 lag(val, 1) over (partition by id, dow order by date) as val_1,
                 lag(val, 2) over (partition by id, dow order by date) as val_2,
                 lag(val, 3) over (partition by id, dow order by date) as val_3,
                 lag(val, 4) over (partition by id, dow order by date) as val_4,
                 lag(date, 1) over (partition by id, dow order by date) as date_1,
                 lag(date, 2) over (partition by id, dow order by date) as date_2,
                 lag(date, 3) over (partition by id, dow order by date) as date_3,
                 lag(date, 4) over (partition by id, dow order by date) as date_4
       from mytable t
         ) t;
    

    也许有一种聪明的方法可以用数组来表达,但我现在所处的位置要早一点。

    推荐文章