代码之家  ›  专栏  ›  技术社区  ›  rpanai

SQL统计一个组中有多少离群值

  •  0
  • rpanai  · 技术社区  · 6 年前

    我想计算一下每种元素的数量 group by 大于$\mu+\sigma$、$\mu+2\sigma$等。

    到目前为止,我发现了一个首先创建表的解决方案 gp 使用$\mu$和$\sigma$

    CREATE TABLE gp AS
    SELECT col_a,
           col_b,
           AVG(y) AS y_mean,
           STDDEV(y) AS y_std
    FROM my_table
    GROUP BY col_a, col_b;
    

    然后我做了一个 left join 使用原始表并通过

    SELECT col_a,
           col_b,
           SUM(CASE
                 WHEN y>y_mean+y_std THEN 1
                 ELSE 0
               END) AS std1,
           SUM(CASE
                 WHEN y>y_mean+2*y_std THEN 1
                 ELSE 0
               END) AS std2,
           SUM(CASE
                 WHEN y>y_mean+3*y_std THEN 1
                 ELSE 0
               END) AS std3, 
    FROM (
    SELECT a.*,
           b.y_mean,
           b.y_std
    FROM(
    (SELECT col_a,
           col_b,
           y
    FROM my_table) a
    LEFT JOIN (SELECT * FROM gp) b
    ON a.col_a=b.col_a AND a.col_b=b.col_b)
    )
    GROUP BY col_a, col_b
    

    我想知道是否有更有效的方法来达到同样的结果。

    1 回复  |  直到 6 年前
        1
  •  1
  •   Gordon Linoff    6 年前

    使用窗口功能:

    SELECT col_a, col_b,
           SUM(CASE WHEN y > y_mean + y_std THEN 1 ELSE 0
               END) AS std1,
           SUM(CASE WHEN y > y_mean + 2 * y_std THEN 1 ELSE 0
               END) AS std2,
           SUM(CASE WHEN y > y_mean + 3 * y_std THEN 1 ELSE 0
               END) AS std3
    FROM (SELECT t.*,
                 AVG(y) OVER (PARTITION BY col_a, col_b) as y_mean,
                 STDDEV(y) OVER (PARTITION BY col_a, col_b) as y_std
          FROM my_table t
         ) t
    GROUP BY col_a, col_b;
    

    从统计的角度来看,您也应该看到下限。如果分布只向正方向倾斜,那么标准偏差可能不是最佳度量(尽管在使用数据库时没有太多选择)。