代码之家  ›  专栏  ›  技术社区  ›  John Frazer

(Postgres)SQL:如何提供所有缺失的对?

  •  0
  • John Frazer  · 技术社区  · 7 年前

    给定一个包含成对“因子”和 exists

    create table pairs (
      factor_1  text,
      factor_2  text,
      exists    boolean
      );
    

    和以下数据(可读性分隔符):

     factor_1 | factor_2 | exists
    ----------+------------------
    foo       | one      | t
    foo       | two      | t
    -----------------------------
    bar       | three    | t
    -----------------------------
    baz       | four     | t
    baz       | five     | t
    

    如何创建一个视图来显示 给定因子集:

     factor_1 | factor_2 | exists
    ----------+------------------
    foo       | one      | t
    foo       | two      | t
    foo       | three    | f
    foo       | four     | f
    foo       | five     | f
    -----------------------------
    bar       | one      | f
    bar       | two      | f
    bar       | three    | t
    bar       | four     | f
    bar       | five     | f
    -----------------------------
    baz       | one      | f
    baz       | two      | f
    baz       | three    | f
    baz       | four     | t
    baz       | five     | t
    

    我想可以定义一个包含所有 的不同值 factor_1 ,另一个包含 factor_2 ,然后取叉积并设置 对所有人都是如此 表中的配对 pairs . 有更优雅/高效/地道的吗 实现相同目标的方法?

    编辑 讨论解决方案:

    在很短的时间内,在问这个问题和得到两个答案之间有一段距离 为此,我去实现了我在上面记下的解决方案。这是 它看起来像什么;它有3个连接时序和一个隐式交叉连接:

    with
      p1 as ( select distinct factor_1 from pairs  ),
      p2 as ( select distinct factor_2 from pairs  ),
      p3 as ( select *                 from p1, p2 )
      select
          p3.factor_1 as factor_1,
          p3.factor_2 as factor_2,
          ( case when p.exists then true else false end ) as exists
        from p3
        left join pairs as p on ( p3.factor_1 = p.factor_1 and p3.factor_2 = p.factor_2 )
        order by p3.factor_1, p3.factor_2;
    

    现在让我们将其与答案进行比较。我重新格式化并重命名为 让所有解决方案只在重要的地方有所不同。

    Gordon Linoff的解决方案A相当短,并且可以在没有CTE的情况下使用:

    select
        f1.factor_1                 as factor_1,
        f2.factor_2                 as factor_2,
        coalesce( p.exists, false ) as exists
      from        ( select distinct factor_1 from pairs ) as p1
      cross join  ( select distinct factor_2 from pairs ) as p2
      left  join  pairs p
        on p.factor_1 = p1.factor_1 and p.factor_2 = p2.factor_2
        order by p1.factor_1, p2.factor_2;
    

    Valli的解决方案B甚至更短一点;它的洞察力是它的组合 从交叉连接来看,什么应该是唯一的,所以 distinct 关键字可能会被分解掉 到达顶端 select :

    select distinct
        p1.factor_1                 as factor_1,
        p2.factor_2                 as factor_2,
        coalesce( p.exists, false ) as exists
      from        pairs as p1
      cross join  pairs as p2
      left  join  pairs as p
        on p1.factor_1 = p.factor_1 and p2.factor_2 = p.factor_2
        order by p1.factor_1, p2.factor_2;
    

    然后被过滤掉的重复对太多。所以我做到了 explain analyze ( :我删除了 order by 条款);结果表明,结果有些矛盾。我的解决方案 由于热膨胀系数的存在,热膨胀系数有缺点。我在SQL中经常使用它们,因为它们非常方便,但是 它们在PostgreSQL中也被称为优化岛(类似于独立视图),如图所示。

                                                           QUERY PLAN                                                        
    -------------------------------------------------------------------------------------------------------------------------
     Merge Left Join  (cost=4770.47..5085.69 rows=40000 width=65) (actual time=0.167..0.189 rows=15 loops=1)
       Merge Cond: ((v3.factor_1 = p.factor_1) AND (v3.factor_2 = p.factor_2))
       CTE v1
         ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.026..0.028 rows=3 loops=1)
               Group Key: pairs.factor_1
               ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.012 rows=5 loops=1)
       CTE v2
         ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.011..0.012 rows=5 loops=1)
               Group Key: pairs_1.factor_2
               ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=1)
       CTE v3
         ->  Nested Loop  (cost=0.00..806.00 rows=40000 width=64) (actual time=0.044..0.062 rows=15 loops=1)
               ->  CTE Scan on v1  (cost=0.00..4.00 rows=200 width=32) (actual time=0.028..0.030 rows=3 loops=1)
               ->  CTE Scan on v2  (cost=0.00..4.00 rows=200 width=32) (actual time=0.005..0.007 rows=5 loops=3)
       ->  Sort  (cost=3857.54..3957.54 rows=40000 width=64) (actual time=0.118..0.123 rows=15 loops=1)
             Sort Key: v3.factor_1, v3.factor_2
             Sort Method: quicksort  Memory: 25kB
             ->  CTE Scan on v3  (cost=0.00..800.00 rows=40000 width=64) (actual time=0.046..0.074 rows=15 loops=1)
       ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.042..0.042 rows=5 loops=1)
             Sort Key: p.factor_1, p.factor_2
             Sort Method: quicksort  Memory: 25kB
             ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.005..0.008 rows=5 loops=1)
     Planning time: 0.368 ms
     Execution time: 0.421 ms
    (24 rows)
    

    注意有两个 sort 在这个计划中。

    解决方案A的计划要短得多(执行时间也高得出奇):

                                                                   QUERY PLAN                                                                
    -----------------------------------------------------------------------------------------------------------------------------------------
     Hash Right Join  (cost=1580.25..2499.00 rows=40000 width=65) (actual time=1.048..2.197 rows=15 loops=1)
       Hash Cond: ((p.factor_1 = pairs.factor_1) AND (p.factor_2 = pairs_1.factor_2))
       ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.010..0.015 rows=5 loops=1)
       ->  Hash  (cost=550.25..550.25 rows=40000 width=64) (actual time=0.649..0.649 rows=15 loops=1)
             Buckets: 65536  Batches: 2  Memory Usage: 513kB
             ->  Nested Loop  (cost=41.75..550.25 rows=40000 width=64) (actual time=0.058..0.077 rows=15 loops=1)
                   ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.033..0.036 rows=3 loops=1)
                         Group Key: pairs.factor_1
                         ->  Seq Scan on pairs  (cost=0.00..18.70 rows=870 width=32) (actual time=0.017..0.018 rows=5 loops=1)
                   ->  Materialize  (cost=20.88..25.88 rows=200 width=32) (actual time=0.008..0.011 rows=5 loops=3)
                         ->  HashAggregate  (cost=20.88..22.88 rows=200 width=32) (actual time=0.013..0.016 rows=5 loops=1)
                               Group Key: pairs_1.factor_2
                               ->  Seq Scan on pairs pairs_1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.004..0.006 rows=5 loops=1)
     Planning time: 0.258 ms
     Execution time: 2.342 ms
    (15 rows)
    

    解决方案B的执行计划比解决方案A长得多,有几个隐式 分类 s:

                                                                    QUERY PLAN                                                                
    ------------------------------------------------------------------------------------------------------------------------------------------
     Unique  (cost=282354.48..289923.48 rows=80000 width=65) (actual time=0.230..0.251 rows=15 loops=1)
       ->  Sort  (cost=282354.48..284246.73 rows=756900 width=65) (actual time=0.229..0.233 rows=25 loops=1)
             Sort Key: p1.factor_1, p2.factor_2, (COALESCE(p."exists", false))
             Sort Method: quicksort  Memory: 26kB
             ->  Merge Left Join  (cost=140389.32..146354.17 rows=756900 width=65) (actual time=0.122..0.157 rows=25 loops=1)
                   Merge Cond: ((p1.factor_1 = p.factor_1) AND (p2.factor_2 = p.factor_2))
                   ->  Sort  (cost=140328.14..142220.39 rows=756900 width=64) (actual time=0.095..0.100 rows=25 loops=1)
                         Sort Key: p1.factor_1, p2.factor_2
                         Sort Method: quicksort  Memory: 26kB
                         ->  Nested Loop  (cost=0.00..9500.83 rows=756900 width=64) (actual time=0.027..0.043 rows=25 loops=1)
                               ->  Seq Scan on pairs p1  (cost=0.00..18.70 rows=870 width=32) (actual time=0.010..0.011 rows=5 loops=1)
                               ->  Materialize  (cost=0.00..23.05 rows=870 width=32) (actual time=0.003..0.005 rows=5 loops=5)
                                     ->  Seq Scan on pairs p2  (cost=0.00..18.70 rows=870 width=32) (actual time=0.005..0.008 rows=5 loops=1)
                   ->  Sort  (cost=61.18..63.35 rows=870 width=65) (actual time=0.021..0.023 rows=8 loops=1)
                         Sort Key: p.factor_1, p.factor_2
                         Sort Method: quicksort  Memory: 25kB
                         ->  Seq Scan on pairs p  (cost=0.00..18.70 rows=870 width=65) (actual time=0.004..0.004 rows=5 loops=1)
     Planning time: 0.260 ms
     Execution time: 0.333 ms
    (19 rows)
    

    我认为我们可以忽略这个没有索引的短样本的执行时间;只有有了真实的数据,我们才能肯定地告诉他们。

    基于这些结果,我更喜欢Gordon Linoff的解决方案A,原因是它的SQL形式很短,而执行计划是最简洁的。我有点担心解决方案B的执行计划中出现性能不佳的机会,我的猜测是 优雅的 计算出 不同的 从句到最高层,它不一定是最高层 表达方式我不想交叉连接和过滤唯一对,我想交叉连接唯一值。不用说,如果执行时间关系(A:2.3ms/B:0.3ms)最终显示为实际的数据量,这将推翻我的决定。

    3 回复  |  直到 7 年前
        1
  •  1
  •   Gordon Linoff    7 年前

    使用 cross join 获取行和 left join 要获取布尔表达式,请执行以下操作:

    select f1.factor_1, f2.factor_2, coalesce(p.exists, false) as exists
    from (select distinct factor_1 from pairs) f1 cross join
         (select distinct factor_2 from pairs) f2 left join
         pairs p
         on p.factor_1 = f1.factor_1 and p.factor_2 = f2.factor_2;
    

    注:尽管Postgres接受 exists 作为列别名,我认为这是一个坏名字,因为它与SQL关键字冲突。

        2
  •  1
  •   Valli    7 年前

    我们可以在顶部使用distinct,而不是在from子句中过滤distinct记录。交叉连接表,然后左连接以获取存在的列

    SELECT distinct p1.factor_1,
                    p2.factor_2,
                    coalesce(p.exists, false)
      FROM pairs p1 CROSS JOIN
           pairs p2 LEFT JOIN 
           pairs p ON
           p1.factor_1= p.factor_1 and
           p2.factor_2= p.factor_2
    
        3
  •  0
  •   wildplasser    7 年前

    您不需要左连接+合并,因为EXISTS已经生成布尔值:


    SELECT f1.factor_1, f2.factor_2
      , EXISTS ( SELECT* pairs p
                WHERE p.factor_1 = f1.factor_1 AND p.factor_2 = f2.factor_2
                ) AS did_exist
    FROM (SELECT DISTINCT factor_1 FROM pairs) f1
    CROSS JOIN (SELECT DISTINCT factor_2 FROM pairs) f2 
        ;