
Referencing an exploded column in Spark SQL

  • Chris  ·  asked 4 years ago

    It's not clear to me how to reference an exploded column within the same query, and I'm not sure what to search for to find a better explanation.

    SELECT explode(kit) exploded,
        exploded[0]
    FROM tabla
    

    It's clear enough when we reference it from an outer query, since the exploded column already exists there. What am I missing?

    select exploded[0] from
    (SELECT explode(kit) exploded
    FROM tabla)
    


    0 replies  |  until 3 years ago

  •  1
  •   mck    4 years ago

    The physical plans of both queries, as well as that of a lateral view query, are identical. Notably, the analyzed logical plan of the first query is similar to that of the lateral view query. After optimization, the logical plans of all three queries become the same.


    Create the dataframe:

    df = spark.range(1).selectExpr("array(array(1,2),array(3,4)) kit")
    df.createOrReplaceTempView("tabla")  # register the view queried as "tabla" below
    

    First query:

    spark.sql('select explode(kit) exploded, exploded[0] from tabla').explain(True)
    == Parsed Logical Plan ==
    'Project ['explode('kit) AS exploded#308, unresolvedalias('exploded[0], None)]
    +- 'UnresolvedRelation [tabla]
    
    == Analyzed Logical Plan ==
    exploded: array<int>, exploded[0]: int
    Project [exploded#309, exploded#309[0] AS exploded[0]#310]
    +- Generate explode(kit#292), false, [exploded#309]
       +- SubqueryAlias tabla
          +- Project [array(array(1, 2), array(3, 4)) AS kit#292]
             +- Range (0, 1, step=1, splits=Some(24))
    
    == Optimized Logical Plan ==
    Project [exploded#309, exploded#309[0] AS exploded[0]#310]
    +- Generate explode([[1,2],[3,4]]), [0], false, [exploded#309]
       +- Project [[[1,2],[3,4]] AS kit#292]
          +- Range (0, 1, step=1, splits=Some(24))
    
    == Physical Plan ==
    *(2) Project [exploded#309, exploded#309[0] AS exploded[0]#310]
    +- Generate explode([[1,2],[3,4]]), false, [exploded#309]
       +- *(1) Project [[[1,2],[3,4]] AS kit#292]
          +- *(1) Range (0, 1, step=1, splits=24)
    

    Second query: using a subquery

    spark.sql('select exploded[0] from (select explode(kit) exploded from tabla)').explain(True)
    == Parsed Logical Plan ==
    'Project [unresolvedalias('exploded[0], None)]
    +- 'SubqueryAlias __auto_generated_subquery_name
       +- 'Project ['explode('kit) AS exploded#313]
          +- 'UnresolvedRelation [tabla]
    
    == Analyzed Logical Plan ==
    exploded[0]: int
    Project [exploded#314[0] AS exploded[0]#315]
    +- SubqueryAlias __auto_generated_subquery_name
       +- Project [exploded#314]
          +- Generate explode(kit#292), false, [exploded#314]
             +- SubqueryAlias tabla
                +- Project [array(array(1, 2), array(3, 4)) AS kit#292]
                   +- Range (0, 1, step=1, splits=Some(24))
    
    == Optimized Logical Plan ==
    Project [exploded#314[0] AS exploded[0]#315]
    +- Generate explode([[1,2],[3,4]]), [0], false, [exploded#314]
       +- Project [[[1,2],[3,4]] AS kit#292]
          +- Range (0, 1, step=1, splits=Some(24))
    
    == Physical Plan ==
    *(2) Project [exploded#314[0] AS exploded[0]#315]
    +- Generate explode([[1,2],[3,4]]), false, [exploded#314]
       +- *(1) Project [[[1,2],[3,4]] AS kit#292]
          +- *(1) Range (0, 1, step=1, splits=24)
    

    Third query: using a lateral view

    spark.sql('select exploded[0] from tabla lateral view explode(kit) as exploded').explain(True)
    == Parsed Logical Plan ==
    'Project [unresolvedalias('exploded[0], None)]
    +- 'Generate 'explode('kit), false, as, ['exploded]
       +- 'UnresolvedRelation [tabla]
    
    == Analyzed Logical Plan ==
    exploded[0]: int
    Project [exploded#353[0] AS exploded[0]#354]
    +- Generate explode(kit#292), false, as, [exploded#353]
       +- SubqueryAlias tabla
          +- Project [array(array(1, 2), array(3, 4)) AS kit#292]
             +- Range (0, 1, step=1, splits=Some(24))
    
    == Optimized Logical Plan ==
    Project [exploded#353[0] AS exploded[0]#354]
    +- Generate explode([[1,2],[3,4]]), [0], false, as, [exploded#353]
       +- Project [[[1,2],[3,4]] AS kit#292]
          +- Range (0, 1, step=1, splits=Some(24))
    
    == Physical Plan ==
    *(2) Project [exploded#353[0] AS exploded[0]#354]
    +- Generate explode([[1,2],[3,4]]), false, [exploded#353]
       +- *(1) Project [[[1,2],[3,4]] AS kit#292]
          +- *(1) Range (0, 1, step=1, splits=24)
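
    Since all three forms optimize to the same plan, they can be used interchangeably. As a cross-check (this sketch is an addition, assuming a local Spark session is available), the same result can be produced with the DataFrame API, exploding first and then indexing into the result:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").appName("explode-demo").getOrCreate()

    # Same dataframe as in the answer above.
    df = spark.range(1).selectExpr("array(array(1,2),array(3,4)) kit")

    # Explode, then reference the exploded column in a second select.
    out = (df.select(F.explode("kit").alias("exploded"))
             .select("exploded", F.col("exploded")[0].alias("first")))

    result = [(r["exploded"], r["first"]) for r in out.collect()]
    # result == [([1, 2], 1), ([3, 4], 3)]
    spark.stop()
    ```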