
Using static name values in Spark

  • Masterbuilder · 6 years ago

    I have a DataFrame with two array columns:

    +---------+-----------------------+
    |itemval  |fruit                  |
    +---------+-----------------------+
    |[1, 2, 3]|[apple, banana, orange]|
    +---------+-----------------------+
    

    I am trying to zip them and create name-value pairs:

    +---------+-----------------------+--------------------------------------+
    |itemval  |fruit                  |ziped                                 |
    +---------+-----------------------+--------------------------------------+
    |[1, 2, 3]|[apple, banana, orange]|[[1, apple], [2, banana], [3, orange]]|
    +---------+-----------------------+--------------------------------------+
    

    I then convert the result to JSON. The to_json output looks like this:

    +---------------------------------------------------------------------------+
    |ziped                                                                      |
    +---------------------------------------------------------------------------+
    |[{"_1":"1","_2":"apple"},{"_1":"2","_2":"banana"},{"_1":"3","_2":"orange"}]|
    +---------------------------------------------------------------------------+
    

    The format I expect is this:

    +------------------------------------------------------------------------------------------------+
    |ziped                                                                                           |
    +------------------------------------------------------------------------------------------------+
    |[{"itemval":"1","name":"apple"},{"itemval":"2","name":"banana"},{"itemval":"3","name":"orange"}]|
    +------------------------------------------------------------------------------------------------+
    

    Here is my implementation:

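    import org.apache.spark.sql.functions.{to_json, udf}
    import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

    val df1 = Seq((Array(1, 2, 3), Array("apple", "banana", "orange"))).toDF("itemval", "fruit")
    df1.show(false)

    // list1 receives fruit and list2 receives itemval, so the result is
    // a sequence of (itemval, fruit) tuples, serialized as _1 and _2.
    def zipper = udf((list1: Seq[String], list2: Seq[String]) => {
      list2 zip list1
    })

    df1.withColumn("ziped", to_json(zipper($"fruit", $"itemval"))).drop("itemval", "fruit").show(false)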
    
    1 reply

  • Masterbuilder · 6 years ago

    This is my solution: create a schema with the new field names and cast the column to it.

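    import org.apache.spark.sql.functions.to_json
    import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

    // Target type: an array of structs with the desired field names.
    val schema = ArrayType(
      StructType(
        Array(
          StructField("itemval", StringType),
          StructField("name", StringType)
        )
      )
    )

    // zival is not defined in this post; it is presumably the DataFrame that
    // holds the "ziped" column (the UDF result before to_json is applied).
    // The cast maps struct fields by position: _1 -> itemval, _2 -> name.
    val casted = zival.withColumn("result", $"ziped".cast(schema))
    casted.show(false)
    casted.select(to_json($"result")).show(false)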
    

    The output will be:

    casted: org.apache.spark.sql.DataFrame
      ziped: array
        element: struct
          _1: string
          _2: string
      result: array
        element: struct
          itemval: string
          name: string
    
    +-----------------------------------------------------------------+
    |structstojson(result)                                            |
    +-----------------------------------------------------------------+
    |[{"itemval":"3","name":"orange"},{"itemval":"2","name":"banana"}]|
    +-----------------------------------------------------------------+
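
    For readers who want to try the cast approach end to end, here is a minimal sketch that skips the UDF. It is not the code from the post: it assumes Spark 2.4 or later (for the built-in arrays_zip) and a local SparkSession, and the object name NamedZip and the helper zipToJson are made up for illustration. It relies on the same idea as the answer above, namely that casting one struct type to another matches fields by position, so the cast can be used to assign static names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, to_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical wrapper object; the name is for illustration only.
object NamedZip {
  // Target element type: the static field names wanted in the JSON.
  val schema: ArrayType = ArrayType(
    StructType(Seq(
      StructField("itemval", StringType),
      StructField("name", StringType)
    ))
  )

  // Zip the two array columns element-wise, then rename the struct fields
  // by casting. The struct-to-struct cast is positional, so the first
  // zipped field becomes "itemval" and the second becomes "name".
  def zipToJson(df: DataFrame): DataFrame =
    df.select(
      to_json(arrays_zip(col("itemval"), col("fruit")).cast(schema)).as("ziped")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("named-zip")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((Array(1, 2, 3), Array("apple", "banana", "orange")))
      .toDF("itemval", "fruit")

    zipToJson(df1).show(false)
    spark.stop()
  }
}
```

    Because the cast also converts the integer itemval values to strings, the JSON values come out quoted, as in the expected format from the question. If numeric values are preferred, the itemval field in the schema could be declared as IntegerType instead.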