代码之家  ›  专栏  ›  技术社区  ›  mahmoud mehdi

如何在Spark中将DataFrame(具有嵌套结构类型)的所有列强制转换为string

  •  2
  • mahmoud mehdi  · 技术社区  · 6 年前

    出于某种原因,我正在尝试将数据帧(具有嵌套结构类型)的所有字段强制转换为字符串。

    我已经在StackOverflow中看到了一些解决方案(但是它们只在没有嵌套结构的简单数据帧上工作)(如下所示 how to cast all columns of dataframe to string )

    我将通过一个例子来解释我真正需要的是什么:

        import org.apache.spark.sql.{Row, SparkSession}
        import org.apache.spark.sql.types._
        import org.apache.spark.sql.functions._
        import spark.implicits._
        val rows1 = Seq(
        Row(1, Row("a", "b"), 8.00, Row(1,2)),
        Row(2, Row("c", "d"), 9.00, Row(3,4))
        )
    
        val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)
    
        val schema1 = StructType(
        Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
        Seq(
        StructField("x", StringType, true),
        StructField("y", StringType, true)
        )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
        Seq(
        StructField("u", IntegerType, true),
        StructField("v", IntegerType, true)
        )
        ), true)
        )
        )
    
        val df1 = spark.createDataFrame(rows1Rdd, schema1)
    
        println("Schema with nested struct")
        df1.printSchema()
    

    如果打印创建的数据帧的架构,则会得到以下结果:

    root
    |-- id: integer (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: double (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: integer (nullable = true)
    |    |-- v: integer (nullable = true)
    

    我尝试将所有值强制转换为字符串,如下所示:

      df1.select(df1.columns.map(c => col(c).cast(StringType)) : _*)
    

    但它将嵌套结构类型转换为字符串,而不是将其每个值转换为字符串:

    root
    |-- id: string (nullable = true)
    |-- s1: string (nullable = true)
    |-- d: string (nullable = true)
    |-- s2: string (nullable = true)
    

    有没有一个简单的解决方案可以帮助我将所有值转换为StringType? 下面是我希望在转换后将其作为数据帧架构的结构类型:

    root
    |-- id: string (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: string (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: string (nullable = true)
    |    |-- v: string (nullable = true)
    

    谢谢!

    3 回复  |  直到 6 年前
        1
  •  1
  •   philantrovert    6 年前

    您可以为更简单的类型和 struct 分别键入列。

    该解决方案不是很通用,但只要您只有作为复杂列的结构类型,就应该有效。代码可以处理 结构 不仅仅是两个。

    val structCastExpression = df1.schema
                                  .filter(_.dataType.isInstanceOf[StructType])
                                  .map(c=> (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
                                  .map{ case (col, sub) =>  s"""cast(${col} as struct${sub.map{ c => s"$c:string" }.mkString("<" , "," , ">")} ) as $col"""}
    //List(cast(s1 as struct<x:string,y:string> ) as s1,
    //     cast(s2 as struct<u:string,v:string> ) as s2)
    
    val otherColumns = df1.schema
                          .filterNot(_.dataType.isInstanceOf[StructType])
                          .map( c=> s""" cast(${c.name} as string) as ${c.name} """)
    //List(" cast(id as string) as id ", " cast(d as string) as d ")
    
    //original columns
    val originalColumns = df1.columns
    
    // Union both the expressions into one big expression
    val finalExpression = otherColumns.union(structCastExpression)
    // List(" cast(id as string) as id ", 
    //      " cast(d as string) as d ", 
    //      cast(s1 as struct<x:string,y:string> ) as s1, 
    //      cast(s2 as struct<u:string,v:string> ) as s2 )
    
    // Use `selectExpr` to pass the expression 
    
    df1.selectExpr(finalExpression : _*)
       .select(originalColumns.head, originalColumns.tail: _*)
       .printSchema
    
    //root
    // |-- id: string (nullable = true)
    // |-- s1: struct (nullable = true)
    // |    |-- x: string (nullable = true)
    // |    |-- y: string (nullable = true)
    // |-- d: string (nullable = true)
    // |-- s2: struct (nullable = true)
    // |    |-- u: string (nullable = true)
    // |    |-- v: string (nullable = true)
    
        2
  •  1
  •   mahmoud mehdi    6 年前

    经过几天的调查,我找到了解决问题的最好办法:

    val newSchema = StructType(
    Seq(
    StructField("id", StringType, true),
    StructField("s1", StructType(
    Seq(
    StructField("x", StringType, true),
    StructField("y", StringType, true)
    )
    ), true),
    StructField("d", StringType, true),
    StructField("s2", StructType(
    Seq(
    StructField("u", StringType, true),
    StructField("v", StringType, true)
    )
    ), true)
    )
    )
    val expressions = newSchema.map(
      field => s"CAST ( ${field.name} As ${field.dataType.sql}) ${field.name}"
    )
    val result = df1.selectExpr(expressions : _*)
    result.show()
    +---+------+---+------+
    | id|    s1|  d|    s2|
    +---+------+---+------+
    |  1|[a, b]|8.0|[1, 2]|
    |  2|[c, d]|9.0|[3, 4]|
    +---+------+---+------+
    

    我希望它能帮助一些人,我花了很多时间试图找到这个通用的解决方案(我需要它,因为我正在处理大数据帧和许多列,需要被铸造)。

        3
  •  0
  •   Manoj Kumar Dhakad    6 年前

    你可以 udf custom case class 如下所示

    case class s2(u:String,v:String)
    def changeToStr(row:Row):s2={
        return s2(row.get(0).toString(),row.get(1).toString())
      }
    
    val changeToStrUDF=udf(changeToStr _)
    val df2=df1.select(df1.col("id").cast(StringType),df1.col("s1"),df1.col("d").cast(StringType),changeToStrUDF(df1.col("s2")).alias("s2"))