Spark/Scala code no longer works in Spark 3.x

  •  0
  • Ged  · Tech Community  · 2 years ago

    This works in 2.x:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.split
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.sql.functions.{lead, lag}
    import spark.implicits._
    
    // Gen example data via DF, can come from files, ordering in those files assumed. I.e. no need to sort.
    val df = Seq(
     ("1 February"), ("n"), ("c"), ("b"), 
     ("2 February"), ("hh"), ("www"), ("e"), 
     ("3 February"), ("y"), ("s"), ("j"),
     ("1 March"), ("c"), ("b"), ("x"),
     ("1 March"), ("c"), ("b"), ("x"),
     ("2 March"), ("c"), ("b"), ("x"),
     ("3 March"), ("c"), ("b"), ("x"), ("y"), ("z")
     ).toDF("line")
    
    // Define Case Classes to avoid Row aspects on df --> rdd --> to DF which I always must look up again.
    case class X(line: String)
    case class Xtra(key: Long, line: String)
    
    // Add the Seq Num using zipWithIndex.
    val rdd = df.as[X].rdd.zipWithIndex().map{case (v,k) => (k,v)}
    val ds = rdd.toDF("key", "line").as[Xtra]
    

    In 3.x, the last statement now fails with:

    AnalysisException: Cannot up cast line from struct<line:string> to string.
    The type path of the target object is:
    - field (class: "java.lang.String", name: "line")
    - root class: "$linecfabb246f6fc445196875da751b278e883.$read.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.Xtra"
    You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
    

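    I find this message hard to understand, and the reason for the change just as unclear. I just tested on 2.4.5, and everything works fine there.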

  •  1
  •   Gabio    2 years ago

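    After the map, rdd is an RDD[(Long, X)], so toDF("key", "line") produces a line column of type struct<line:string>, which cannot be up-cast to the String field in Xtra. Since line is inferred as a struct, you can change your schema (the case classes) slightly: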

    case class X(line: String)
    case class Xtra(key: Long, nested_line: X)
    

    Then get the desired result with:

    val ds = rdd.toDF("key", "nested_line").as[Xtra].select("key", "nested_line.line")
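
    An alternative sketch, not part of the original answer: rather than changing Xtra, the X wrapper can be unwrapped inside the map, so the tuple is (Long, String) and the line column is a plain string. The object name ZipWithIndexSketch, the build helper, and the local[1] session are assumptions made only to keep the example self-contained; in spark-shell the existing spark session and the case classes from the question would be used directly.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes from the question, declared at top level so that
// Spark can derive encoders for them in compiled code.
case class X(line: String)
case class Xtra(key: Long, line: String)

// Hypothetical wrapper object, used here only for illustration.
object ZipWithIndexSketch {

  def build(spark: SparkSession): Dataset[Xtra] = {
    import spark.implicits._

    // Shortened version of the example data from the question.
    val df = Seq("1 February", "n", "c", "b", "2 February", "hh").toDF("line")

    // v is an X, so emit v.line (a String) instead of v itself.
    // The resulting RDD[(Long, String)] yields a string column, not a struct.
    val rdd = df.as[X].rdd.zipWithIndex().map { case (v, k) => (k, v.line) }

    rdd.toDF("key", "line").as[Xtra]
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("zipWithIndex-sketch")
      .getOrCreate()

    val ds = build(spark)
    ds.printSchema()
    ds.show(false)

    spark.stop()
  }
}
```

    One difference from the nested_line approach: select with column names returns an untyped DataFrame, whereas this variant keeps the result as a Dataset[Xtra].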