
Handling a model with circular references in Spark SQL?

  • angelcervera  ·  6 years ago

    Scala / Spark SQL 2.2.1

    Suppose this is the initial modeling approach, which of course does not work (keep in mind that the real model has dozens of attributes):

    case class Branch(id: Int, branches: List[Branch] = List.empty)
    case class Tree(id: Int, branches: List[Branch])
    
    val trees = Seq(Tree(1, List(Branch(2, List.empty), Branch(3, List(Branch(4, List.empty))))))
    
    val ds = spark.createDataset(trees)
    ds.show
    

    This is the error it throws:

    java.lang.UnsupportedOperationException: cannot have circular references in class, but got the circular reference of class Branch
    

    Our maximum depth is 5 levels, so as a workaround I came up with:

    case class BranchLevel5(id: Int)
    case class BranchLevel4(id: Int, branches: List[BranchLevel5] = List.empty)
    case class BranchLevel3(id: Int, branches: List[BranchLevel4] = List.empty)
    case class BranchLevel2(id: Int, branches: List[BranchLevel3] = List.empty)
    case class BranchLevel1(id: Int, branches: List[BranchLevel2] = List.empty)
    case class Tree(id: Int, branches: List[BranchLevel1])
    

    Of course this works, but it is not at all elegant, and you can imagine the pain in the implementation (readability, coupling, maintenance, usability, code duplication, and so on).
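    The duplication also spreads to any code that builds these values: converting from the natural recursive model needs one hand-written converter per level. A minimal sketch of that boilerplate (the `toLevelN` names are hypothetical, invented for this illustration):

    ```scala
    case class Branch(id: Int, branches: List[Branch] = List.empty)

    case class BranchLevel5(id: Int)
    case class BranchLevel4(id: Int, branches: List[BranchLevel5] = List.empty)
    case class BranchLevel3(id: Int, branches: List[BranchLevel4] = List.empty)
    case class BranchLevel2(id: Int, branches: List[BranchLevel3] = List.empty)
    case class BranchLevel1(id: Int, branches: List[BranchLevel2] = List.empty)

    // One near-identical converter per level; anything below level 5 is cut off.
    def toLevel5(b: Branch): BranchLevel5 = BranchLevel5(b.id)
    def toLevel4(b: Branch): BranchLevel4 = BranchLevel4(b.id, b.branches.map(toLevel5))
    def toLevel3(b: Branch): BranchLevel3 = BranchLevel3(b.id, b.branches.map(toLevel4))
    def toLevel2(b: Branch): BranchLevel2 = BranchLevel2(b.id, b.branches.map(toLevel3))
    def toLevel1(b: Branch): BranchLevel1 = BranchLevel1(b.id, b.branches.map(toLevel2))
    ```

    Every new attribute on the model has to be copied into five case classes and five converters, which is exactly the maintenance cost described above.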

    So the question is: how do you handle a model with circular references in Spark SQL?

    1 Answer  |  6 years ago

  •   Worakarn Isaratham  ·  6 years ago

    If you are comfortable using a private API, here is one way that works: treat the whole self-referencing structure as a user-defined type (UDT). I followed this answer: https://stackoverflow.com/a/51957666/1823254 .

    package org.apache.spark.custom.udts // we're calling some private API so need to be under 'org.apache.spark'
    
    import java.io._
    import org.apache.spark.sql.types.{DataType, UDTRegistration, UserDefinedType}
    
    class BranchUDT extends UserDefinedType[Branch] {
    
      override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
      override def serialize(obj: Branch): Any = {
        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(obj)
        oos.close() // flush the ObjectOutputStream's buffer before reading the bytes
        bos.toByteArray
      }
      override def deserialize(datum: Any): Branch = {
        val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
        val ois = new ObjectInputStream(bis)
        val obj = ois.readObject()
        obj.asInstanceOf[Branch]
      }
    
      override def userClass: Class[Branch] = classOf[Branch]
    }
    
    object BranchUDT {
      def register() = UDTRegistration.register(classOf[Branch].getName, classOf[BranchUDT].getName)
    }
    

    BranchUDT.register()
    val trees = Seq(Tree(1, List(Branch(2, List.empty), Branch(3, List(Branch(4, List.empty))))))
    
    val ds = spark.createDataset(trees)
    ds.show(false)
    
    //+---+----------------------------------------------------+
    //|id |branches                                            |
    //+---+----------------------------------------------------+
    //|1  |[Branch(2,List()), Branch(3,List(Branch(4,List())))]|
    //+---+----------------------------------------------------+
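
    Note that the UDT above is just standard Java serialization wrapped in Spark's UDT interface, so the round-trip it performs can be checked in isolation, without a SparkSession. A minimal sketch (the `toBytes`/`fromBytes` names are made up here; they mirror the `serialize`/`deserialize` bodies of `BranchUDT`):

    ```scala
    import java.io._

    case class Branch(id: Int, branches: List[Branch] = List.empty)

    // Mirrors BranchUDT.serialize: Java-serialize the object graph to a byte array.
    def toBytes(obj: Branch): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close()
      bos.toByteArray
    }

    // Mirrors BranchUDT.deserialize: rebuild the Branch from the byte array.
    def fromBytes(bytes: Array[Byte]): Branch = {
      val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
      try ois.readObject().asInstanceOf[Branch]
      finally ois.close()
    }
    ```

    The trade-off of this approach is that the column is stored as opaque `BinaryType` bytes: Spark cannot see inside it, so you lose columnar pruning and SQL access to the nested fields, and can only work with the values through the typed Dataset API.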