
Spark: update values in the second dataset based on values from the first dataset

Masterbuilder · asked 6 years ago

    I have two Spark datasets. The first has the columns accountid and key, where the key column holds an array of the form [key1, key2, key3, ...]. The second has the columns accountid and result, where result is a JSON string of the form {key: value, key: value, ...}. Whenever a key appears for an accountid in the first dataset, I need to update that key's value in the second dataset.

    import org.apache.spark.sql.functions._
    val df = sc.parallelize(Seq(("20180610114049", "id1", "key1"),
      ("20180610114049", "id2", "key2"),
      ("20180610114049", "id1", "key1"),
      ("20180612114049", "id2", "key1"),
      ("20180613114049", "id3", "key2"),
      ("20180613114049", "id3", "key3")
    )).toDF("date", "accountid", "key")
    val gp = df.groupBy("accountid", "date").agg(collect_list("key"))
    
    +---------+--------------+-----------------+
    |accountid|          date|collect_list(key)|
    +---------+--------------+-----------------+
    |      id2|20180610114049|           [key2]|
    |      id1|20180610114049|     [key1, key1]|
    |      id3|20180613114049|     [key2, key3]|
    |      id2|20180612114049|           [key1]|
    +---------+--------------+-----------------+
    
    
    val df2 = sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
      ("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
      ("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
      ("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
      ("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
     )).toDF("date","accountid", "result")
    
    +--------------+---------+----------------------------------------+
    |date          |accountid|result                                  |
    +--------------+---------+----------------------------------------+
    |20180610114049|id1      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    |20180610114049|id2      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    |20180611114049|id1      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    |20180612114049|id2      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    |20180613114049|id3      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    +--------------+---------+----------------------------------------+
    

    Expected output

    +--------------+---------+----------------------------------------+
    |date          |accountid|result                                  |
    +--------------+---------+----------------------------------------+
    |20180610114049|id1      |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
    |20180610114049|id2      |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
    |20180611114049|id1      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
    |20180612114049|id2      |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
    |20180613114049|id3      |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
    +--------------+---------+----------------------------------------+
    
    2 Answers
1 · philantrovert · 6 years ago

    You can pass both the array and the JSON string to a UDF after joining the two dataframes on date and accountid.

    val gp = df.groupBy("accountid", "date").agg(collect_list("key").as("key"))
    
    val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
    
    joined.show(false)
    //+--------------+---------+----------------------------------------+------------+
    //|date          |accountid|result                                  |key         |
    //+--------------+---------+----------------------------------------+------------+
    //|20180610114049|id2      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2]      |
    //|20180613114049|id3      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
    //|20180610114049|id1      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
    //|20180611114049|id1      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null        |
    //|20180612114049|id2      |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1]      |
    //+--------------+---------+----------------------------------------+------------+
    
    // the UDF that will do the most work
    // it's important to declare `formats` inside the function
    // to avoid object not Serializable exception
    // Not all cases are covered, use with caution :D
    val convertJsonValues = udf{(json: String, arr: Seq[String]) =>
        import org.json4s.jackson.JsonMethods._
        import org.json4s.JsonDSL._
    implicit val formats = org.json4s.DefaultFormats
        // replace single quotes with double
        val kvMap = parse(json.replaceAll("'", """"""")).values.asInstanceOf[Map[String,String]]
        val updatedKV = kvMap.map{ case(k,v) => if(arr.contains(k)) (k,"1.0") else (k,v) }
        compact(render(updatedKV))
    }
    
    // Use when-otherwise and send empty array where `key` is null
    joined.select($"date", 
                  $"accountid",
                  when($"key".isNull, convertJsonValues($"result", array()))
                   .otherwise(convertJsonValues($"result", $"key"))
                   .as("result")
                  ).show(false)
    
    //+--------------+---------+----------------------------------------+
    //|date          |accountid|result                                  |
    //+--------------+---------+----------------------------------------+
    //|20180610114049|id2      |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
    //|20180613114049|id3      |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
    //|20180610114049|id1      |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    //|20180611114049|id1      |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
    //|20180612114049|id2      |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    //+--------------+---------+----------------------------------------+
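    As a side note, if you are on Spark 3.0 or later you can do the same update without a hand-written JSON parser, by parsing the string into a map and using the built-in transform_values. This is only a sketch, assuming Spark 3.0+ (transform_values does not exist in earlier versions); it reuses the joined dataframe from above:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{MapType, StringType}

    // parse the single-quoted JSON string into a map<string,string>,
    // flip the matched keys to "1.0", and serialize back to JSON
    val updated = joined
      .withColumn("parsed",
        from_json(regexp_replace($"result", "'", "\""), MapType(StringType, StringType)))
      .withColumn("result",
        to_json(transform_values($"parsed", (k, v) =>
          // treat a null key array (unmatched left join) as empty
          when(array_contains(coalesce($"key", typedLit(Seq.empty[String])), k), lit("1.0"))
            .otherwise(v))))
      .drop("parsed")

    updated.show(false)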
    
2 · Ramesh Maharjan · 6 years ago

    You can use a udf function after you join the two dataframes (comments are provided in the code for further explanation).

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    
    //aliasing the collected key
    val gp = df.groupBy("accountid","date").agg(collect_list("key").as("keys"))
    
    //schema for converting json to struct
    val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
    
    //udf function to update the values of struct where result is a case class
    def updateKeysUdf = udf((arr: Seq[String], json: Row) =>
      Seq(json.schema.fieldNames.map(key => if (arr.contains(key)) "1.0" else json.getAs[String](key)))
        .collect { case Array(a, b, c) => result(a, b, c) }
        .toList(0))
    
    //changing the json string to a struct using the above schema
    df2.withColumn("result", from_json(col("result"), schema))
      .as("df2")   //aliasing df2 for joining and selecting
      .join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left")   //aliasing the gp dataframe and joining on accountid
      .select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result"))   //calling the udf above and converting the struct back to a json string
      .show(false)
    

    where result is a case class:

    case class result(key1: String, key2: String, key3: String)
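    Note that the case class has to be in scope before updateKeysUdf is defined, so in the spark-shell paste it ahead of the udf.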
    

    This should give you the following. Note that because the join is on accountid alone rather than on both accountid and date, accounts that appear on several dates produce duplicated rows:

    +---------+--------------+----------------------------------------+
    |accountid|date          |result                                  |
    +---------+--------------+----------------------------------------+
    |id3      |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
    |id1      |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    |id1      |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    |id2      |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
    |id2      |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    |id2      |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
    |id2      |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
    +---------+--------------+----------------------------------------+
    

    I hope the answer is helpful.