
How do I join two datasets in Scala?

  •  0
  • user3407267  · Tech Community  · 6 years ago

    I have two datasets:

    itemname       itemId       coupons
    A               1            true
    A               2            false
    
    
    itemname      purchases
    B               10
    A               10
    C               10
    

    I want to get:

    itemname   itemId   coupons  purchases
    A             1       true      10
    A             2       false     10
    

    This is what I am doing:

    val mm = items.join(purchases, items("itemname") === purchases("itemname")).drop(items("itemname"))
    

    Is this the right way to do it in Spark with Scala?

    1 answer  |  as of 6 years ago
        1
  •  1
  •   Prasad Khode    6 years ago

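    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Schemas for the two sample DataFrames (assumes an existing SparkSession named `spark`)
    val itemsSchema = List(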
      StructField("itemname", StringType, nullable = false),
      StructField("itemid", IntegerType, nullable = false),
      StructField("coupons", BooleanType, nullable = false))
    
    val purchasesSchema = List(
      StructField("itemname", StringType, nullable = false),
      StructField("purchases", IntegerType, nullable = false))
    
    
    val items = Seq(Row("A", 1, true), Row("A", 2, false))
    val purchases = Seq(Row("A", 10), Row("B", 10), Row("C", 10))
    
    val itemsDF = spark.createDataFrame(
      spark.sparkContext.parallelize(items),
      StructType(itemsSchema)
    )
    
    val purchasesDF = spark.createDataFrame(
      spark.sparkContext.parallelize(purchases),
      StructType(purchasesSchema)
    )
    
    purchasesDF.join(itemsDF, Seq("itemname")).show(false)
    

    +--------+---------+------+-------+
    |itemname|purchases|itemid|coupons|
    +--------+---------+------+-------+
    |A       |10       |1     |true   |
    |A       |10       |2     |false  |
    +--------+---------+------+-------+
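
To compare the question's approach with the join shown above, here is a minimal sketch that runs both forms on the same sample data. It assumes a local `SparkSession` (in `spark-shell`, `spark` already exists) and builds the DataFrames with `toDF` rather than explicit schemas. The names `viaExpression` and `viaColumnName` are illustrative only. Both forms perform an inner join, so `B` and `C` drop out because they have no match in `items`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell, `spark` already
// exists and this line can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val items = Seq(("A", 1, true), ("A", 2, false)).toDF("itemname", "itemId", "coupons")
val purchases = Seq(("B", 10), ("A", 10), ("C", 10)).toDF("itemname", "purchases")

// Form used in the question: join on a column expression. The joined result
// carries two itemname columns, so one of them is dropped afterwards.
val viaExpression = items
  .join(purchases, items("itemname") === purchases("itemname"))
  .drop(items("itemname"))
  .select("itemname", "itemId", "coupons", "purchases")

// Form used in the answer: join on a column name. The key column appears
// only once in the result, so no drop is needed.
val viaColumnName = items
  .join(purchases, Seq("itemname"))
  .select("itemname", "itemId", "coupons", "purchases")

viaExpression.show(false)
viaColumnName.show(false)
```

The final `select` only fixes the column order to match the layout requested in the question. If unmatched rows should be kept, `join` also accepts a join type as a third argument, for example `items.join(purchases, Seq("itemname"), "left")`.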
    

    Hope this helps.