
Apache Spark group by combining types and subtypes

  •  0
  • ds_user  · Tech Community  · 6 years ago

    I have this dataset in Spark:

    val sales = Seq(
      ("Warsaw", 2016, "facebook", "share", 100),
      ("Warsaw", 2017, "facebook", "like", 200),
      ("Boston", 2015, "twitter", "share", 50),
      ("Boston", 2016, "facebook", "share", 150),
      ("Toronto", 2017, "twitter", "like", 50)
    ).toDF("city", "year", "media", "action", "amount")
    

    I can currently group by city and media:

    val groupByCityAndMedia = sales
      .groupBy("city", "media")
      .count()
    groupByCityAndMedia.show()
    
    +-------+--------+-----+
    |   city|   media|count|
    +-------+--------+-----+
    | Boston|facebook|    1|
    | Boston| twitter|    1|
    |Toronto| twitter|    1|
    | Warsaw|facebook|    2|
    +-------+--------+-----+
    

    But how can I combine media and action into a single column, so that the expected output is:

    +-------+--------+-----+
    | Boston|facebook|    1|
    | Boston|   share|    2|
    | Boston| twitter|    1|
    |Toronto| twitter|    1|
    |Toronto|    like|    1|
    | Warsaw|facebook|    2|
    | Warsaw|   share|    1|
    | Warsaw|    like|    1|
    +-------+--------+-----+
    
    1 answer  |  as of 6 years ago
  •  1
  •   akuiper    6 years ago

    Combine the media and action columns into an array column, explode it, then do a groupBy count:

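    import org.apache.spark.sql.functions.{array, explode}

    // Emit one row per (city, media) and one per (city, action), then count each pair.
    sales.select(
        $"city", explode(array($"media", $"action")).as("mediaAction")
    ).groupBy("city", "mediaAction").count().show()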
    
    +-------+-----------+-----+
    |   city|mediaAction|count|
    +-------+-----------+-----+
    | Boston|      share|    2|
    | Boston|   facebook|    1|
    | Warsaw|      share|    1|
    | Boston|    twitter|    1|
    | Warsaw|       like|    1|
    |Toronto|    twitter|    1|
    |Toronto|       like|    1|
    | Warsaw|   facebook|    2|
    +-------+-----------+-----+
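
    To make the array/explode step concrete, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The names Sale, ExplodeCountSketch and mediaActionCounts are made up for illustration; flatMap plays the role of explode(array(...)) and groupBy plus size plays the role of groupBy(...).count().

```scala
object ExplodeCountSketch {
  // One sales record, mirroring the columns of the `sales` DataFrame (hypothetical type).
  final case class Sale(city: String, year: Int, media: String, action: String, amount: Int)

  // Emit (city, media) and (city, action) for every record, then count each distinct pair.
  def mediaActionCounts(sales: Seq[Sale]): Map[(String, String), Int] =
    sales
      .flatMap(s => Seq((s.city, s.media), (s.city, s.action)))
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val sales = Seq(
      Sale("Warsaw", 2016, "facebook", "share", 100),
      Sale("Warsaw", 2017, "facebook", "like", 200),
      Sale("Boston", 2015, "twitter", "share", 50),
      Sale("Boston", 2016, "facebook", "share", 150),
      Sale("Toronto", 2017, "twitter", "like", 50)
    )
    // Sort only for stable printing; the counts themselves are order-independent.
    mediaActionCounts(sales).toSeq.sorted.foreach {
      case ((city, value), n) => println(s"$city $value $n")
    }
  }
}
```

    Each input record contributes exactly two pairs, so the counts always add up to twice the number of records.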
    

    Or, assuming media and action are disjoint (the two columns share no values):

    sales.groupBy("city", "media").count().union(
        sales.groupBy("city", "action").count()
    ).show
    +-------+--------+-----+
    |   city|   media|count|
    +-------+--------+-----+
    | Boston|facebook|    1|
    | Boston| twitter|    1|
    |Toronto| twitter|    1|
    | Warsaw|facebook|    2|
    | Boston|   share|    2|
    | Warsaw|   share|    1|
    | Warsaw|    like|    1|
    |Toronto|    like|    1|
    +-------+--------+-----+
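
    If the disjointness assumption does not hold, the plain union would return two separate rows for a value that occurs in both columns. A possible variation, sketched below, renames both results to a common schema and re-aggregates after the union. The names UnionCounts, countsBy and combinedCounts are hypothetical, and the local[*] session is only an assumption for running it standalone; in spark-shell the existing spark session would be used instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object UnionCounts {
  // Count rows per (city, value) for one column, renamed to a common schema.
  def countsBy(df: DataFrame, column: String): DataFrame =
    df.groupBy("city", column).count().toDF("city", "mediaAction", "count")

  // Union both counts, then sum them so a value present in both columns yields one row.
  def combinedCounts(df: DataFrame): DataFrame =
    countsBy(df, "media")
      .union(countsBy(df, "action"))
      .groupBy("city", "mediaAction")
      .agg(sum("count").as("count"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for a standalone run.
    val spark = SparkSession.builder().master("local[*]").appName("union-counts").getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("Warsaw", 2016, "facebook", "share", 100),
      ("Warsaw", 2017, "facebook", "like", 200),
      ("Boston", 2015, "twitter", "share", 50),
      ("Boston", 2016, "facebook", "share", 150),
      ("Toronto", 2017, "twitter", "like", 50)
    ).toDF("city", "year", "media", "action", "amount")

    combinedCounts(sales).orderBy("city", "mediaAction").show()
    spark.stop()
  }
}
```

    The renaming also gives the combined column a neutral header; without it, union keeps the column names of the first DataFrame, which is why the output above is labelled media.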