
How do I get the most frequent non-null value in a column?

  • Markus  ·  6 years ago

    I have the following DataFrame df:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|                null|
    |  1|NT Note|aaaa|http://www.teleab...|
    |  1|NT Note|aaaa|http://www.teleab...|
    |  1|NT Note|aaaa|                null|
    |  1|NT Note|aaaa|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    

    I want to assign the most frequent url and type values to each id:

    val windowSpec = Window.partitionBy("id", "url", "type")
    val result = df.withColumn("count", count("url").over(windowSpec))
      .orderBy($"count".desc)
      .groupBy("id")
      .agg(
        first("url").as("URL"),
        first("type").as("Type")
      )
    

    But in fact I need the url column to look like this:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|http://www.teleab...|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    

    Instead, I currently get the output shown below, because null is more frequent than the url for record id 1:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|                null|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    
    1 Answer  |  6 years ago

  •   Manoj Kumar Dhakad  ·  6 years ago

    You can use a udf as shown below:

    import org.apache.spark.sql.functions._
    import scala.collection.mutable.WrappedArray

    // function returning the most frequent non-null url,
    // or null if the group contains no non-null url
    def mfnURL(arr: WrappedArray[String]): String = {
      val filterArr = arr.filterNot(_ == null)
      if (filterArr.isEmpty) null
      else filterArr.groupBy(identity).maxBy(_._2.size)._1
    }

    // registering udf mfnURL
    val mfnURLUDF = udf(mfnURL _)

    // applying groupBy, agg and the udf
    df.groupBy("id", "name", "type")
      .agg(mfnURLUDF(collect_list("url")).alias("url"))
      .show
    
    //Sample output
    
    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  2|    ABC|bbbb|                null|
    |  1|NT Note|aaaa|http://www.teleab...|
    +---+-------+----+--------------------+
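
    The core of mfnURL can be checked outside Spark, since it is the same filter/groupBy/maxBy chain applied to an ordinary Seq. A minimal sketch below (the function names are illustrative, not part of the answer) also adds a deterministic variant: plain maxBy(_._2.size) picks an arbitrary url when two counts tie, which can make repeated runs disagree.

    ```scala
    // Same logic as the mfnURL udf, on a plain Seq (no Spark needed):
    // drop nulls, group identical urls, keep the largest group.
    def mostFrequentNonNull(arr: Seq[String]): String = {
      val nonNull = arr.filterNot(_ == null)
      if (nonNull.isEmpty) null
      else nonNull.groupBy(identity).maxBy(_._2.size)._1
    }

    // Deterministic variant: on equal counts, maxBy over (count, url)
    // prefers the lexicographically largest url, so ties resolve the
    // same way on every run.
    def mostFrequentNonNullDet(arr: Seq[String]): String = {
      val nonNull = arr.filterNot(_ == null)
      if (nonNull.isEmpty) null
      else nonNull.groupBy(identity)
                  .map { case (url, occ) => (url, occ.size) }
                  .maxBy { case (url, n) => (n, url) }._1
    }

    println(mostFrequentNonNull(Seq(null, "a", "a", null, null)))  // a
    println(mostFrequentNonNull(Seq(null, null)))                  // null
    ```

    Note that for id 2, whose urls are all null, collect_list hands the udf an empty list, so the empty-after-filter branch returning null is what keeps that row from failing.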