
How do I get the most frequent non-null value in a column?

  • Markus  ·  6 years ago

    I have the following DataFrame df:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|                null|
    |  1|NT Note|aaaa|http://www.teleab...|
    |  1|NT Note|aaaa|http://www.teleab...|
    |  1|NT Note|aaaa|                null|
    |  1|NT Note|aaaa|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    

    I want to assign the most frequent url and type values to each id:

    val windowSpec = Window.partitionBy("id", "url", "type")
    val result = df.withColumn("count", count("url").over(windowSpec))
      .orderBy($"count".desc)
      .groupBy("id")
      .agg(
        first("url").as("URL"),
        first("type").as("Type")
      )
    

    But in fact I need the url column to look like this:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|http://www.teleab...|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    

    Instead, I currently get the output shown below, because null is more frequent than the url for record id 1:

    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  1|NT Note|aaaa|                null|
    |  2|    ABC|bbbb|                null|
    +---+-------+----+--------------------+
    
    1 Answer  |  6 years ago

  •   Manoj Kumar Dhakad  ·  6 years ago

    You can use a udf as shown below:

    import org.apache.spark.sql.functions._
    import scala.collection.mutable.WrappedArray

    // function returning the most frequent non-null url,
    // or null if the group contains no non-null url
    def mfnURL(arr: WrappedArray[String]): String = {
      val filterArr = arr.filterNot(_ == null)
      if (filterArr.isEmpty) null
      else filterArr.groupBy(identity).maxBy(_._2.size)._1
    }

    // registering udf mfnURL
    val mfnURLUDF = udf(mfnURL _)

    // applying groupBy, agg and the udf
    df.groupBy("id", "name", "type")
      .agg(mfnURLUDF(collect_list("url")).alias("url"))
      .show
    
    //Sample output
    
    +---+-------+----+--------------------+
    | id|   name|type|                 url|
    +---+-------+----+--------------------+
    |  2|    ABC|bbbb|                null|
    |  1|NT Note|aaaa|http://www.teleab...|
    +---+-------+----+--------------------+
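
    The core of mfnURL can be checked outside Spark, since it is the same filter/groupBy/maxBy chain applied to an ordinary Seq. A minimal sketch below (the function names are illustrative, not part of the answer) also adds a deterministic variant: plain maxBy(_._2.size) picks an arbitrary url when two counts tie, which can make repeated runs disagree.

    ```scala
    // Same logic as the mfnURL udf, on a plain Seq (no Spark needed):
    // drop nulls, group identical urls, keep the largest group.
    def mostFrequentNonNull(arr: Seq[String]): String = {
      val nonNull = arr.filterNot(_ == null)
      if (nonNull.isEmpty) null
      else nonNull.groupBy(identity).maxBy(_._2.size)._1
    }

    // Deterministic variant: on equal counts, maxBy over (count, url)
    // prefers the lexicographically largest url, so ties resolve the
    // same way on every run.
    def mostFrequentNonNullDet(arr: Seq[String]): String = {
      val nonNull = arr.filterNot(_ == null)
      if (nonNull.isEmpty) null
      else nonNull.groupBy(identity)
                  .map { case (url, occ) => (url, occ.size) }
                  .maxBy { case (url, n) => (n, url) }._1
    }

    println(mostFrequentNonNull(Seq(null, "a", "a", null, null)))  // a
    println(mostFrequentNonNull(Seq(null, null)))                  // null
    ```

    Note that for id 2, whose urls are all null, collect_list hands the udf an empty list, so the empty-after-filter branch returning null is what keeps that row from failing.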