
Counter function on an ArrayColumn in PySpark

counter apache-spark-sql pyspark apache-spark

Carmen Pérez Carrillo · Tech community · 6 years ago

Starting from this DataFrame:

+-----+------------------+
|store|      values      |
+-----+------------------+
|    1|[1, 2, 3, 4, 5, 6]|
|    2|            [2, 3]|
+-----+------------------+

I want to apply the Counter function to each array to get:

+-----+------------------------------+
|store|     values                   |
+-----+------------------------------+
|    1|{1:1, 2:1, 3:1, 4:1, 5:1, 6:1}|
|    2|{2:1, 3:1}                    |
+-----+------------------------------+

I got this DataFrame using the answer to another question:

GroupBy and concat array columns pyspark

So I tried to modify the code from that answer, as follows:

Option 1:

def flatten_counter(val):
    return Counter(reduce (lambda x, y:x+y, val))

udf_flatten_counter = sf.udf(flatten_counter,     ty.ArrayType(ty.IntegerType()))
df3 = df2.select("store", flatten_counter("values2").alias("values3"))
df3.show(truncate=False)

Option 2:

df.rdd.map(lambda r: (r.store, r.values)).reduceByKey(lambda x, y: x + y).map(lambda row: Counter(row[1])).toDF(['store', 'values']).show()

But neither option works.

Does anyone know how I can do this?

Thank you very much.

1 answer | active until 5 years ago

Alper t. Turker · 6 years ago

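You just need to provide the correct return type. A Counter is a mapping from value to count, so the UDF should be declared with MapType rather than ArrayType: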

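from collections import Counter

import pyspark.sql.functions as sf
import pyspark.sql.types as ty

# Counter yields a value -> count mapping, so the UDF returns a MapType.
udf_flatten_counter = sf.udf(
    lambda x: dict(Counter(x)),
    ty.MapType(ty.IntegerType(), ty.IntegerType()))

# `spark` is assumed to be an existing SparkSession.
df = spark.createDataFrame(
    [(1, [1, 2, 3, 4, 5, 6]), (2, [2, 3])], ("store", "values"))

df.withColumn("cnt", udf_flatten_counter("values")).show(2, False)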
# +-----+------------------+---------------------------------------------------+
# |store|values            |cnt                                                |
# +-----+------------------+---------------------------------------------------+
# |1    |[1, 2, 3, 4, 5, 6]|Map(5 -> 1, 1 -> 1, 6 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)|
# |2    |[2, 3]            |Map(2 -> 1, 3 -> 1)                                |
# +-----+------------------+---------------------------------------------------+
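
If the column still holds nested arrays (as `values2` in the question's option 1 appears to, given its `reduce` call), the flattening step can live in the same function. Here is a minimal sketch in plain Python, assuming each cell arrives as a list of lists of integers. The name `flatten_and_count` is hypothetical, and the UDF registration is left as a comment because it needs a Spark session:

```python
from collections import Counter
from itertools import chain


def flatten_and_count(nested):
    """Flatten a list of lists and count each element, returning a plain dict."""
    if nested is None:  # null cells reach a Python UDF as None
        return None
    return dict(Counter(chain.from_iterable(nested)))


# Hypothetical registration, mirroring the UDF in the answer:
# udf_flatten_and_count = sf.udf(
#     flatten_and_count, ty.MapType(ty.IntegerType(), ty.IntegerType()))

print(flatten_and_count([[1, 2, 3], [2, 3]]))  # {1: 1, 2: 2, 3: 2}
```

Using `chain.from_iterable` avoids building an intermediate list for every `x + y` step, which is what the `reduce` version does.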

The same works with the RDD API:

df.rdd.mapValues(Counter).mapValues(dict).toDF(["store", "values"]).show(2, False)
# +-----+---------------------------------------------------+
# |store|values                                             |
# +-----+---------------------------------------------------+
# |1    |Map(5 -> 1, 1 -> 1, 6 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)|
# |2    |Map(2 -> 1, 3 -> 1)                                |
# +-----+---------------------------------------------------+

The conversion to dict is needed because Pyrolite apparently cannot handle Counter objects.
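
The type difference is easy to check in plain Python, without Spark. `Counter` is a subclass of `dict`, so both objects hold the same keys and counts, but they are different types when serialized. This sketch only illustrates the types; it does not reproduce the Pyrolite behaviour itself.

```python
from collections import Counter

counts = Counter([2, 3])
plain = dict(counts)  # same contents, but a built-in dict

print(type(counts).__name__)  # Counter
print(type(plain).__name__)   # dict
print(counts == plain)        # True: same keys and counts
```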
