代码之家 › 专栏 › 技术社区 › funseiki

多个pyspark“window()”调用在执行“groupby()”时显示错误。

apache-spark-sql pyspark apache-spark python

funseiki · 技术社区 · 6 年前

这个问题是对 this answer . 出现以下情况时,火花显示错误:

# Group results in 12 second windows of "foo", then by integer buckets of 2 for "bar"
fooWindow = window(col("foo"), "12 seconds"))

# A sub bucket that contains values in [0,2), [2,4), [4,6]...
barWindow = window(col("bar").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")

results = df.groupBy(fooWindow, barWindow).count()

错误是:

“多个时间窗口表达式将导致笛卡尔积 ,因此当前不支持行。“

有什么方法可以达到预期的行为吗?

1 回复 | 直到 6 年前

funseiki 6 年前

我能用 this SO answer .

注意:此解决方案仅在最多有一个呼叫 window ,表示不允许多个时间窗口。快速搜索 spark github 显示有一个硬限制 <= 1 窗户。

通过使用 withColumn 要为每一行定义存储桶,我们可以直接按该新列分组:

from pyspark.sql import functions as F
from datetime import datetime as dt, timedelta as td

start = dt.now()
second = td(seconds=1)
data = [(start, 0), (start+second, 1), (start+ (12*second), 2)]
df = spark.createDataFrame(data, ('foo', 'bar'))

# Create a new column defining the window for each bar
df = df.withColumn("barWindow", F.col("bar") - (F.col("bar") % 2))

# Keep the time window as is
fooWindow = F.window(F.col("foo"), "12 seconds").start.alias("foo")

# Use the new column created
results = df.groupBy(fooWindow, F.col("barWindow")).count().show()

# +-------------------+---------+-----+
# |                foo|barWindow|count|
# +-------------------+---------+-----+
# |2019-01-24 14:12:48|        0|    2|
# |2019-01-24 14:13:00|        2|    1|
# +-------------------+---------+-----+

推荐文章

Kevin Smeeks · Pyspark JDBC分区读取

5 月前

user3579222 · 阅读以前的Spark API

5 月前

Danylo Kuznetsov · 如何在PySpark Rancher中将DataFrame转换为整数?

6 月前

JFlo · 在PySpark笔记本中读取多个Parquet文件

6 月前

Matthew Thomas · partition覆盖动态和“逻辑”分区

10 月前

lenpyspanacb · 在Pyspark中计算重复次数

10 月前

Jamal Khan · 如何在Apache Spark中读取500 GB的大文件CSV文件并对其执行聚合?

11 月前

Jamal Khan · 我们如何在Apache Spark中实现CDC(变更数据捕获)?

11 月前

maximodesousadias · 如何根据条件删除日期后的记录

11 月前

Joe Bloggr · 如何将Dataframe类型的函数参数传递给SparkSQL查询

1 年前