Removing consecutive duplicates in a PySpark DataFrame

pyspark-sql pyspark apache-spark

Qubix · Tech Community · 6 years ago

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  3|2.0|
## |  3|1.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I want to remove consecutive duplicates of id, keeping only the first row of each run, to get:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I found ways of doing this

1 reply | 6 years ago

plalanne 6 years ago

This answer should do what you want, though there may still be room for optimization:

from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
from pyspark.sql.window import Window as W

# assumes an active SparkSession named `spark`
test_df = spark.createDataFrame([
    (2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)
    ], ("id", "num"))
# add a temporary increasing ID, because a window needs an explicit ordering
test_df = test_df.withColumn("idx", monotonically_increasing_id())
w = W.orderBy("idx")  # no partitionBy, so Spark moves all rows into a single partition
# True when the previous row has a different id, or when there is no previous row
get_last = when(lag("id", 1).over(w) == col("id"), False).otherwise(True)

# keep only the rows where the id changed
test_df.withColumn("changed", get_last).filter(col("changed")).select("id", "num").show()

+---+---+
| id|num|
+---+---+
|  2|3.0|
|  3|6.0|
|  2|9.0|
|  4|7.0|
+---+---+
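
For small inputs, the same rule can be checked in plain Python on the driver, which may help when verifying the Spark result. The sketch below is only an illustration: `drop_consecutive_duplicates` is a hypothetical helper name, not part of PySpark, and it assumes the rows are already in the intended order and fit in memory.

```python
from itertools import groupby
from operator import itemgetter


def drop_consecutive_duplicates(rows, key=itemgetter(0)):
    """Keep only the first row of each run of adjacent rows sharing the same key."""
    # groupby only groups adjacent items, which matches "consecutive" here
    return [next(group) for _, group in groupby(rows, key=key)]


rows = [(2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)]
result = drop_consecutive_duplicates(rows)
print(result)  # [(2, 3.0), (3, 6.0), (2, 9.0), (4, 7.0)]
```

Note that id 2 appears twice in the result: only adjacent repeats are dropped, which is why a plain dropDuplicates on id would not give the same answer.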
