代码之家  ›  专栏  ›  技术社区  ›  Masterbuilder

使用窗口函数将前一行与当前行相加

  •  1
  • Masterbuilder  · 技术社区  · 6 年前

    我有一个spark数据框,其中,我想根据当前行金额值和基于groupid和id的金额值的前一行和计算一个运行总计。让我输出df

    import findspark
    findspark.init()
    import pyspark 
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    import pandas as pd
    
    
     sc = spark.sparkContext
    data1 = {'date': {0: '2018-04-03', 1: '2018-04-04', 2: '2018-04-05', 3: '2018-04-06', 4: '2018-04-07'},
             'id': {0: 'id1', 1: 'id2', 2: 'id1', 3: 'id3', 4: 'id2'},
             'group': {0: '1', 1: '1', 2: '1', 3: '2', 4: '1'},
             'amount': {0: 50, 1: 40, 2: 50, 3: 55, 4: 20}}
    df1_pd = pd.DataFrame(data1, columns=data1.keys())
    
    df1 = spark.createDataFrame(df1_pd)
    df1.show()
    
    
    +----------+---+-----+------+
    |      date| id|group|amount|
    +----------+---+-----+------+
    |2018-04-03|id1|    1|    50|
    |2018-04-04|id2|    1|    40|
    |2018-04-05|id1|    1|    50|
    |2018-04-06|id3|    2|    55|
    |2018-04-07|id2|    1|    20|
    +----------+---+-----+------+
    

    我要找的出口

    +----------+---+-----+------+---+
    |      date| id|group|amount|sum|
    +----------+---+-----+------+---+
    |2018-04-03|id1|    1|    50|50 |
    |2018-04-04|id2|    1|    40|90 |
    |2018-04-05|id1|    1|    50|140|
    |2018-04-06|id3|    2|    55|55 |
    |2018-04-07|id2|    1|    20|160|
    +----------+---+-----+------+---+
    
    1 回复  |  直到 6 年前
        1
  •  6
  •   user9732570 user9732570    6 年前

    窗口定义:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import sum
    
    w = Window.partitionBy("group").orderBy("date").rowsBetween(
        Window.unboundedPreceding,  # Take all rows from the beginning of frame
        Window.currentRow           # To current row
    )
    

    总额:

    (df1.withColumn("sum", sum("amount").over(w))
        .orderBy("date")   # Sort for easy inspection. Not necessary
        .show())
    

    结果:

    +----------+---+-----+------+---+      
    |      date| id|group|amount|sum|
    +----------+---+-----+------+---+
    |2018-04-03|id1|    1|    50| 50|
    |2018-04-04|id2|    1|    40| 90|
    |2018-04-05|id1|    1|    50|140|
    |2018-04-06|id3|    2|    55| 55|
    |2018-04-07|id2|    1|    20|160|
    +----------+---+-----+------+---+