
Pandas Groupby and cumsum with multiple conditions and columns - Python

  • agustin Vraj Patel  · Tech Community  · 4 years ago

    I have the following DataFrame:

    import pandas as pd
    
    hits = {'id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
            'datetime': ['2010-01-02 03:00:00','2010-01-02 03:00:14','2010-01-02 03:00:35','2010-01-02 03:00:38',
                        '2010-01-02 03:29:10','2010-01-02 03:29:35','2010-01-02 03:45:20','2010-01-02 06:10:05',
                        '2010-01-02 06:10:15','2010-01-02 07:40:15','2010-01-02 07:40:20','2010-01-02 07:40:25'],
            'uri_len': [10,20,25,15,20,10,20,25,15,30,40,45]
           }
    
    df = pd.DataFrame(hits, columns = ['id', 'datetime','uri_len'])
    
    df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')
    
    print(df)
    
       id            datetime  uri_len
    0   A 2010-01-02 03:00:00       10
    1   A 2010-01-02 03:00:14       20
    2   A 2010-01-02 03:00:35       25
    3   A 2010-01-02 03:00:38       15
    4   A 2010-01-02 03:29:10       20
    5   A 2010-01-02 03:29:35       10
    6   B 2010-01-02 03:45:20       20
    7   B 2010-01-02 06:10:05       25
    8   B 2010-01-02 06:10:15       15
    9   C 2010-01-02 07:40:15       30
    10  C 2010-01-02 07:40:20       40
    11  C 2010-01-02 07:40:25       45
    

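    I want to group the hits into sessions, using id as the grouping variable. For me, a new session starts after an inactivity period of more than 15 seconds (based on the datetime column) or when uri_len decreases; in both cases, consecutive hits are compared.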
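
    To make the rule concrete, here is a minimal plain-Python sketch that walks through the hits of a single id. The names `assign_sessions` and `max_gap` are hypothetical and used only for illustration. It assumes that a hit which triggers both conditions at once still opens only one new session, which is how the desired output further down treats the last hit of id A.

```python
from datetime import datetime, timedelta


def assign_sessions(hits, max_gap=timedelta(seconds=15)):
    """Return a session number for each hit of ONE id.

    hits: list of (timestamp, uri_len) tuples in chronological order.
    (Hypothetical helper, for illustration only.)
    """
    sessions = []
    current = 0
    for i, (ts, length) in enumerate(hits):
        if i > 0:
            prev_ts, prev_len = hits[i - 1]
            # A new session starts on a long gap OR a drop in uri_len.
            # Assumption: if both happen on the same hit, count it once.
            if ts - prev_ts > max_gap or length < prev_len:
                current += 1
        sessions.append(current)
    return sessions


# The six hits of id 'A' from the DataFrame above
fmt = '%Y-%m-%d %H:%M:%S'
hits_a = [
    (datetime.strptime('2010-01-02 03:00:00', fmt), 10),
    (datetime.strptime('2010-01-02 03:00:14', fmt), 20),
    (datetime.strptime('2010-01-02 03:00:35', fmt), 25),
    (datetime.strptime('2010-01-02 03:00:38', fmt), 15),
    (datetime.strptime('2010-01-02 03:29:10', fmt), 20),
    (datetime.strptime('2010-01-02 03:29:35', fmt), 10),
]

print(assign_sessions(hits_a))  # [0, 0, 1, 2, 3, 4]
```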

    I know how to group by each condition separately:

    df['session1'] = (df.groupby('id')['datetime']
                   .transform(lambda x: x.diff().gt('15Sec').cumsum())
                  )
    
    df['session2'] = (df.groupby('id')['uri_len']
                   .transform(lambda x: x.diff().lt(0).cumsum())
                  ) 
    

    Is there a way to combine these two transformations in a single statement, so that the output directly looks like this?

       id            datetime  uri_len  session
    0   A 2010-01-02 03:00:00       10        0
    1   A 2010-01-02 03:00:14       20        0
    2   A 2010-01-02 03:00:35       25        1
    3   A 2010-01-02 03:00:38       15        2
    4   A 2010-01-02 03:29:10       20        3
    5   A 2010-01-02 03:29:35       10        4
    6   B 2010-01-02 03:45:20       20        0
    7   B 2010-01-02 06:10:05       25        1
    8   B 2010-01-02 06:10:15       15        2
    9   C 2010-01-02 07:40:15       30        0
    10  C 2010-01-02 07:40:20       40        0
    11  C 2010-01-02 07:40:25       45        0
    
    1 reply  |  as of 4 years ago
  •   Nikolaos Efthymiou    4 years ago

    If I understand correctly, you want to add them together?

    df['session'] = (df.groupby('id')['datetime']
                       .transform(lambda x: x.diff().gt('15Sec').cumsum())
                     + df.groupby('id')['uri_len']
                         .transform(lambda x: x.diff().lt(0).cumsum()))
    

    A cleaner way:

    s1 = (df.groupby('id')['datetime']
            .transform(lambda x: x.diff().gt('15Sec').cumsum()))
    
    s2 = (df.groupby('id')['uri_len']
            .transform(lambda x: x.diff().lt(0).cumsum()))
    
    df['session'] = s1 + s2
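
    One caveat with adding the two counters: when both conditions fire on the same row, that row is counted twice. In the sample data this happens at row 5 (a 25-second gap and uri_len dropping from 20 to 10), so the sum gives 5 there, while the desired output in the question shows 4. A possible variant, sketched below under the assumption that such a row should open only one new session, is to combine the two conditions with a logical OR first and take a single cumulative sum per id. Using `pd.Timedelta(seconds=15)` instead of the string '15Sec' is a stylistic choice here, not a required change.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'id': list('AAAAAABBBCCC'),
    'datetime': pd.to_datetime([
        '2010-01-02 03:00:00', '2010-01-02 03:00:14', '2010-01-02 03:00:35',
        '2010-01-02 03:00:38', '2010-01-02 03:29:10', '2010-01-02 03:29:35',
        '2010-01-02 03:45:20', '2010-01-02 06:10:05', '2010-01-02 06:10:15',
        '2010-01-02 07:40:15', '2010-01-02 07:40:20', '2010-01-02 07:40:25',
    ]),
    'uri_len': [10, 20, 25, 15, 20, 10, 20, 25, 15, 30, 40, 45],
})

# Differences between consecutive hits within each id
# (the first hit of each id yields NaT / NaN, which compare as False)
gap = df.groupby('id')['datetime'].diff()
delta = df.groupby('id')['uri_len'].diff()

# True where a new session starts: long gap OR a decrease in uri_len
new_session = gap.gt(pd.Timedelta(seconds=15)) | delta.lt(0)

# Running count of session starts, restarted for every id
df['session'] = new_session.astype(int).groupby(df['id']).cumsum()

print(df['session'].tolist())
# [0, 0, 1, 2, 3, 4, 0, 1, 2, 0, 0, 0]
```

    If counting both conditions separately is actually what you want, the s1 + s2 version above is the one to keep.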