
Dropping rows in pandas based on a more complex condition

  •  0
  •  whitefang1993 · 5 years ago

    I have the following dataframe:

    time        id  type
    2012-12-19  1   abcF1
    2013-11-02  1   xF1yz
    2012-12-19  1   abcF1
    2012-12-18  1   abcF1
    2013-11-02  1   xF1yz
    2006-07-07  5   F5spo
    2006-07-06  5   F5spo
    2005-07-07  5   F5abc
    

    For each id, I need to look at the row with the latest date and check its type.

    If a row's type differs from the type at the max date for its id, every such row for that id must be dropped.

    time        id  type
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    2013-11-02  1   xF1yz
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    2013-11-02  1   xF1yz
    2006-07-07  5   F5spo
    2006-07-06  5   F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
    <deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>
    

    How can I achieve this?

        1
  •  3
  •   jezrael    5 years ago

    Use DataFrameGroupBy.idxmax to locate the row with the max time per id, select its id and type columns, and inner-join back to the original with DataFrame.merge:

    df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
    print (df)
            time  id   type
    0 2013-11-02   1  xF1yz
    1 2013-11-02   1  xF1yz
    2 2006-07-07   5  F5spo
    3 2006-07-06   5  F5spo
    

    Or use DataFrame.sort_values with DataFrame.drop_duplicates:

    df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])
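    Both variants can be checked end-to-end; a minimal sketch using the question's data (the `keys` name is illustrative):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Variant 1: idxmax picks the row index of the latest time per id;
# merging on the shared ['id', 'type'] columns keeps matching rows only
keys = df.loc[df.groupby('id')['time'].idxmax(), ['id', 'type']]
out1 = df.merge(keys)

# Variant 2: after an ascending sort, the last row per id is the max date
out2 = df.merge(df.sort_values('time')
                  .drop_duplicates('id', keep='last')[['id', 'type']])

print(out1.equals(out2))
```

    Both merges are plain inner joins on the shared id/type columns, so rows whose type differs from the max-date type simply find no partner and fall away.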
    
        2
  •  1
  •   DYZ    5 years ago

    You can sort the dataframe by time, then group by id and take the last row of each group. That is the row with the max date.

    last_rows = df.sort_values('time').groupby('id').last().reset_index()  # reset_index turns 'id' back into a column for the merge
    

    Then merge the original dataframe with the new one:

    result = df.merge(last_rows, on=["id", "type"])
    #       time_x  id   type      time_y
    #0  2013-11-02   1  xF1yz  2013-11-02
    #1  2013-11-02   1  xF1yz  2013-11-02
    #2  2006-07-07   5  F5spo  2006-07-07
    #3  2006-07-06   5  F5spo  2006-07-07
    

    If needed, drop the duplicated time column at the end:

    result.drop('time_y', axis=1, inplace=True)
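    If the duplicate time column is never needed, a variant of the same idea merges only the id/type pairs up front (a sketch with the question's data; `last_pairs` is an illustrative name):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Last (max-date) row per id after sorting by time; reset_index turns
# the 'id' group key back into a column, then keep only the key pair
last_pairs = (df.sort_values('time')
                .groupby('id')
                .last()
                .reset_index()[['id', 'type']])

# Inner merge on both columns drops every row whose type differs
result = df.merge(last_pairs, on=['id', 'type'])
print(result)
```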
    
        3
  •  1
  •   Chris Adams    5 years ago

    Create a Series using set_index and groupby with transform and idxmax. Then use boolean indexing:

    # If neccessary cast to datetime dtype
    # df['time'] = pd.to_datetime(df['time'])
    
    s = df.set_index('type').groupby('id')['time'].transform('idxmax')
    df[df.type == s.values]
    

    [out]

            time  id   type
    1 2013-11-02   1  xF1yz
    4 2013-11-02   1  xF1yz
    5 2006-07-07   5  F5spo
    6 2006-07-06   5  F5spo
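    To see why the comparison works, it can help to inspect the intermediate Series: with type as the index, transform('idxmax') broadcasts the index label of each group's max-date row, i.e. its type, to every member of the group. A sketch with the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# With 'type' as the index, idxmax yields the *type label* of each
# group's max-date row, and transform broadcasts it to every row
s = df.set_index('type').groupby('id')['time'].transform('idxmax')
print(s.tolist())   # the max-date type, repeated once per group member

# rows whose own type matches their group's max-date type survive
out = df[df.type == s.values]
```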
    
        4
  •  0
  •   zero    5 years ago
    import pandas as pd
    
    df = pd.DataFrame({
        'time': ['2012-12-19', '2013-11-02', '2012-12-19', '2012-12-18', '2013-11-02', '2006-07-07', '2006-07-06', '2005-07-07'],
        'id': [1,1,1,1,1,5,5,5],
        'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz', 'F5spo', 'F5spo', 'F5abc']
    })
    
    df['time'] = pd.to_datetime(df['time'])
    def remove_non_max_date_ids(df):
        # type of the row holding the max date within this id group
        max_type = df.loc[df['time'].idxmax()]['type']
        # keep only the rows whose type matches it
        return df[
            df['type'] == max_type
        ]
    
    df.groupby('id').apply(remove_non_max_date_ids)
    
    

    This creates a helper function that filters out the rows whose type differs from the type at the max date, applied per id group.
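    The same apply-based idea can be sketched end-to-end on the question's data; passing group_keys=False keeps the original flat index instead of prepending the id group level (the helper name is illustrative):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Per group: find the type at the max date and keep matching rows
def keep_max_date_type(g):
    max_type = g.loc[g['time'].idxmax(), 'type']
    return g[g['type'] == max_type]

# group_keys=False leaves the result with the original row index
out = df.groupby('id', group_keys=False).apply(keep_max_date_type)
print(out)
```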

        5
  •  0
  •   Umar Sahid    5 years ago

    Another option is duplicated:

    import pandas as pd
    
    # if needed
    df['time'] = pd.to_datetime(df['time'])
    
    # sort values of id and time ascendingly, and tagged the duplicates
    df = df.sort_values(by=['id','time'], ascending=[True,True])
    df['time_max'] = df.duplicated(subset=['id'], keep='last')
    # keep the max value only
    df2 = df.loc[~df['time_max'],['id','type']].rename(columns={'type':'type_max'}).copy()
    
    # merge with the original df
    df = pd.merge(df, df2, on=['id'], how='left')
    # get the result
    df['for_drop'] = df['type']==df['type_max']
    df = df.loc[df['for_drop'],:]
    

    [out]:

    df
        time        id  type    time_max    type_max    for_drop
    3   2013-11-02  1   xF1yz   True          xF1yz       True
    4   2013-11-02  1   xF1yz   False         xF1yz       True
    6   2006-07-06  5   F5spo   True          F5spo       True
    7   2006-07-07  5   F5spo   False         F5spo       True
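    The same duplicated-based idea can be condensed so the helper columns are dropped at the end; a sketch on the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# After an ascending sort, the single non-duplicated row per id
# (keep='last') is the max-date row; grab its id/type pair
df = df.sort_values(['id', 'time'])
max_type = df.loc[~df.duplicated('id', keep='last'), ['id', 'type']]

# Broadcast the pair back onto every row of the same id, keep only
# matching rows, then drop the helper column again
out = (df.merge(max_type.rename(columns={'type': 'type_max'}), on='id')
         .query('type == type_max')
         .drop(columns='type_max'))
print(out)
```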