
Dropping rows in pandas based on a more complex condition

  •  0
  •  whitefang1993 · 5 years ago

    I have the following dataframe:

    time        id  type
    2012-12-19  1   abcF1
    2013-11-02  1   xF1yz
    2012-12-19  1   abcF1
    2012-12-18  1   abcF1
    2013-11-02  1   xF1yz
    2006-07-07  5   F5spo
    2006-07-06  5   F5spo
    2005-07-07  5   F5abc
    

    For each id, I need to look at the row with the latest date and check its type.

    If a row's type differs from the type at the max date for its id, every such row for that id must be dropped.

    time        id  type
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    2013-11-02  1   xF1yz
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    <deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
    2013-11-02  1   xF1yz
    2006-07-07  5   F5spo
    2006-07-06  5   F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
    <deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>
    

    How can I achieve this?

        1
  •  3
  •   jezrael    5 years ago

    Use DataFrameGroupBy.idxmax to locate the row with the max time per id, select its id and type columns, and inner-join back to the original with DataFrame.merge:

    df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
    print (df)
            time  id   type
    0 2013-11-02   1  xF1yz
    1 2013-11-02   1  xF1yz
    2 2006-07-07   5  F5spo
    3 2006-07-06   5  F5spo
    

    Or use DataFrame.sort_values with DataFrame.drop_duplicates:

    df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])
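    Both variants can be checked end-to-end; a minimal sketch using the question's data (the `keys` name is illustrative):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Variant 1: idxmax picks the row index of the latest time per id;
# merging on the shared ['id', 'type'] columns keeps matching rows only
keys = df.loc[df.groupby('id')['time'].idxmax(), ['id', 'type']]
out1 = df.merge(keys)

# Variant 2: after an ascending sort, the last row per id is the max date
out2 = df.merge(df.sort_values('time')
                  .drop_duplicates('id', keep='last')[['id', 'type']])

print(out1.equals(out2))
```

    Both merges are plain inner joins on the shared id/type columns, so rows whose type differs from the max-date type simply find no partner and fall away.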
    
        2
  •  1
  •   DYZ    5 years ago

    You can sort the dataframe by time, then group by id and take the last row of each group. That is the row with the max date.

    last_rows = df.sort_values('time').groupby('id').last().reset_index()  # reset_index turns 'id' back into a column for the merge
    

    Then merge the original dataframe with the new one:

    result = df.merge(last_rows, on=["id", "type"])
    #       time_x  id   type      time_y
    #0  2013-11-02   1  xF1yz  2013-11-02
    #1  2013-11-02   1  xF1yz  2013-11-02
    #2  2006-07-07   5  F5spo  2006-07-07
    #3  2006-07-06   5  F5spo  2006-07-07
    

    If needed, drop the duplicated time column at the end:

    result.drop('time_y', axis=1, inplace=True)
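    If the duplicate time column is never needed, a variant of the same idea merges only the id/type pairs up front (a sketch with the question's data; `last_pairs` is an illustrative name):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Last (max-date) row per id after sorting by time; reset_index turns
# the 'id' group key back into a column, then keep only the key pair
last_pairs = (df.sort_values('time')
                .groupby('id')
                .last()
                .reset_index()[['id', 'type']])

# Inner merge on both columns drops every row whose type differs
result = df.merge(last_pairs, on=['id', 'type'])
print(result)
```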
    
        3
  •  1
  •   Chris Adams    5 years ago

    Create a Series using set_index and groupby with transform and idxmax. Then use boolean indexing:

    # If neccessary cast to datetime dtype
    # df['time'] = pd.to_datetime(df['time'])
    
    s = df.set_index('type').groupby('id')['time'].transform('idxmax')
    df[df.type == s.values]
    

    [out]

            time  id   type
    1 2013-11-02   1  xF1yz
    4 2013-11-02   1  xF1yz
    5 2006-07-07   5  F5spo
    6 2006-07-06   5  F5spo
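    To see why the comparison works, it can help to inspect the intermediate Series: with type as the index, transform('idxmax') broadcasts the index label of each group's max-date row, i.e. its type, to every member of the group. A sketch with the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# With 'type' as the index, idxmax yields the *type label* of each
# group's max-date row, and transform broadcasts it to every row
s = df.set_index('type').groupby('id')['time'].transform('idxmax')
print(s.tolist())   # the max-date type, repeated once per group member

# rows whose own type matches their group's max-date type survive
out = df[df.type == s.values]
```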
    
        4
  •  0
  •   zero    5 years ago
    import pandas as pd
    
    df = pd.DataFrame({
        'time': ['2012-12-19', '2013-11-02', '2012-12-19', '2012-12-18', '2013-11-02', '2006-07-07', '2006-07-06', '2005-07-07'],
        'id': [1,1,1,1,1,5,5,5],
        'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz', 'F5spo', 'F5spo', 'F5abc']
    })
    
    df['time'] = pd.to_datetime(df['time'])
    def remove_non_max_date_ids(df):
        # type of the row holding the max date within this id group
        max_type = df.loc[df['time'].idxmax()]['type']
        # keep only the rows whose type matches it
        return df[
            df['type'] == max_type
        ]
    
    df.groupby('id').apply(remove_non_max_date_ids)
    
    

    This creates a helper function that filters out the rows whose type differs from the type at the max date, applied per id group.
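    The same apply-based idea can be sketched end-to-end on the question's data; passing group_keys=False keeps the original flat index instead of prepending the id group level (the helper name is illustrative):

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# Per group: find the type at the max date and keep matching rows
def keep_max_date_type(g):
    max_type = g.loc[g['time'].idxmax(), 'type']
    return g[g['type'] == max_type]

# group_keys=False leaves the result with the original row index
out = df.groupby('id', group_keys=False).apply(keep_max_date_type)
print(out)
```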

        5
  •  0
  •   Umar Sahid    5 years ago

    Another option is duplicated:

    import pandas as pd
    
    # if needed
    df['time'] = pd.to_datetime(df['time'])
    
    # sort values of id and time ascendingly, and tagged the duplicates
    df = df.sort_values(by=['id','time'], ascending=[True,True])
    df['time_max'] = df.duplicated(subset=['id'], keep='last')
    # keep the max value only
    df2 = df.loc[~df['time_max'],['id','type']].rename(columns={'type':'type_max'}).copy()
    
    # merge with the original df
    df = pd.merge(df, df2, on=['id'], how='left')
    # get the result
    df['for_drop'] = df['type']==df['type_max']
    df = df.loc[df['for_drop'],:]
    

    [out]:

    df
        time        id  type    time_max    type_max    for_drop
    3   2013-11-02  1   xF1yz   True          xF1yz       True
    4   2013-11-02  1   xF1yz   False         xF1yz       True
    6   2006-07-06  5   F5spo   True          F5spo       True
    7   2006-07-07  5   F5spo   False         F5spo       True
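    The same duplicated-based idea can be condensed so the helper columns are dropped at the end; a sketch on the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-12-19', '2013-11-02', '2012-12-19',
                            '2012-12-18', '2013-11-02', '2006-07-07',
                            '2006-07-06', '2005-07-07']),
    'id': [1, 1, 1, 1, 1, 5, 5, 5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz',
             'F5spo', 'F5spo', 'F5abc'],
})

# After an ascending sort, the single non-duplicated row per id
# (keep='last') is the max-date row; grab its id/type pair
df = df.sort_values(['id', 'time'])
max_type = df.loc[~df.duplicated('id', keep='last'), ['id', 'type']]

# Broadcast the pair back onto every row of the same id, keep only
# matching rows, then drop the helper column again
out = (df.merge(max_type.rename(columns={'type': 'type_max'}), on='id')
         .query('type == type_max')
         .drop(columns='type_max'))
print(out)
```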