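I'm trying to use pandas to solve a problem I have already solved in pure Python, but I don't know the best practices for DataFrame groupby.
For each post code, I want to select the most prescribed drug and its share of the total quantity of all drugs in that post code.
If two drugs have the same prescribed quantity, I want to take the one that comes first alphabetically: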
import pandas as pd
drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
'Quantity': [3, 6, 5, 7, 7, 8]})
#   Drug PostCode Quantity
# 0 D1 P1 3
# 1 D2 P1 6
# 2 D1 P1 5
# 3 D2 P2 7
# 4 D1 P2 7
# 5 D2 P3 8
# This should be the RESULT:
# post code, drug with highest quantity, percentage of all drugs per post code
# (post code P2 has two drugs with the same quantity, so the alphabetically first one is selected)
# [('P1', 'D1', 0.57),
# ('P2', 'D1', 0.50),
# ('P3', 'D2', 1)]
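For reference, here is a minimal pure-Python sketch that reproduces the expected result above. It is only an illustration of the rule (highest quantity, ties broken alphabetically), not my original implementation; the names `rows`, `totals` and `expected` are made up for this example.

```python
from collections import defaultdict

# Same sample data as above, as plain (post_code, drug, quantity) tuples.
rows = [('P1', 'D1', 3), ('P1', 'D2', 6), ('P1', 'D1', 5),
        ('P2', 'D2', 7), ('P2', 'D1', 7), ('P3', 'D2', 8)]

# Sum the quantity for every (post code, drug) pair.
totals = defaultdict(lambda: defaultdict(int))
for post_code, drug, quantity in rows:
    totals[post_code][drug] += quantity

expected = []
for post_code in sorted(totals):
    per_drug = totals[post_code]
    # Highest quantity wins; equal quantities fall back to alphabetical drug order.
    top_drug = min(per_drug, key=lambda d: (-per_drug[d], d))
    share = per_drug[top_drug] / sum(per_drug.values())
    expected.append((post_code, top_drug, round(share, 2)))

print(expected)
# [('P1', 'D1', 0.57), ('P2', 'D1', 0.5), ('P3', 'D2', 1.0)]
```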
I have already grouped by post code and drug, but I'm having trouble selecting the rows (applying a lambda).
durg_qualtity_per_post_code = drugs_prescriptions.groupby(['PostCode', 'Drug']).agg('sum')
This is the total quantity of all drugs per post code; I intend to combine it with the previous dataset using apply or transform:
all_by_post_code = drugs_prescriptions.groupby(['PostCode'])['Quantity'].sum()
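One possible way to combine the two, sketched under the assumption that `transform('sum')` is acceptable here: it returns a Series aligned with the rows of the grouped frame, so the per-post-code total can be attached as a column without a separate lookup. The names `per_drug`, `Total` and `Share` are illustrative only.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
                                    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
                                    'Quantity': [3, 6, 5, 7, 7, 8]})

# Quantity per (PostCode, Drug), with the group keys kept as ordinary columns.
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'], as_index=False)['Quantity'].sum()

# transform('sum') repeats each post code's total on every row of that post code.
per_drug['Total'] = per_drug.groupby('PostCode')['Quantity'].transform('sum')
per_drug['Share'] = per_drug['Quantity'] / per_drug['Total']

print(per_drug)
```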
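I don't know how to select, for each post code, the row with the highest drug quantity; if two drugs have the same quantity, the alphabetically first drug should be chosen (D1 for post code P2).
I want to do something like this: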
durg_qualtity_per_post_code[durg_qualtity_per_post_code.apply(lambda x: int(x['Quantity']) == max_items_by_post_code[x['post_code']], axis=1, reduce=True)]
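A sketch of how this selection might be written without a row-wise `apply`, assuming it is fine to sort first and then keep one row per post code. The tie-break comes from including `Drug` as the last sort key; `per_drug`, `ranked` and `top` are illustrative names.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
                                    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
                                    'Quantity': [3, 6, 5, 7, 7, 8]})

# Quantity per (PostCode, Drug), with the group keys kept as ordinary columns.
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'], as_index=False)['Quantity'].sum()

# Within each post code: largest quantity first, equal quantities in alphabetical Drug order.
ranked = per_drug.sort_values(['PostCode', 'Quantity', 'Drug'], ascending=[True, False, True])

# drop_duplicates keeps the first row it sees for each post code, i.e. the top drug.
top = ranked.drop_duplicates('PostCode').reset_index(drop=True)

print(top)
```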
Update:
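# sum quantities per (PostCode, Drug)
df = drugs_prescriptions.groupby(['PostCode', 'Drug']).agg('sum')
# within each PostCode, sort by Quantity (descending), then by Drug (ascending)
# group_keys=False keeps the index as (PostCode, Drug) instead of adding PostCode a second time
df = df.groupby('PostCode', group_keys=False).apply(lambda x: x.sort_values(['Quantity', 'Drug'], ascending=[False, True]))
# select first value by PostCode
# reset index in order to have drug in the output as well
df = df.reset_index(level='Drug')
df = df.groupby('PostCode').first()
# calculate percentage of total by PostCode
# select only Quantity so the string column Drug is not summed as well
allQuantities = drugs_prescriptions.groupby('PostCode')['Quantity'].sum()
# both sides are indexed by PostCode, so the division aligns row by row
df['Quantity'] = df['Quantity'] / allQuantities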
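A more compact variant might use `idxmax` instead of sorting. This is only a sketch: it relies on the grouped sum being sorted by (PostCode, Drug) and on `idxmax` returning the first label when several values tie for the maximum, which together give the alphabetical tie-break. The names `totals`, `top_index`, `post_totals` and `result` are illustrative.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
                                    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
                                    'Quantity': [3, 6, 5, 7, 7, 8]})

# Series indexed by (PostCode, Drug); groupby sorts the keys, so drugs are alphabetical.
totals = drugs_prescriptions.groupby(['PostCode', 'Drug'])['Quantity'].sum()

# For each post code, the (PostCode, Drug) label of the largest quantity.
top_index = totals.groupby(level='PostCode').idxmax()

# Total quantity per post code, used as the denominator.
post_totals = totals.groupby(level='PostCode').sum()

result = [(post_code, drug, round(float(totals[(post_code, drug)] / post_totals[post_code]), 2))
          for post_code, drug in top_index]

print(result)
```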