
Pandas DataFrame groupby with multi-level selection

  •  2
  • user007  · Tech Community  · 6 years ago

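    I am trying to use pandas to solve a problem I previously solved in pure Python, but I don't know the best practices for DataFrame groupby.

    For each post code, I want to select the most prescribed drug together with its share of all drugs prescribed in that post code. If two drugs have the same prescribed quantity, I want to take the one that comes first alphabetically: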

    import pandas as pd
    
    drugs_prescriptions = pd.DataFrame({'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
                        'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
                        'Quantity': [3, 6, 5, 7, 7, 8]})
    
    #   Drug    PostCode    Quantity
    # 0 D1      P1          3
    # 1 D2      P1          6
    # 2 D1      P1          5
    # 3 D2      P2          7
    # 4 D1      P2          7
    # 5 D2      P3          8
    
    #This should be the RESULT:
    # postCode, drug with highest quantity, percentage of all drugs per post code
    # (post code P2 has two drugs with the same quantity, so the alphabetically first one is selected)
    # [('P1', 'D1', 0.57),
    # ('P2', 'D1', 0.50),
    # ('P3', 'D2', 1)]
    

    I have already grouped by post code and drug, but I am having trouble selecting the rows (applying a lambda).

    durg_qualtity_per_post_code = drugs_prescriptions.groupby(['PostCode', 'Drug']).agg('sum')
    


    This is the total quantity of all drugs per post code. I intend to combine it with the previous dataset using apply or transform:

    all_by_post_code = drugs_prescriptions.groupby(['PostCode'])['Quantity'].sum()
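
One way to combine the two groupings without a separate lookup table may be `transform('sum')`, which returns each post code's total aligned to the rows of the per-drug table. The following is a minimal sketch on the sample data from the question; the names `per_drug`, `Total`, and `Share` are illustrative and do not appear in the original post.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# total quantity per (PostCode, Drug), kept as ordinary columns
per_drug = (drugs_prescriptions
            .groupby(['PostCode', 'Drug'], as_index=False)['Quantity']
            .sum())

# transform('sum') broadcasts each post code's total back onto its rows
# ('Total' and 'Share' are hypothetical column names)
per_drug['Total'] = per_drug.groupby('PostCode')['Quantity'].transform('sum')
per_drug['Share'] = per_drug['Quantity'] / per_drug['Total']
```

Because the totals land on the same rows as the per-drug sums, the share is a plain column division and no row-wise lambda is needed.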
    


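    I don't know how to select, for each post code, the row with the largest drug quantity; if two drugs have the same quantity, the alphabetically first drug should be chosen (D1 for post code P2).

    I would like to do something like this: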

    durg_qualtity_per_post_code [durg_qualtity_per_post_code .apply(lambda x: int(x['Quantity']) == max_items_by_post_code[x['post_code']], axis=1, reduce=True)]
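
The selection described above (largest quantity per post code, ties broken by drug name) can probably be expressed without a boolean mask. A minimal sketch, assuming the sample data from the question: sort so that the wanted row comes first within each post code, then keep one row per post code. The names `per_drug`, `ranked`, and `top_drug` are made up for illustration.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# total quantity per (PostCode, Drug)
per_drug = (drugs_prescriptions
            .groupby(['PostCode', 'Drug'], as_index=False)['Quantity']
            .sum())

# within each post code: largest quantity first, ties fall back to drug name A-Z
ranked = per_drug.sort_values(['PostCode', 'Quantity', 'Drug'],
                              ascending=[True, False, True])

# drop_duplicates keeps the first row it sees for each post code
top_drug = ranked.drop_duplicates('PostCode').reset_index(drop=True)
```

Putting the tie-break into the sort keys makes the rule explicit, which is easier to check than relying on the incidental order of the input.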
    

    Update:

    # sort by PostCode, Drug
    df = drugs_prescriptions.groupby(['PostCode', 'Drug']).agg('sum')
    df = df.groupby(['PostCode']).apply(lambda x: x.sort_values(['Quantity', 'Drug'], ascending=[False, True]))
    
    # select first value by PostCode
    # reset index in order to have drug in the output as well
    df.reset_index(level=[1], inplace=True)
    df = df.groupby(['PostCode']).first()
    
    # calculate percentage of total by PostCode
    allQuantities = drugs_prescriptions.groupby(['PostCode']).agg('sum')
    df['Quantity'] = df.apply(lambda row: row['Quantity']/allQuantities.loc[row.name], axis=1)
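
The row-wise `apply` in the last step may not be needed: when two objects share the same `PostCode` index, pandas aligns them label by label on division. Summing the whole frame would also include the string `Drug` column, so the sketch below selects `Quantity` before summing. This is an illustrative variant of the steps above rather than a verified fix of the original snippet, and `top` and `totals` are hypothetical names.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# most prescribed drug per post code, indexed by PostCode
top = (drugs_prescriptions
       .groupby(['PostCode', 'Drug'], as_index=False)['Quantity'].sum()
       .sort_values(['Quantity', 'Drug'], ascending=[False, True])
       .groupby('PostCode').first())

# per-post-code totals; selecting 'Quantity' keeps the string column out of the sum
totals = drugs_prescriptions.groupby('PostCode')['Quantity'].sum()

# both objects are indexed by PostCode, so the division aligns on that label
top['Quantity'] = top['Quantity'] / totals
```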
    
    1 answer  |  as of 6 years ago
  •  2
  •   run-out    6 years ago

    Here is one possible solution, although it feels awkward and unpythonic. It does work, though, and the comments are in the code.

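    # assumption: `df` starts as a copy of the question's drugs_prescriptions frame
    df = drugs_prescriptions.copy()
    
    # make sure 'Quantity' is an integer column (in case it was read as strings)
    df['Quantity'] = df['Quantity'].astype('int')
    
    # create a MultiIndex from the two grouping columns
    df.set_index(['PostCode', 'Drug'], inplace=True)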
    
    # use transform to divide the sum of the 'Drug' level by the 'PostCode' level
    df = df.groupby(level=[0,1]).transform('sum') / df.groupby(level=0).transform('sum')
    
    # move 'Drug' out of the multi index to allow for sorting
    df.reset_index(level=[1], inplace=True)
    
    # sort 'Quantity' in descending order and 'Drug' in ascending order,
    # then the first row of each 'PostCode' group is our result
    df.sort_values(['Quantity','Drug'], ascending=[False, True], inplace=True)
    
    df.groupby('PostCode').first()
    
               Drug Quantity
    PostCode        
    P1          D1  0.571429
    P2          D1  0.500000
    P3          D2  1.000000
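
For comparison, a shorter variant is sketched below. It is not the method used in the answer: it assumes that `groupby` sorts its keys (the default), so the drugs are already in alphabetical order within each post code, and that `idxmax` returns the first label when several values tie. The names `per_drug`, `winners`, `totals`, and `result` are illustrative.

```python
import pandas as pd

drugs_prescriptions = pd.DataFrame({
    'PostCode': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'Drug': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'Quantity': [3, 6, 5, 7, 7, 8],
})

# Series indexed by (PostCode, Drug); keys are sorted, so drugs are A-Z per post code
per_drug = drugs_prescriptions.groupby(['PostCode', 'Drug'])['Quantity'].sum()

# label of the largest value per post code; on ties the first label is returned
winners = per_drug.groupby(level='PostCode').idxmax()

# total quantity per post code
totals = per_drug.groupby(level='PostCode').sum()

# build (post code, drug, share) tuples in the shape the question asks for
result = [(pc, drug, round(float(per_drug[(pc, drug)] / totals[pc]), 2))
          for pc, drug in winners]
```

The list comprehension at the end only reshapes the result into the tuple format from the question; if a DataFrame is acceptable, that step can be left out.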