
Python Pandas: Categorize/bin by numeric groups with zero values

  •  2
  • jeangelj  · Tech Community  · 7 years ago

    I'm not sure this is the most efficient approach, but I'm trying to group customer spend into bins/buckets.

    This is the df I'm working with:

    df.head()
    
    Best_ID_S| Dollar
    abc2464    0.00 
    fdhg357    672.00  
    hjg5235    250.00 
    mjhur57    199.00 
    erew3452   116.25 
    

    bins = [0,250,500,750,1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,6000,6500,7000,8000,1000000000000]
    #I didn't know how to create 8000+ so I just added a crazy number in the end, it works
    
    group_names = ['0-250','251-500','501-749','750-999','1000-1499','1500-1999','2000-2499','2500-2999','3000-3499','3500-3999','4000-4499','4500-4999','5000-5499','5500-5999','6000-6499','6500-6999','7000-7499','8000+']
    
    categories = pd.cut(df['Dollar'], bins, labels=group_names)
    df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names)
    df['Buckets'] = pd.cut(df['Dollar'], bins)
    

    This is what I get when I run df.head(). The row with a Dollar value of 0.00 ends up as NaN:

    Best_ID_S| Dollar | Category |  Buckets
    abc2464    0.00     NaN
    fdhg357    672.00   501-749        (500, 750]
    hjg5235    250.00   0-250          (0, 250]
    mjhur57    199.00   0-250          (0, 250]
    erew3452   116.25   0-250          (0, 250]
    

    2 answers  |  up to 7 years ago
        1
  •  5
  •   Bharath M Shetty    7 years ago

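    The right parameter defaults to True, so each bin is open on the left and closed on the right. In interval notation, ( means the left edge is excluded, which is why 0 falls outside (0, 250] and comes back as NaN. You need [ on the left edge, which right=False gives you: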

    df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names,right=False)
    df['Buckets'] = pd.cut(df['Dollar'], bins,right=False)
    
     Best_ID_S|  Dollar Category     Buckets
    0    abc2464    0.00    0-250    [0, 250)
    1    fdhg357  672.00  501-749  [500, 750)
    2    hjg5235  250.00  251-500  [250, 500)
    3    mjhur57  199.00    0-250    [0, 250)
    4   erew3452  116.25    0-250    [0, 250)
    

    To keep the bins inclusive on the right, you can instead set include_lowest=True and leave the right parameter at its default of True.
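
    As a rough side-by-side of the two options, the sketch below runs the question's five sample rows through pd.cut three ways. The shortened bin list and the variable names (default_cut, left_closed, lowest_included) are assumptions made for illustration and are not part of the original post.

```python
import pandas as pd

# Sample rows from the question; the shortened bin list is an assumption for brevity.
df = pd.DataFrame({
    "Best_ID_S": ["abc2464", "fdhg357", "hjg5235", "mjhur57", "erew3452"],
    "Dollar": [0.00, 672.00, 250.00, 199.00, 116.25],
})
bins = [0, 250, 500, 750, 1000]
labels = ["0-250", "251-500", "501-749", "750-999"]

# Default (right=True): intervals are (a, b], so 0 is outside (0, 250] and becomes NaN.
default_cut = pd.cut(df["Dollar"], bins, labels=labels)

# right=False: intervals are [a, b), so 0 lands in the first bin,
# but 250 now moves into the second bin.
left_closed = pd.cut(df["Dollar"], bins, labels=labels, right=False)

# include_lowest=True: keeps (a, b] and additionally closes the first interval on the left,
# so both 0 and 250 land in the first bin.
lowest_included = pd.cut(df["Dollar"], bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "Dollar": df["Dollar"],
    "default": default_cut,
    "right=False": left_closed,
    "include_lowest": lowest_included,
}))
```

    Which one fits depends on where a value of exactly 250 should go: right=False puts it in 251-500, while include_lowest=True keeps it in 0-250, which matches the label names in the question more closely.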

        2
  •  1
  •   Vaishali    7 years ago

    To create the 8000+ bin, you can use np.inf as the last bin edge instead of an arbitrarily large number:

    bins = [0,250,500,750,1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,6000,6500,7000,8000,np.inf]
    

    df['Category'] = pd.cut(df['Dollar'], bins, labels=group_names, include_lowest=True)
    df['Buckets'] = pd.cut(df['Dollar'], bins, include_lowest=True)
    

        Best_ID_S   Dollar  Category    Buckets
    0   abc2464     0.00    0-250   [0, 250]
    1   fdhg357     672.00  501-749 (500, 750]
    2   hjg5235     250.00  0-250   [0, 250]
    3   mjhur57     199.00  0-250   [0, 250]
    4   erew3452    116.25  0-250   [0, 250]
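
    To see how the open-ended last bin behaves around the 8000 boundary, here is a minimal sketch. The spend values, the three-bin edge list, and the labels are hypothetical and chosen only to exercise the edges; they are not taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical spend values chosen to hit both ends of the range.
dollars = pd.Series([0.0, 7999.99, 8000.0, 8000.01, 2_500_000.0])

# Shortened bin edges (an assumption for brevity); np.inf leaves the last bin open-ended.
bins = [0, 4000, 8000, np.inf]
labels = ["0-4000", "4001-8000", "8000+"]

# include_lowest=True keeps 0 in the first bin; all other intervals stay (a, b].
buckets = pd.cut(dollars, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"Dollar": dollars, "Bucket": buckets}))
```

    Note that with the default right=True, a value of exactly 8000 still belongs to the bin that ends at 8000; only values strictly greater than 8000 reach the last bin, so a label such as "8000+" is slightly loose.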