代码之家  ›  专栏  ›  技术社区  ›  honeymoon

panda获取大小范围内的行的子集

  •  0
  • honeymoon  · 技术社区  · 2 年前

    我有一个熊猫数据帧:

    df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                    'end': [51, 101, 50001, 50031, 100001],
                    'value': [1, 2, 3, 4, 5]},
                   index=['id1', 'id2', 'id3', 'id4', 'id5'])
    
    >>> df
                start      end      value
     id1           50      51         1
     id2           100     101        2
     id3           50000   50001      3
     id4           50030   50031      4
     id5           100000  100001     5
    

    现在,我想提取大小范围内的所有行的组 150 在“开始”列中。输出应该如下所示:

    group    group_start   group_end       min_val      max_value   id_count
      1         50           101             1              2         2
      2         50000        50031           3              4         2
      3         100000       100001          5              5         1
    

    如何提取这些组?

    1 回复  |  直到 2 年前
        1
  •  2
  •   keramat    2 年前

    使用:

    start = df['start'].iloc[0]
    g = 0
    gs = []
    for val in df['start']:
        if val-start<150:
            gs.append(g)
        else:
            g+=1
            start = val
            gs.append(g)
            
    df['g'] = gs
    df.groupby('g').agg(group_start = ('start', 'first'), group_end = ('end', 'last'), min_val = ('value', 'min'), max_value = ('value', 'max'), id_count = ('value', 'count'))
    

    输出:

    enter image description here

    根据评论:

    df.groupby('g').agg(group_start = ('start', 'first'), group_end = ('end', 'last'), min_val = ('value', 'min'), max_value = ('value', 'max'), id_count = ('value', 'idxmax'))
    
        2
  •  2
  •   jezrael    2 年前

    对于范围 150 可以使用整数除以 150 具有 factorize 对于起始于的组 1 然后通过命名聚合聚合列,最后添加列 group :

    a = pd.factorize(df['start'] // 150)[0] + 1
    df = (df.groupby(a).agg(group_start = ('start','first'),
                                            group_end = ('end','last'),
                                            min_val = ('value','min'),
                                            max_val = ('value','max'),
                                            id_count=('value','size'))
                                        .rename_axis('group')
                                        .reset_index())
    
    print (df)
       group  group_start  group_end  min_val  max_val  id_count
    0      1           50        101        1        2         2
    1      2        50000      50031        3        4         2
    2      3       100000     100001        5        5         1
    

    如果 150 是从值中减去的第一个值 start 使用:

    a = pd.factorize(df['start'].sub(df['start'].iat[0]) // 150)[0] + 1
    df = (df.groupby(a).agg(group_start = ('start','first'),
                                            group_end = ('end','last'),
                                            min_val = ('value','min'),
                                            max_val = ('value','max'),
                                            id_count=('value','size'))
                                        .rename_axis('group')
                                        .reset_index())
    

    编辑:对于最大值索引 value 列的使用 DataFrameGroupBy.idxmax :

    a = pd.factorize(df['start'] // 150)[0] + 1
    df = (df.groupby(a).agg(group_start = ('start','first'),
                                            group_end = ('end','last'),
                                            min_val = ('value','min'),
                                            max_val = ('value','max'),
                                            id_count=('value','size'),
                                            indicec_by_max_val=('value','idxmax'))
                                        .rename_axis('group')
                                        .reset_index())
    
    print (df)
       group  group_start  group_end  min_val  max_val  id_count  \
    0      1           50        101        1        2         2   
    1      2        50000      50031        3        4         2   
    2      3       100000     100001        5        5         1   
    
      indicec_by_max_val  
    0                id2  
    1                id4  
    2                id5  
    
        3
  •  1
  •   Patrick Artner    2 年前

    这将使用中的实际值 df["start"] 对于 装箱-边界 并且不使用固定的装箱到0…149150.299,比如 jezrael's answer

    它遵循了一种非常手动的(因此可能是次优的)方法。。。不确定它如何适用于更大的数据集,但它是这样的:

    import pandas as pd 
    
    df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                    'end': [51, 101, 50001, 50031, 100001],
                    'value': [1, 2, 3, 4, 5]},
                index=['id1', 'id2', 'id3', 'id4', 'id5'])
    
    gcol = "group,group_start,group_end,min_val,max_value,id_count".split(",")
    grpd = pd.DataFrame(columns=gcol)
    
    last_end = None   # to skip already included values
    groupnr = 1       # simple numbering
    
    # df should be sorted for "start" like your demo data is
    for value in df.start:
        # skip row if already included in last range
        if value < (last_end or value): 
            print("Skipped", value)
            continue
    
        # get all values in range 
        m = df[ df["start"].isin(range(-150+value,value+151))]
    
        # get stuff for grouped df
        mi = m.start.min()
        ma = m.end.max()
        last_end = ma
        mi_i = m[m.start == mi].value[0]
        ma_i = m[m.end == ma].value[0]
        l = len(m)
    
        # add to grp'ed df
        grpd = grpd.append(dict(zip(gcol, [groupnr, mi,ma,mi_i,ma_i,l])), 
                           ignore_index=True)
        # and one more
        groupnr += 1
    
    # output shenanigans
    pd.set_option('display.max_columns', 10)
    pd.set_option('display.width', 10000)
    print(grpd)
    

    输出:

    Skipped 100      # was already used up in 50-200
    Skipped 50030    # was already used up in 50000-50150
    
          group group_start group_end min_val max_value id_count
    0     1     50          101       1       2         2
    1     2     50000       50031     3       4         2
    2     3     100000      100001    5       5         1