
Return the index using pandas Series.sample()?

  •  8
  • andraiamatrix  · Tech Community  · 7 years ago

    index    
    row1    user1
    row2    user2
    row3    user2
    row4    user1
    row5    user2
    row6    user1
    row7    user3
    ...
    

    def get_random_sample(series, sample_size, users):
        """ Grab a random sample of size sample_size of the tickets resolved by each user in the list users.
            Series has the ticket number as index, and the username as the series values.
            Returns a dict {user:[sample_tickets]}
        """
        sample_dict = {}
        for user in users:
            sample_dict[user] = series[series==user].sample(n=sample_size, replace=False)

        return sample_dict
    

    # assuming sample_size is 4
    {user1: [user1, user1, user1, user1],
     user2: [user2, user2, user2, user2],
    ...}
    

    But what I want to get from my output is:

    {user1: [row1, row6, row32, row40],
     user2: [row3, row5, row17, row39],
    ...}
    # where row# is the index label for the corresponding row.
    

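    Basically, I want pandas Series.sample() to return the index labels of the randomly sampled items instead of the item values. I'm not sure whether this is possible, or whether I'd be better off restructuring my data first (maybe making the users the names of the series in a DataFrame, with the index becoming the values under each series? But I'm not sure how to do that). Any insight would be appreciated.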

    2 answers  |  as of 7 years ago
        1
  •  4
  •   Ameb    5 years ago

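    As @user48956 commented on the accepted answer, you can sample the index directly with numpy.random.choice, which was much faster in the timings below: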

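    import numpy as np
    import pandas as pd

    # %time is an IPython/Jupyter magic; run this in a notebook or IPython shell
    np.random.seed(42)
    df = pd.DataFrame(np.random.randint(0,100,size=(10000000, 4)), columns=list('ABCD'))
    %time df.sample(100000).index
    print(_)
    %time pd.Index(np.random.choice(df.index, 100000))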
    
    Wall time: 710 ms
    Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
                4394530, 8864037, 6096638,
                ...
                 471501, 3616956, 9397742, 6896140,  670892, 9546169, 4146996,
                3465455, 7748682, 5271367],
               dtype='int64', length=100000)
    Wall time: 6.05 ms
    
    Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
                4394530, 8864037, 6096638,
                ...
                 471501, 3616956, 9397742, 6896140,  670892, 9546169, 4146996,
                3465455, 7748682, 5271367],
               dtype='int64', length=100000)
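
One caveat worth noting about the comparison above: numpy.random.choice samples with replacement by default, while DataFrame.sample does not, so the two calls are not strictly equivalent. The following is a minimal sketch, not part of the original answer, showing how to request unique labels. The small frame and the names `rng`, `with_replacement` and `without_replacement` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the 10M-row benchmark above)
df = pd.DataFrame(np.arange(4000).reshape(1000, 4), columns=list("ABCD"))

# Default behaviour: replace=True, so the same label may be drawn more than once
with_replacement = pd.Index(np.random.choice(df.index, 100))

# replace=False guarantees each label appears at most once, like df.sample()
rng = np.random.default_rng(42)
without_replacement = pd.Index(rng.choice(df.index, size=100, replace=False))

print(with_replacement.is_unique)     # may be True or False
print(without_replacement.is_unique)  # always unique by construction
```

Sampling without replacement does more work than the default call, so the speed advantage shown in the timings may shrink; it is worth re-measuring on your own data.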
    
        2
  •  3
  •   Scott Boston    7 years ago

    Use .index after sampling to return the index of those samples:

    sample_dict[user] = series[series==user].sample(n=sample_size, replace=False).index
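
For completeness, here is a runnable sketch of the question's function with the `.index` fix applied. The toy series, the `.tolist()` conversion (to get plain lists as in the desired output), and the optional `random_state` parameter are assumptions added for illustration; they are not part of the original post.

```python
import pandas as pd


def get_random_sample(series, sample_size, users, random_state=None):
    """Return {user: [index labels]} for a random sample of each user's rows.

    `random_state` is an illustrative addition for reproducible sampling.
    """
    sample_dict = {}
    for user in users:
        tickets = series[series == user]  # rows belonging to this user
        sampled = tickets.sample(n=sample_size, replace=False,
                                 random_state=random_state)
        sample_dict[user] = sampled.index.tolist()  # labels, not values
    return sample_dict


# Toy data mirroring the layout in the question (hypothetical values)
series = pd.Series(
    ["user1", "user2", "user2", "user1", "user2", "user1", "user3"],
    index=["row1", "row2", "row3", "row4", "row5", "row6", "row7"],
)

result = get_random_sample(series, 2, ["user1", "user2"], random_state=0)
print(result)
```

Note that with `replace=False`, `sample` raises a ValueError if a user has fewer than `sample_size` rows, so callers may want to cap `n` at `len(tickets)` for sparse users.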