代码之家  ›  专栏  ›  技术社区  ›  bluesmonk

一个热编码数据帧上的聚合

  •  1
  • bluesmonk  · 技术社区  · 6 年前

    假设我有一个拥有10000部电影的数据集 https://bpaste.net/show/05fa224794e4 ,数据集的ERCT为

    tt0111161   The Shawshank Redemption (1994) 1994     9.2    619479  142 mins.   Crime|Drama
    tt0110912   Pulp Fiction (1994) 1994     9.0    490065  154 mins.   Crime|Thriller
    tt0137523   Fight Club (1999)   1999     8.8    458173  139 mins.   Drama|Mystery|Thriller
    tt0133093   The Matrix (1999)   1999     8.7    448114  136 mins.   Action|Adventure|Sci-Fi
    tt1375666   Inception (2010)    2010     8.9    385149  148 mins.   Action|Adventure|Sci-Fi|Thriller
    tt0109830   Forrest Gump (1994) 1994     8.7    368994  142 mins.   Comedy|Drama|Romance
    tt0169547   American Beauty (1999)  1999     8.6    338332  122 mins.   Drama
    tt0499549   Avatar (2009)   2009     8.1    336855  162 mins.   Action|Adventure|Fantasy|Sci-Fi
    tt0108052   Schindler's List (1993) 1993     8.9    325888  195 mins.   Biography|Drama|History|War
    tt0080684   Star Wars: Episode V - The Empire Strikes Back (1980)   1980     8.8    320105  124 mins.   Action|Adventure|Family|Sci-Fi
    tt0372784   Batman Begins (2005)    2005     8.3    316613  140 mins.   Action|Crime|Drama|Thriller
    tt0114814   The Usual Suspects (1995)   1995     8.7    306624  106 mins.   Crime|Mystery|Thriller
    tt0102926   The Silence of the Lambs (1991) 1991     8.7    293081  118 mins.   Crime|Thriller
    tt0120338   Titanic (1997)  1997     7.4    284245  194 mins.   Adventure|Drama|History|Romance
    

    我有这样的代码要点,加载我的数据集并对其执行一些更改

    import pandas as pd
    import numpy as np
    
    headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
    movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
    movies.head()
    
    one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
    movies = pd.concat([movies, one_hot_encoding], axis=1)
    
    movies_top_250 = movies.sort_values('score', ascending=False).head(250)
    

    鉴于此

    • 我们想从前250个最佳收视率的作品中找出最好的三个流派
    • 这些类型电影的平均
    • 此外,我们还想找出这组影片中三个最差评级类型的平均分

    我在想一张透视表?这里只使用流派栏目的一个子集。

    pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)
    
                                                        Action  Adult   votes
    title                                                                    
    The Shawshank Redemption (1994)                          0      0  619479
    The Dark Knight (2008)                                   1      0  555122
    Pulp Fiction (1994)                                      0      0  490065
    The Godfather (1972)                                     0      0  474189
    Fight Club (1999)                                        0      0  458173
    The Lord of the Rings: The Fellowship of the Ri...       1      0  451263
    The Matrix (1999)                                        1      0  448114
    The Lord of the Rings: The Return of the King (...       1      0  428791
    Inception (2010)                                         1      0  385149
    The Lord of the Rings: The Two Towers (2002)             1      0  383113
    Forrest Gump (1994)                                      0      0  368994
    

    movies.groupby('genres').score.mean()
    

    返回类似

    genres
    Action                                                   5.837500
    Action|Adventure                                         6.152381
    Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
    Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
    Action|Adventure|Biography|Crime|History|Western         6.300000
    Action|Adventure|Biography|Drama|History                 7.700000
    

    Genre          mean_score  votes_sum
    Action           7.837500     103237
    Adventure        6.152381     103226
    Animation        5.500000     103275
    
    2 回复  |  直到 6 年前
        1
  •  1
  •   manuelmatas    6 年前

    你可以用这个 单线

    movies = \
        (movies.set_index(mv.columns.drop('genres',1).tolist())
        .genres.str.split('|',expand=True)
        .stack()
        .reset_index()
        .rename(columns={0:'genre'})
        .loc[:,['genre','score','votes']]
        .groupby('genre').agg({'score':['mean'], 'votes':['sum']})
        )
    
                  score    votes
                   mean      sum
    genre                      
    Action     8.425714  7912508
    Adventure  8.430000  7460632
    Animation  8.293333  1769806
    Biography  8.393750  2112875
    Comedy     8.341509  3166269
    ...
    

    解释

    主要问题是多重性 True one_hot_encoding 过程胜过体裁。一部电影可以指定一种或多种类型。因此,不能按类型正确使用聚合方法。另一方面,使用 genres field as is将溶解您在问题中显示的多种性别结果:

                                                               genres
    Action                                                   5.837500
    Action|Adventure                                         6.152381
    Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
    Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
    Action|Adventure|Biography|Crime|History|Western         6.300000
    Action|Adventure|Biography|Drama|History                 7.700000
    

    split expand ,您可以创建多个数据帧,然后对它们进行堆栈。例如,具有2种类型的电影将出现在2个结果数据帧中,其中每个数据帧表示分配给每个类型的电影。最后,在解析之后,您可以使用多个函数按性别进行聚合。我将逐步解释:

    1.获得前250部电影(按分数)

    import pandas as pd
    import numpy as np
    
    headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
    movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
    

    体裁

                 imdbID             title  year  score  votes   runtime genres
    7917  tt0990404  Chop Shop (2007)  2007    7.2   2104  84 mins.    NaN
    

    movies.loc[movies.genres.isnull(),"genres"] = "Drama"
    

    现在,正如你已经展示的,我们需要排名前250的电影:

    movies = movies.sort_values('score', ascending=False).head(250)
    

    2.使用split with expand从流派创建流派字段

    movies = movies.set_index(movies.columns.drop('genres',1).tolist())                                                                         
                                                                                         genres
    imdbID    title                                 year score votes  runtime                  
    tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Crime|Drama
    tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Crime|Drama
    tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.         Western
    tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Crime|Thriller
    tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.     Comedy|Drama
    (250, 1)
    

    2.2. 按流派划分

    这将从剥离的N次迭代中创建N个数据帧。

    movies = movies.genres.str.split('|',expand=True)
    
                                                                                           0  \
    imdbID    title                                 year score votes  runtime              
    tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.    Crime   
    tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.    Crime   
    tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.  Western   
    tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.    Crime   
    tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.    Comedy   
    
                                                                                        1  \
    imdbID    title                                 year score votes  runtime               
    tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Drama   
    tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Drama   
    tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.      None   
    tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Thriller   
    tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.      Drama   
    
    ...
    

    现在,每个电影都有一个唯一的genre值,其中一部电影可以有多行如果分配了多个genre,则可以堆叠数据帧集。请注意,现在我们有超过250行(662行),但是有250个不同的电影。

    movies = movies.stack()
    
    imdbID     title                                  year  score  votes   runtime     
    tt0111161  The Shawshank Redemption (1994)        1994  9.2    619479  142 mins.  0      Crime
                                                                                      1      Drama
    tt0068646  The Godfather (1972)                   1972  9.2    474189  175 mins.  0      Crime
                                                                                      1      Drama
    tt0060196  The Good, the Bad and the Ugly (1966)  1966  9.0    195238  161 mins.  0    Western
    dtype: object
    (662,)
    

    3.解析

    聚合前获取合适的数据结构:

    # Multiple index to columns
    movies = movies.reset_index()
    
    # Name the new column for genre
    movies = movies.rename(columns={0:'genre'})
    
    # Only wanted fields to be aggregated
    movies = movies.loc[:,['genre','score','votes']]
    
         genre  score   votes
    0    Crime    9.2  619479
    1    Drama    9.2  619479
    2    Crime    9.2  474189
    3    Drama    9.2  474189
    4  Western    9.0  195238
    (662, 3)
    

    4.骨料

    根据您的要求,分数必须按平均数进行汇总,投票必须按总和进行:

    movies = movies.groupby('genres').agg({'score':['mean'], 'votes':['sum']})
    
                  score    votes
                   mean      sum
    genre                      
    Action     8.425714  7912508
    Adventure  8.430000  7460632
    Animation  8.293333  1769806
    Biography  8.393750  2112875
    Comedy     8.341509  3166269
    (21, 2)
    
        2
  •  0
  •   shadowtalker    6 年前

    安装程序

    import io
    
    import numpy as np
    import pandas as pd
    
    colnames = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
    
    data_url = 'https://bpaste.net/raw/05fa224794e4'
    
    movies = pd.read_csv(data_url, sep="\t", header=None, names=colnames, encoding='UTF-8', index_col='imdbID')
    

    def arg_nlargest(x, n, use_index=True):
        if isinstance(x, pd.Series):
            x = x.values
        return np.argpartition(-x, n)[:n]
    

    250部顶级电影中的3个顶级类型

    top250_iloc = arg_nlargest(movies['score'], 250)
    movies250 = movies.iloc[top250_iloc]
    

    接下来,我们将每个电影的类型扩展为指标,就像您所做的那样

    movies250_genre_inds = movies250["genres"].str.get_dummies(sep='|')
    

    genre_agg = {}
    for genre in movies250_genre_inds.columns:
        mask = movies250_genre_inds[genre].astype(bool)
        aggregates = movies250.loc[mask].agg({'score': 'mean', 'votes': 'sum'}) 
        genre_agg[genre] = aggregates.tolist()
    
    genre_agg = pd.DataFrame.from_dict(genre_agg, orient='index', columns=['score_mean', 'votes_sum'])
    
    genre3_iloc = arg_nlargest(genre_agg['score_mean'], 3)
    genre3 = genre_agg.iloc[genre3_iloc].sort_values('score_mean', ascending=False)