代码之家 › 专栏 › 技术社区 › bluesmonk

一个热编码数据帧上的聚合

pivot-table group-by pandas

bluesmonk · 技术社区 · 6 年前

假设我有一个拥有10000部电影的数据集 https://bpaste.net/show/05fa224794e4 ,数据集的ERCT为

tt0111161   The Shawshank Redemption (1994) 1994     9.2    619479  142 mins.   Crime|Drama
tt0110912   Pulp Fiction (1994) 1994     9.0    490065  154 mins.   Crime|Thriller
tt0137523   Fight Club (1999)   1999     8.8    458173  139 mins.   Drama|Mystery|Thriller
tt0133093   The Matrix (1999)   1999     8.7    448114  136 mins.   Action|Adventure|Sci-Fi
tt1375666   Inception (2010)    2010     8.9    385149  148 mins.   Action|Adventure|Sci-Fi|Thriller
tt0109830   Forrest Gump (1994) 1994     8.7    368994  142 mins.   Comedy|Drama|Romance
tt0169547   American Beauty (1999)  1999     8.6    338332  122 mins.   Drama
tt0499549   Avatar (2009)   2009     8.1    336855  162 mins.   Action|Adventure|Fantasy|Sci-Fi
tt0108052   Schindler's List (1993) 1993     8.9    325888  195 mins.   Biography|Drama|History|War
tt0080684   Star Wars: Episode V - The Empire Strikes Back (1980)   1980     8.8    320105  124 mins.   Action|Adventure|Family|Sci-Fi
tt0372784   Batman Begins (2005)    2005     8.3    316613  140 mins.   Action|Crime|Drama|Thriller
tt0114814   The Usual Suspects (1995)   1995     8.7    306624  106 mins.   Crime|Mystery|Thriller
tt0102926   The Silence of the Lambs (1991) 1991     8.7    293081  118 mins.   Crime|Thriller
tt0120338   Titanic (1997)  1997     7.4    284245  194 mins.   Adventure|Drama|History|Romance

我有这样的代码要点,加载我的数据集并对其执行一些更改

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
movies.head()

one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
movies = pd.concat([movies, one_hot_encoding], axis=1)

movies_top_250 = movies.sort_values('score', ascending=False).head(250)

鉴于此

我们想从前250个最佳收视率的作品中找出最好的三个流派
这些类型电影的平均
此外,我们还想找出这组影片中三个最差评级类型的平均分

我在想一张透视表?这里只使用流派栏目的一个子集。

pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)

                                                    Action  Adult   votes
title                                                                    
The Shawshank Redemption (1994)                          0      0  619479
The Dark Knight (2008)                                   1      0  555122
Pulp Fiction (1994)                                      0      0  490065
The Godfather (1972)                                     0      0  474189
Fight Club (1999)                                        0      0  458173
The Lord of the Rings: The Fellowship of the Ri...       1      0  451263
The Matrix (1999)                                        1      0  448114
The Lord of the Rings: The Return of the King (...       1      0  428791
Inception (2010)                                         1      0  385149
The Lord of the Rings: The Two Towers (2002)             1      0  383113
Forrest Gump (1994)                                      0      0  368994

movies.groupby('genres').score.mean()

返回类似

genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

Genre          mean_score  votes_sum
Action           7.837500     103237
Adventure        6.152381     103226
Animation        5.500000     103275

2 回复 | 直到 6 年前

manuelmatas 6 年前

你可以用这个单线

movies = \
    (movies.set_index(mv.columns.drop('genres',1).tolist())
    .genres.str.split('|',expand=True)
    .stack()
    .reset_index()
    .rename(columns={0:'genre'})
    .loc[:,['genre','score','votes']]
    .groupby('genre').agg({'score':['mean'], 'votes':['sum']})
    )

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
...

解释

主要问题是多重性 True 从 one_hot_encoding 过程胜过体裁。一部电影可以指定一种或多种类型。因此,不能按类型正确使用聚合方法。另一方面,使用 genres field as is将溶解您在问题中显示的多种性别结果:

                                                           genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

split expand ,您可以创建多个数据帧,然后对它们进行堆栈。例如,具有2种类型的电影将出现在2个结果数据帧中,其中每个数据帧表示分配给每个类型的电影。最后,在解析之后,您可以使用多个函数按性别进行聚合。我将逐步解释:

1.获得前250部电影(按分数)

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')

体裁

             imdbID             title  year  score  votes   runtime genres
7917  tt0990404  Chop Shop (2007)  2007    7.2   2104  84 mins.    NaN

movies.loc[movies.genres.isnull(),"genres"] = "Drama"

现在,正如你已经展示的,我们需要排名前250的电影:

movies = movies.sort_values('score', ascending=False).head(250)

2.使用split with expand从流派创建流派字段

movies = movies.set_index(movies.columns.drop('genres',1).tolist())                                                                         
                                                                                     genres
imdbID    title                                 year score votes  runtime                  
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Crime|Drama
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.         Western
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Crime|Thriller
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.     Comedy|Drama
(250, 1)

2.2. 按流派划分

这将从剥离的N次迭代中创建N个数据帧。

movies = movies.genres.str.split('|',expand=True)

                                                                                       0  \
imdbID    title                                 year score votes  runtime              
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.    Crime   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.    Crime   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.  Western   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.    Crime   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.    Comedy   

                                                                                    1  \
imdbID    title                                 year score votes  runtime               
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Drama   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Drama   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.      None   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Thriller   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.      Drama   

...

现在,每个电影都有一个唯一的genre值,其中一部电影可以有多行如果分配了多个genre,则可以堆叠数据帧集。请注意,现在我们有超过250行(662行),但是有250个不同的电影。

movies = movies.stack()

imdbID     title                                  year  score  votes   runtime     
tt0111161  The Shawshank Redemption (1994)        1994  9.2    619479  142 mins.  0      Crime
                                                                                  1      Drama
tt0068646  The Godfather (1972)                   1972  9.2    474189  175 mins.  0      Crime
                                                                                  1      Drama
tt0060196  The Good, the Bad and the Ugly (1966)  1966  9.0    195238  161 mins.  0    Western
dtype: object
(662,)

3.解析

聚合前获取合适的数据结构:

# Multiple index to columns
movies = movies.reset_index()

# Name the new column for genre
movies = movies.rename(columns={0:'genre'})

# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]

     genre  score   votes
0    Crime    9.2  619479
1    Drama    9.2  619479
2    Crime    9.2  474189
3    Drama    9.2  474189
4  Western    9.0  195238
(662, 3)

4.骨料

根据您的要求,分数必须按平均数进行汇总,投票必须按总和进行:

movies = movies.groupby('genres').agg({'score':['mean'], 'votes':['sum']})

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
(21, 2)

shadowtalker 6 年前

安装程序

import io

import numpy as np
import pandas as pd

colnames = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']

data_url = 'https://bpaste.net/raw/05fa224794e4'

movies = pd.read_csv(data_url, sep="\t", header=None, names=colnames, encoding='UTF-8', index_col='imdbID')

def arg_nlargest(x, n, use_index=True):
    if isinstance(x, pd.Series):
        x = x.values
    return np.argpartition(-x, n)[:n]

250部顶级电影中的3个顶级类型

top250_iloc = arg_nlargest(movies['score'], 250)
movies250 = movies.iloc[top250_iloc]

接下来,我们将每个电影的类型扩展为指标,就像您所做的那样

movies250_genre_inds = movies250["genres"].str.get_dummies(sep='|')

genre_agg = {}
for genre in movies250_genre_inds.columns:
    mask = movies250_genre_inds[genre].astype(bool)
    aggregates = movies250.loc[mask].agg({'score': 'mean', 'votes': 'sum'}) 
    genre_agg[genre] = aggregates.tolist()

genre_agg = pd.DataFrame.from_dict(genre_agg, orient='index', columns=['score_mean', 'votes_sum'])

genre3_iloc = arg_nlargest(genre_agg['score_mean'], 3)
genre3 = genre_agg.iloc[genre3_iloc].sort_values('score_mean', ascending=False)