You can use this one-liner:
movies = (
    movies.set_index(movies.columns.drop('genres').tolist())
    .genres.str.split('|', expand=True)
    .stack()
    .reset_index()
    .rename(columns={0: 'genre'})
    .loc[:, ['genre', 'score', 'votes']]
    .groupby('genre').agg({'score': ['mean'], 'votes': ['sum']})
)
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
...
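Explanation
The main problem is the multiplicity of True values that the one_hot_encoding process produces across genres. A movie can be assigned one or more genres, so you can't properly apply an aggregation method by genre to those columns. On the other hand, using the genres field as is treats every combination of genres as its own group, as the results in your question show: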
genres
Action 5.837500
Action|Adventure 6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy 7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi 6.100000
Action|Adventure|Biography|Crime|History|Western 6.300000
Action|Adventure|Biography|Drama|History 7.700000
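To see why grouping on the raw genres string falls short, here is a minimal sketch on made-up data (the toy frame and its values are assumptions for illustration, not rows from the IMDb file):

```python
import pandas as pd

# Made-up sample data, not taken from the IMDb file
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "score": [8.0, 7.0, 6.0],
    "genres": ["Action", "Action|Drama", "Drama"],
})

# Grouping on the raw string treats every combination as its own group,
# so movie "B" counts towards neither "Action" nor "Drama"
by_raw = toy.groupby("genres")["score"].mean()
print(by_raw)
```

Three groups come back instead of two, which is the same effect as in the output above.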
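Instead, using split with expand, you can spread the genres over several columns (one per genre position) and then stack them. For example, a movie with 2 genres will appear in 2 of the resulting columns, and after stacking it will have one row for each genre assigned to it. Finally, after parsing, you can aggregate by genre using multiple functions. I'll explain step by step: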
1. Get the top 250 movies (by score)
import pandas as pd
import numpy as np
headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
There is a null value in the genres field:
imdbID title year score votes runtime genres
7917 tt0990404 Chop Shop (2007) 2007 7.2 2104 84 mins. NaN
movies.loc[movies.genres.isnull(),"genres"] = "Drama"
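An equivalent way to handle the missing genre, sketched on a made-up two-row frame (the titles here are placeholders, and "Drama" simply mirrors the fill value used above):

```python
import numpy as np
import pandas as pd

# Made-up sample data with one missing genre
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Crime|Drama", np.nan],
})

# Inspect the rows whose genre is missing before changing anything
missing = toy[toy["genres"].isna()]
print(missing)

# Fill the gaps with a placeholder genre
toy["genres"] = toy["genres"].fillna("Drama")
```

Checking the missing rows first makes it easier to decide whether a single fill value is reasonable for your data.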
Now, as you have already shown, we need the top 250 movies by score:
movies = movies.sort_values('score', ascending=False).head(250)
2. Create the genre field from genres using split with expand
2.1. Set every column except genres as the index
movies = movies.set_index(movies.columns.drop('genres').tolist())
genres
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy|Drama
(250, 1)
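2.2. Split by genre
This creates N columns, one for each genre position produced by the split (the wrapped output below shows them as separate blocks).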
movies = movies.genres.str.split('|',expand=True)
0 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy
1 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. None
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Drama
...
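Now that each cell holds a single genre, you can stack the columns so that a movie gets one row per assigned genre. Note that the result has more than 250 rows (662), but still only 250 distinct movies.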
movies = movies.stack()
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. 0 Crime
1 Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. 0 Crime
1 Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. 0 Western
dtype: object
(662,)
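The split and stack steps can be tried on a small frame. This is a minimal sketch with made-up rows; the explicit dropna() is a precaution, on the assumption that some pandas versions keep missing values when stacking:

```python
import pandas as pd

# Made-up sample data: three movies, five genre assignments in total
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "score": [9.2, 9.0, 8.9],
    "genres": ["Crime|Drama", "Western", "Crime|Thriller"],
})

stacked = (
    toy.set_index(["title", "score"])      # keep the other fields in the index
       .genres.str.split("|", expand=True) # one column per genre position
       .stack()                            # one row per (movie, genre position)
       .dropna()                           # drop padding for movies with fewer genres
)
print(stacked)
print(stacked.shape)
```

The result has more rows than the input, while the number of distinct titles in the index stays the same.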
3. Parse
Get a suitable data structure before aggregating:
# Multiple index to columns
movies = movies.reset_index()
# Name the new column for genre
movies = movies.rename(columns={0:'genre'})
# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]
genre score votes
0 Crime 9.2 619479
1 Drama 9.2 619479
2 Crime 9.2 474189
3 Drama 9.2 474189
4 Western 9.0 195238
(662, 3)
4. Aggregate
As you requested, score must be aggregated by mean and votes by sum:
movies = movies.groupby('genre').agg({'score':['mean'], 'votes':['sum']})
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
(21, 2)
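As an alternative to split, stack and reset_index, the same reshaping can probably be done with explode (available in pandas 0.25 and later). This is a sketch of a different technique, not the method used above, shown on made-up data; the output column names score_mean and votes_sum are my own choice:

```python
import pandas as pd

# Made-up sample data
toy = pd.DataFrame({
    "score": [9.2, 9.0, 8.9],
    "votes": [100, 50, 30],
    "genres": ["Crime|Drama", "Western", "Crime|Thriller"],
})

result = (
    toy.assign(genre=toy["genres"].str.split("|"))  # list of genres per movie
       .explode("genre")                            # one row per (movie, genre)
       .groupby("genre")
       .agg(score_mean=("score", "mean"), votes_sum=("votes", "sum"))
)
print(result)
```

Named aggregation gives flat column names instead of the two-level header shown above, which can be easier to work with afterwards.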