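I'm fairly new to Python. I'm playing with a dummy dataset to get some practice with data manipulation in Python. Here is the code that generates the dummy data: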
import pandas as pd

d = {
    'SeniorCitizen': [0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0],
    'CollegeDegree': [0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1],
    'Married': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
    'FulltimeJob': [1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,0,1],
    'DistancefromBranch': [7,9,14,20,21,12,22,25,9,9,9,12,13,14,16,25,27,4,14,14,20,19,15,23,2],
    'ReversedPayment': [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0],
}
CarWash = pd.DataFrame(data=d)
categoricals = ['SeniorCitizen', 'CollegeDegree', 'Married', 'FulltimeJob', 'ReversedPayment']
numerical = ['DistancefromBranch']
# Treat the 0/1 indicator columns as categorical rather than numeric
CarWash[categoricals] = CarWash[categoricals].astype('category')
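I'm basically struggling with a couple of things:
#1. A stacked bar chart with absolute values (like the Excel example below)
#2. A stacked bar chart with percentage values (like the Excel example below)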
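Below are my target visualizations for #1 and #2, which I would like to produce using countplot().
1. [image: target chart for #1]
2. [image: target chart for #2]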
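For #1, instead of a stacked bar chart, countplot() only lets me produce a clustered bar plot, as shown below, and the annotation snippet feels more like a workaround than elegant Python.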
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Loop through each categorical column and view the target variable distribution (ReversedPayment) by value
figure, axes = plt.subplots(2, 2, figsize=(10, 10))
for col, ax in zip(categoricals[:-1], axes.flatten()):
    sns.countplot(x=col, hue='ReversedPayment', data=CarWash, ax=ax)
    for p in ax.patches:
        height = np.nan_to_num(p.get_height())  # height of each patch/bar
        adjust = np.nan_to_num(p.get_width()) / 2  # offset used to position the data label
        label_xy = (np.nan_to_num(p.get_x()) + adjust, height + adjust)  # x, y coordinates of the data label
        ax.annotate(height, label_xy)  # final annotation
For #2, I tried creating a new DataFrame containing the % values, but that feels tedious and error-prone.
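For reference, the percentage table mentioned above might be built in one step with pd.crosstab and its normalize argument, rather than by hand. This is only a sketch: it uses just the Married and ReversedPayment columns from the dummy data, and the names df, counts and pct are illustrative.

```python
import pandas as pd

# Two columns copied from the dummy data above
df = pd.DataFrame({
    "Married": [0] * 17 + [1] * 8,
    "ReversedPayment": [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0],
})

# Absolute counts: one row per Married value, one column per ReversedPayment value
counts = pd.crosstab(df["Married"], df["ReversedPayment"])

# Row-wise percentages: normalize="index" makes each row sum to 1, then scale to 100
pct = pd.crosstab(df["Married"], df["ReversedPayment"], normalize="index") * 100

print(counts)
print(pct.round(1))
```

Either table can then be passed to a plotting call, so the percentages never have to be computed manually.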
I feel that options like stacked = True, proportion = True, axis = 1, annotate = True would have been really useful for countplot() to have.
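As far as I can tell countplot() has no such options, but something close to them can be sketched with pandas plotting plus matplotlib's Axes.bar_label (this assumes matplotlib 3.4 or newer). The helper stacked_bar and its proportion parameter are hypothetical names invented for this sketch, not part of seaborn or pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to show plots in a window
import matplotlib.pyplot as plt
import pandas as pd

# Two columns copied from the dummy data above
df = pd.DataFrame({
    "Married": [0] * 17 + [1] * 8,
    "ReversedPayment": [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0],
})

def stacked_bar(data, x, hue, proportion=False, ax=None):
    """Hypothetical helper: annotated stacked bar chart of counts or row percentages."""
    table = pd.crosstab(data[x], data[hue], normalize="index" if proportion else False)
    if proportion:
        table = table * 100  # convert row fractions to percentages
    ax = table.plot(kind="bar", stacked=True, ax=ax, rot=0)
    fmt = "%.1f%%" if proportion else "%d"
    # One container per hue level; label each segment at its center
    for container in ax.containers:
        ax.bar_label(container, fmt=fmt, label_type="center")
    ax.set_ylabel("Percent" if proportion else "Count")
    return ax, table

fig, (ax_abs, ax_pct) = plt.subplots(1, 2, figsize=(10, 4))
_, count_table = stacked_bar(df, "Married", "ReversedPayment", ax=ax_abs)
_, pct_table = stacked_bar(df, "Married", "ReversedPayment", proportion=True, ax=ax_pct)
```

The left panel corresponds to #1 (absolute values) and the right panel to #2 (percentages); looping the helper over the other categorical columns would work the same way.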
Are there any other libraries that are straightforward to use and less code-intensive? Any comments or suggestions are welcome.