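I have a task to split product prices into 3 groups: {high, average, low}. I tried to do this with k-means from the sklearn package. The data is a pandas DataFrame with a single float64 column: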
dfcl
Out[173]:
price
product_option_id
10012|0 372.15
10048|0 11.30
10049|0 12.26
10050|0 6.20
10051|0 5.90
10052|0 9.00
10053|0 11.10
10054|0 9.30
10055|0 4.20
10056|0 5.60
import pandas as pd
from sklearn.cluster import KMeans

# Convert the DataFrame to a NumPy array (as_matrix() was removed in pandas 1.0)
mat = dfcl.to_numpy()
# Fit k-means with 3 clusters using sklearn
km = KMeans(n_clusters=3)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pd.DataFrame(data=labels, columns=['cluster'], index=dfcl.index)
This produces a result, but the groups look very unbalanced:
print('Total features -', len(results))
print('Cluster 0 -',len(results.loc[results['cluster'] == 0]))
print('Cluster 1 -',len(results.loc[results['cluster'] == 1]))
print('Cluster 2 -',len(results.loc[results['cluster'] == 2]))
Total features - 5222
Cluster 0 - 4470
Cluster 1 - 733
Cluster 2 - 19
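By the way, when I refit on the same data, the cluster labels sometimes get swapped between groups (see the second run at the end). Is there a way to fix the imbalance between groups, and to keep the cluster names stable when the algorithm is rerun? I also tried
preprocessing.MinMaxScaler()
but that did not help either.
Maybe there is some other clustering algorithm that would do what I want, or some other hack?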
Total features - 5222
Cluster 0 - 733
Cluster 1 - 4470
Cluster 2 - 19
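A minimal sketch of two possible directions, not taken from the original post. It assumes the prices live in a one-dimensional pandas Series; the helper name `stable_price_groups`, the `seed` parameter, and the toy values are hypothetical. The first part fixes `random_state` and renumbers the k-means labels by ascending cluster center, so that 0 always means the cheapest group. The second part skips clustering and uses `pd.qcut` on ranks, which gives groups of nearly equal size by construction.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def stable_price_groups(prices, n_clusters=3, seed=0):
    """Cluster a price Series and renumber clusters by ascending center."""
    X = prices.to_numpy(dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # order[i] is the raw k-means label of the i-th smallest center.
    order = np.argsort(km.cluster_centers_.ravel())
    # remap[raw_label] gives the rank of that cluster's center.
    remap = np.empty(n_clusters, dtype=int)
    remap[order] = np.arange(n_clusters)
    return pd.Series(remap[km.labels_], index=prices.index, name="cluster")


# Toy data loosely shaped like the sample above (hypothetical values).
prices = pd.Series([372.15, 11.30, 12.26, 6.20, 5.90, 9.00, 150.0, 160.0],
                   name="price")

# Option 1: k-means with labels ordered low -> high.
groups = stable_price_groups(prices)

# Option 2: quantile bins on ranks, giving nearly equal group sizes.
# Ranking with method="first" breaks ties so the bin edges stay unique.
balanced = pd.qcut(prices.rank(method="first"), q=3,
                   labels=["low", "average", "high"])
```

Note that the two options answer different questions: k-means groups prices that are numerically close, so a long-tailed price distribution will naturally produce unequal clusters, whereas quantile binning forces equal counts regardless of how far apart the prices are.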