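We need to pass only the coordinate columns of `df` to `KMeans`, and computing the distances to the centroids likewise needs only those columns, not `id`. So we can define a variable for this quantity: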
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
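Then we can compute the distance from the coordinate part of each row to its corresponding centroid:
import numpy as np
centroids = kmeans.cluster_centers_
# axis=1 gives one Euclidean norm per row; without it, norm returns a single scalar
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster']], axis=1)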
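Notice that `centroids[df['cluster']]` returns a NumPy array of the same shape as `points`. Indexing by `df['cluster']` "expands" the `centroids` array, so that row i holds the centroid assigned to point i.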
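As a minimal sketch of that "expansion", the snippet below indexes a small centroid array with an array of labels. The `centroids` and `labels` values here are made-up toy data, not the output of a fitted `KMeans` model.

```python
import numpy as np

# Hypothetical toy data: 3 centroids in 2-D and a cluster label for each of 5 points.
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0],
                      [20.0, 0.0]])
labels = np.array([2, 0, 0, 1, 2])

# Integer-array indexing selects one centroid row per label,
# so the result has one row per point rather than one row per cluster.
expanded = centroids[labels]

print(expanded.shape)        # (5, 2)
print(expanded[0].tolist())  # [20.0, 0.0]
```

Because `expanded` lines up row for row with the points, it can be subtracted from them directly.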
Then we can assign these `dist` values to a DataFrame column using
df['dist'] = dist
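One detail worth checking before assigning the column is that the norm is taken per row. The sketch below uses made-up toy arrays (`points` and `matched` are illustrative, not taken from the data in this answer) to contrast `np.linalg.norm` with and without `axis=1`.

```python
import numpy as np

# Hypothetical toy data: 3 points and the centroid already matched to each point.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
matched = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])

diff = points - matched

# axis=1 takes the Euclidean norm of each row: one distance per point.
row_dist = np.linalg.norm(diff, axis=1)

# With no axis, norm treats diff as one matrix and returns a single
# Frobenius norm; assigned to a DataFrame column, that scalar would be
# repeated in every row.
total = np.linalg.norm(diff)

print(row_dist.tolist())  # [0.0, 5.0, 10.0]
print(np.ndim(total))     # 0
```

If every row of a distance column shows the same value, a missing `axis` argument is a likely cause.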
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster

df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
# Keep only the coordinate columns
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
# axis=1 gives one Euclidean distance per row
dist = np.linalg.norm(points.to_numpy() - centroids[df['cluster']], axis=1)
df['dist'] = dist
print(df)
yields (with five points and five clusters, every point is its own centroid, so each distance is zero up to floating-point rounding; the exact labels may vary with the scikit-learn version)
   Type1   Type2   Type3     id  cluster          dist
0    0.0    0.00    0.00   1000        4  0.000000e+00
1    0.0   63.72    0.00  10001        2  2.842171e-14
2  473.6  174.00   31.60  10002        1  0.000000e+00
3    0.0  996.00  160.92  10003        3  0.000000e+00
4    0.0  524.91    0.00  10004        0  0.000000e+00
If you want the distance from every point to every cluster centroid, you can use `sdist.cdist`:
import scipy.spatial.distance as sdist
sdist.cdist(points, centroids)
For example,
import numpy as np
import pandas as pd
import sklearn.cluster as cluster
import scipy.spatial.distance as sdist
df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
                   'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
                   'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
                   'id': [1000, 10001, 10002, 10003, 10004]})
points = df.drop('id', axis=1)
# or points = df[['Type1', 'Type2', 'Type3']]
kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
df['cluster'] = kmeans.labels_
centroids = kmeans.cluster_centers_
dists = pd.DataFrame(
    sdist.cdist(points, centroids),
    columns=['dist_{}'.format(i) for i in range(len(centroids))],
    index=df.index)
df = pd.concat([df, dists], axis=1)
print(df)
yields
   Type1   Type2   Type3     id  cluster      dist_0      dist_1        dist_2       dist_3       dist_4
0    0.0    0.00    0.00   1000        4  524.910000  505.540819  6.372000e+01  1008.915877     0.000000
1    0.0   63.72    0.00  10001        2  461.190000  487.295802  2.842171e-14   946.066195    63.720000
2  473.6  174.00   31.60  10002        1  590.282431    0.000000  4.872958e+02   957.446929   505.540819
3    0.0  996.00  160.92  10003        3  497.816266  957.446929  9.460662e+02     0.000000  1008.915877
4    0.0  524.91    0.00  10004        0    0.000000  590.282431  4.611900e+02   497.816266   524.910000
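For intuition about what such a point-to-centroid matrix contains, here is a rough NumPy-only sketch that builds the same kind of Euclidean distance matrix by broadcasting. The toy `points` and `centroids` are made up for illustration, and this is an equivalent formulation under that assumption, not how `cdist` is implemented.

```python
import numpy as np

# Hypothetical toy data: 4 points and 2 centroids in 2-D.
points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0], [10.0, 3.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])

# Broadcasting (n, 1, d) against (1, k, d) gives every point-centroid difference;
# the norm over the last axis collapses it to an (n, k) distance matrix.
all_dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# The nearest centroid per point is the column index of the row minimum.
nearest = all_dists.argmin(axis=1)
# Pick each point's distance to its nearest centroid from the matrix.
nearest_dist = all_dists[np.arange(len(points)), nearest]

print(all_dists.shape)        # (4, 2)
print(nearest.tolist())       # [0, 0, 1, 1]
print(nearest_dist.tolist())  # [0.0, 5.0, 0.0, 3.0]
```

Row i, column j of the matrix is the distance from point i to centroid j, so the per-point distance to the assigned centroid is simply one entry from each row.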