
k-means: Euclidean distance to each centroid without splitting the features from the rest of the df

  •  0
  • OverflowingTheGlass  · Tech Community  · 6 years ago

    I have a df:

        id      Type1   Type2   Type3   
    0   10000   0.0     0.00    0.00    
    1   10001   0.0     63.72   0.00    
    2   10002   473.6   174.00  31.60   
    3   10003   0.0     996.00  160.92  
    4   10004   0.0     524.91  0.00
    

    I applied k-means to this df and added the resulting cluster labels to the df:

    kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(df.drop('id', axis=1))
    df['cluster'] = kmeans.labels_
    

    Now I am trying to add columns to the df holding the Euclidean distance between each point (i.e. each row of the df) and each centroid:

    def distance_to_centroid(row, centroid):
        row = row[['Type1',
                   'Type2',
                   'Type3']]
        return euclidean(row, centroid)
    
    df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
    

    This results in the following error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-34-56fa3ae3df54> in <module>()
    ----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
    
    ~\_installed\anaconda\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
       6002                          args=args,
       6003                          kwds=kwds)
    -> 6004         return op.get_result()
       6005 
       6006     def applymap(self, func):
    
    ~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in get_result(self)
        140             return self.apply_raw()
        141 
    --> 142         return self.apply_standard()
        143 
        144     def apply_empty_result(self):
    
    ~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_standard(self)
        246 
        247         # compute the result using the series generator
    --> 248         self.apply_series_generator()
        249 
        250         # wrap results
    
    ~\_installed\anaconda\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
        275             try:
        276                 for i, v in enumerate(series_gen):
    --> 277                     results[i] = self.f(v)
        278                     keys.append(v.name)
        279             except Exception as e:
    
    <ipython-input-34-56fa3ae3df54> in <lambda>(r)
    ----> 1 df['distance_to_center_0'] = df.apply(lambda r: distance_to_centroid(r, kmeans.cluster_centers_[0]),1)
    
    <ipython-input-33-7b988ca2ad8c> in distance_to_centroid(row, centroid)
          7                 'atype',
          8                 'anothertype']]
    ----> 9     return euclidean(row, centroid)
    
    ~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in euclidean(u, v, w)
        596 
        597     """
    --> 598     return minkowski(u, v, p=2, w=w)
        599 
        600 
    
    ~\_installed\anaconda\lib\site-packages\scipy\spatial\distance.py in minkowski(u, v, p, w)
        488     if p < 1:
        489         raise ValueError("p must be at least 1")
    --> 490     u_v = u - v
        491     if w is not None:
        492         w = _validate_weights(w)
    
    ValueError: ('operands could not be broadcast together with shapes (7,) (8,) ', 'occurred at index 0')
    

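    The error seems to occur because id is not included in the row variable inside the function distance_to_centroid. To fix this I could split the df into two parts (id in df1 and the remaining columns in df2). However, that is very manual and does not make it easy to change the columns. Is there a way to get the distance to each centroid into the original df without splitting it? Similarly, is there a better way to compute the Euclidean distance that does not require typing the columns into the row variable by hand, or manually creating as many columns as there are clusters?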

    Expected result:

        id      Type1   Type2   Type3   cluster    distance_to_cluster_0
    0   10000   0.0     0.00    0.00    1          2.3
    1   10001   0.0     63.72   0.00    2          3.6 
    2   10002   473.6   174.00  31.60   0          0.5 
    3   10003   0.0     996.00  160.92  3          3.7 
    4   10004   0.0     524.91  0.00    4          1.8  
    
    1 answer
  •  4
  •   unutbu    6 years ago

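    We need to pass only the coordinate part of df to KMeans, and we want to compute the distances to the centroids using only that same coordinate part of df. So we can define a variable for this quantity: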

    points = df.drop('id', axis=1)
    # or points = df[['Type1', 'Type2', 'Type3']]
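    If the feature columns change often, the coordinate part can also be derived by exclusion rather than typed out. The following is only a sketch: the list name non_feature_cols is hypothetical, and it assumes every column other than id and cluster is a coordinate.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10000, 10001, 10002],
    'Type1': [0.0, 0.0, 473.6],
    'Type2': [0.0, 63.72, 174.0],
    'Type3': [0.0, 0.0, 31.6],
})

# Hypothetical list of columns that are NOT coordinates; adjust to your data.
non_feature_cols = ['id', 'cluster']

# errors='ignore' skips names that are absent (e.g. 'cluster' before fitting),
# so the same line works both before and after the labels are added.
points = df.drop(columns=non_feature_cols, errors='ignore')
feature_cols = list(points.columns)

print(feature_cols)
```

    Excluding cluster as well matters if points is rebuilt after the labels have been added to df; otherwise the label column would be treated as a coordinate.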
    

    Then we can compute the distance from the coordinate part of each row to its corresponding centroid using:

    import numpy as np
    centroids = kmeans.cluster_centers_
    # axis=1 gives one Euclidean norm per row rather than a single matrix norm
    dist = np.linalg.norm(points - centroids[df['cluster']], axis=1)
    

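    Note that centroids[df['cluster']] returns a NumPy array with the same shape as points. Indexing by df['cluster'] "expands" the centroids array so that each row holds the centroid assigned to the corresponding point.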
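    To make that "expansion" concrete, here is a minimal sketch using made-up 2-D centroids and labels (not the values from the question). It also takes the norm with axis=1 so that there is one distance per row.

```python
import numpy as np

# Three made-up centroids in 2-D and a made-up label for each of four points.
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0],
                      [20.0, 0.0]])
labels = np.array([2, 0, 0, 1])

# Integer-array indexing repeats centroid rows so that row i is the
# centroid assigned to point i: shape (4, 2) rather than (3, 2).
expanded = centroids[labels]

points = np.array([[21.0, 0.0],
                   [0.0, 3.0],
                   [4.0, 3.0],
                   [10.0, 10.0]])

# One Euclidean distance per row.
dist = np.linalg.norm(points - expanded, axis=1)
print(expanded.shape)  # (4, 2)
print(dist)            # distances 1, 3, 5 and 0
```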

    We can then assign these dist values to a DataFrame column using

    df['dist'] = dist
    

    For example,

    import numpy as np
    import pandas as pd
    import sklearn.cluster as cluster
    
    df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
     'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
     'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
     'id': [1000, 10001, 10002, 10003, 10004]})
    
    points = df.drop('id', axis=1)
    # or points = df[['Type1', 'Type2', 'Type3']]
    kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
    df['cluster'] = kmeans.labels_
    
    centroids = kmeans.cluster_centers_
    # axis=1 gives one Euclidean norm per row rather than a single matrix norm
    dist = np.linalg.norm(points - centroids[df['cluster']], axis=1)
    df['dist'] = dist
    
    print(df)
    

    prints output like the following (with 5 points and 5 clusters each point is its own centroid, so every distance is zero up to floating-point rounding):

       Type1   Type2   Type3     id  cluster          dist
    0    0.0    0.00    0.00   1000        4  0.000000e+00
    1    0.0   63.72    0.00  10001        2  2.842171e-14
    2  473.6  174.00   31.60  10002        1  0.000000e+00
    3    0.0  996.00  160.92  10003        3  0.000000e+00
    4    0.0  524.91    0.00  10004        0  0.000000e+00
    

    If you want the distance from each point to every cluster centroid, you can use sdist.cdist:

    import scipy.spatial.distance as sdist
    sdist.cdist(points, centroids)
    

    For example,

    import numpy as np
    import pandas as pd
    import sklearn.cluster as cluster
    import scipy.spatial.distance as sdist
    
    df = pd.DataFrame({'Type1': [0.0, 0.0, 473.6, 0.0, 0.0],
     'Type2': [0.0, 63.72, 174.0, 996.0, 524.91],
     'Type3': [0.0, 0.0, 31.6, 160.92, 0.0],
     'id': [1000, 10001, 10002, 10003, 10004]})
    
    points = df.drop('id', axis=1)
    # or points = df[['Type1', 'Type2', 'Type3']]
    kmeans = cluster.KMeans(n_clusters=5, random_state=0).fit(points)
    df['cluster'] = kmeans.labels_
    
    centroids = kmeans.cluster_centers_
    dists = pd.DataFrame(
        sdist.cdist(points, centroids), 
        columns=['dist_{}'.format(i) for i in range(len(centroids))],
        index=df.index)
    df = pd.concat([df, dists], axis=1)
    
    print(df)
    

    yields

       Type1   Type2   Type3     id  cluster      dist_0      dist_1        dist_2       dist_3       dist_4
    0    0.0    0.00    0.00   1000        4  524.910000  505.540819  6.372000e+01  1008.915877     0.000000
    1    0.0   63.72    0.00  10001        2  461.190000  487.295802  2.842171e-14   946.066195    63.720000
    2  473.6  174.00   31.60  10002        1  590.282431    0.000000  4.872958e+02   957.446929   505.540819
    3    0.0  996.00  160.92  10003        3  497.816266  957.446929  9.460662e+02     0.000000  1008.915877
    4    0.0  524.91    0.00  10004        0    0.000000  590.282431  4.611900e+02   497.816266   524.910000