
Principal Component Analysis in Python

  •  107
  • Vebjorn Ljosa  ·  15 years ago

    I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh ?

    I don't just want to use singular value decomposition (SVD), because my input data is fairly high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.

    I was hoping to find a pre-made, debugged implementation that already makes the right decisions about when to use which method, and that perhaps does other optimizations I don't know about.
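
    For reference, a minimal sketch of the roll-your-own route via the covariance matrix and numpy.linalg.eigh (array shapes and names here are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 460))        # observations x dimensions
    Xc = X - X.mean(axis=0)                 # center each dimension

    cov = np.cov(Xc, rowvar=False)          # 460 x 460 covariance matrix
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # flip to descending order

    k = 10
    reduced = Xc @ evecs[:, :k]             # data projected onto the top k components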

    10 Answers  |  last active 7 years ago
        1
  •  28
  •   nikow    14 years ago

    You might want to take a look at MDP .

    I haven't had a chance to test it myself, but I've bookmarked it precisely for its PCA functionality.
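
    For illustration, PCA via MDP might look roughly like this (a sketch from memory, so attribute and keyword names may differ between MDP versions):

    import numpy as np
    import mdp

    x = np.random.random((100, 25))       # 100 observations x 25 variables

    # keep enough components to explain ~90 % of the variance
    pcanode = mdp.nodes.PCANode(output_dim=0.9)
    pcanode.train(x)
    pcanode.stop_training()
    y = pcanode.execute(x)                # projected data, shape (100, n_components)
    print(y.shape)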

        2
  •  65
  •   ali_m    7 years ago

    Months later, here's a small class PCA, and a picture:

    #!/usr/bin/env python
    """ a small class for Principal Component Analysis
    Usage:
        p = PCA( A, fraction=0.90 )
    In:
        A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
        fraction: use principal components that account for e.g.
            90 % of the total variance
    
    Out:
        p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
        p.dinv: 1/d or 0, see NR
        p.eigen: the eigenvalues of A*A, in decreasing order (p.d**2).
            eigen[j] / eigen.sum() is variable j's fraction of the total variance;
            look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
        p.npc: number of principal components,
            e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
            It's ok to change this; methods use the current value.
    
    Methods:
        The methods of class PCA transform vectors or arrays of e.g.
        20 variables, 2 principal components and 1000 observations,
        using partial matrices U' d' Vt', parts of the full U d Vt:
        A ~ U' . d' . Vt' where e.g.
            U' is 1000 x 2
            d' is diag([ d0, d1 ]), the 2 largest singular values
            Vt' is 2 x 20.  Dropping the primes,
    
        d . Vt      2 principal vars = p.vars_pc( 20 vars )
        U           1000 obs = p.pc_obs( 2 principal vars )
        U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
            fast approximate A . vars, using the `npc` principal components
    
        Ut              2 pcs = p.obs_pc( 1000 obs )
        V . dinv        20 vars = p.pc_vars( 2 principal vars )
        V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
            fast approximate Ainverse . obs: vars that give ~ those obs.
    
    
    Notes:
        PCA does not center or scale A; you usually want to first
            A -= A.mean(axis=0)
            A /= A.std(axis=0)
        with the little class Center or the like, below.
    
    See also:
        http://en.wikipedia.org/wiki/Principal_component_analysis
        http://en.wikipedia.org/wiki/Singular_value_decomposition
        Press et al., Numerical Recipes (2 or 3 ed), SVD
        PCA micro-tutorial
        iris-pca .py .png
    
    """
    
    from __future__ import division
    import numpy as np
    dot = np.dot
        # import bz.numpyutil as nu
        # dot = nu.pdot
    
    __version__ = "2010-04-14 apr"
    __author_email__ = "denis-bz-py at t-online dot de"
    
    #...............................................................................
    class PCA:
        def __init__( self, A, fraction=0.90 ):
            assert 0 <= fraction <= 1
                # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
            self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
            assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
            self.eigen = self.d**2
            self.sumvariance = np.cumsum(self.eigen)
            self.sumvariance /= self.sumvariance[-1]
            self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
            self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                    for d in self.d ])
    
        def pc( self ):
            """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
            n = self.npc
            return self.U[:, :n] * self.d[:n]
    
        # These 1-line methods may not be worth the bother;
        # then use U d Vt directly --
    
        def vars_pc( self, x ):
            n = self.npc
            return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal
    
        def pc_vars( self, p ):
            n = self.npc
            return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars
    
        def pc_obs( self, p ):
            n = self.npc
            return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs
    
        def obs_pc( self, obs ):
            n = self.npc
            return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal
    
        def obs( self, x ):
            return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs
    
        def vars( self, obs ):
            return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars
    
    
    class Center:
        """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
            uncenter(x) == original A . x
        """
            # mttiw
        def __init__( self, A, axis=0, scale=True, verbose=1 ):
            self.mean = A.mean(axis=axis)
            if verbose:
                print "Center -= A.mean:", self.mean
            A -= self.mean
            if scale:
                std = A.std(axis=axis)
                self.std = np.where( std, std, 1. )
                if verbose:
                    print "Center /= A.std:", self.std
                A /= self.std
            else:
                self.std = np.ones( A.shape[-1] )
            self.A = A
    
        def uncenter( self, x ):
            return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )
    
    
    #...............................................................................
    if __name__ == "__main__":
        import sys
    
        csv = "iris4.csv"  # wikipedia Iris_flower_data_set
            # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
        N = 1000
        K = 20
        fraction = .90
        seed = 1
        exec "\n".join( sys.argv[1:] )  # N= ...
        np.random.seed(seed)
        np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
        try:
            A = np.genfromtxt( csv, delimiter="," )
            N, K = A.shape
        except IOError:
            A = np.random.normal( size=(N, K) )  # gen correlated ?
    
        print "csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction)
        Center(A)
        print "A:", A
    
        print "PCA ..." ,
        p = PCA( A, fraction=fraction )
        print "npc:", p.npc
        print "% variance:", p.sumvariance * 100
    
        print "Vt[0], weights that give PC 0:", p.Vt[0]
        print "A . Vt[0]:", dot( A, p.Vt[0] )
        print "pc:", p.pc()
    
        print "\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0"
        x = np.ones(K)
        # x = np.ones(( 3, K ))
        print "x:", x
        pc = p.vars_pc(x)  # d' Vt' x
        print "vars_pc(x):", pc
        print "back to ~ x:", p.pc_vars(pc)
    
        Ax = dot( A, x.T )
        pcx = p.obs(x)  # U' d' Vt' x
        print "Ax:", Ax
        print "A'x:", pcx
        print "max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf )
    
        b = Ax  # ~ back to original x, Ainv A x
        back = p.vars(b)
        print "~ back again:", back
        print "max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf )
    
    # end pca.py
    

    (image: example PCA plot of the iris data set)

        3
  •  43
  •   j13r    9 years ago

    PCA using numpy.linalg.svd is super easy. Here's a simple demo:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.misc import lena
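    # note: scipy.misc.lena was removed in newer SciPy releases; any 2-D
    # grayscale image array (e.g. scipy.misc.ascent() or scipy.datasets.ascent(),
    # depending on the SciPy version) can be substituted here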
    
    # the underlying signal is a sinusoidally modulated image
    img = lena()
    t = np.arange(100)
    time = np.sin(0.1*t)
    real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]
    
    # we add some noise
    noisy = real + np.random.randn(*real.shape)*255
    
    # (observations, features) matrix
    M = noisy.reshape(noisy.shape[0],-1)
    
    # singular value decomposition factorises your data matrix such that:
    # 
    #   M = U*S*V.T     (where '*' is matrix multiplication)
    # 
    # * U and V are the singular matrices, containing orthogonal vectors of
    #   unit length in their rows and columns respectively.
    #
    # * S is a diagonal matrix containing the singular values of M - these 
    #   values squared divided by the number of observations will give the 
    #   variance explained by each PC.
    #
    # * if M is considered to be an (observations, features) matrix, the PCs
    #   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
    #   (features, observations) then the PCs would be the columns of
    #   U*S^(1/2).
    #
    # * since U and V both contain orthonormal vectors, U*V.T is equivalent 
    #   to a whitened version of M.
    
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    V = Vt.T
    
    # PCs are already sorted by descending order 
    # of the singular values (i.e. by the
    # proportion of total variance they explain)
    
    # if we use all of the PCs we can reconstruct the noisy signal perfectly
    S = np.diag(s)
    Mhat = np.dot(U, np.dot(S, V.T))
    print "Using all PCs, MSE = %.6G" %(np.mean((M - Mhat)**2))
    
    # if we use only the first 20 PCs the reconstruction is less accurate
    Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
    print "Using first 20 PCs, MSE = %.6G" %(np.mean((M - Mhat2)**2))
    
    fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
    ax1.imshow(img)
    ax1.set_title('true image')
    ax2.imshow(noisy.mean(0))
    ax2.set_title('mean of noisy images')
    ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
    ax3.set_title('first spatial PC')
    plt.show()
    
        4
  •  33
  •   Noam Peled    11 years ago

    You can use sklearn:

    import sklearn.decomposition as deco
    import numpy as np
    
    x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
    pca = deco.PCA(n_components) # n_components is the components number after reduction
    x_r = pca.fit(x).transform(x)
    print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))
    
        5
  •  31
  •   gypaetus    10 years ago
        6
  •  14
  •   dwf    15 years ago

    SVD should work just fine with 460 dimensions. It takes about 7 seconds on my Atom netbook. The eig() method takes more time (as it should, since it uses more floating-point operations) and will almost always be less accurate.

    If you have fewer than 460 samples, then what you want to do instead is diagonalize the scatter matrix (X - mean)^T (X - mean), assuming your data points are columns, and then left-multiply the resulting eigenvectors by (X - mean). That can be faster in the case where you have more dimensions than data points.
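
    A minimal numpy sketch of that trick, with illustrative shapes (data points as columns, more dimensions than samples):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 460, 100                           # dimensions, samples (d > n)
    X = rng.normal(size=(d, n))
    Xc = X - X.mean(axis=1, keepdims=True)    # center each dimension

    # small n x n Gram matrix instead of the d x d scatter matrix
    G = Xc.T @ Xc
    evals, V = np.linalg.eigh(G)              # ascending order
    evals, V = evals[::-1], V[:, ::-1]        # descending order

    k = 10
    U = Xc @ V[:, :k]                         # left-multiply: eigenvectors of the d x d scatter matrix
    U /= np.linalg.norm(U, axis=0)            # normalize columns
    projected = U.T @ Xc                      # k x n: data expressed in the top k components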

        7
  •  10
  •   Has QUIT--Anony-Mousse    10 years ago

    You can easily "roll" your own using scipy.linalg (assuming a pre-centered dataset data ):

    import scipy.linalg

    covmat = data.dot(data.T)
    evs, evmat = scipy.linalg.eig(covmat)
    

    Then evs are your eigenvalues, and evmat is your projection matrix.

    If you want to keep d dimensions, use the first d eigenvalues and the first d eigenvectors.

    Given that scipy.linalg has the decomposition and numpy has the matrix multiplications, what else do you need?
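
    Putting the pieces together, a hedged sketch of that recipe (here data is assumed to be laid out as observations x variables, and eigh is used instead of eig since the covariance matrix is symmetric; neither routine guarantees decreasing eigenvalue order, so the result is reordered explicitly):

    import numpy as np
    import scipy.linalg

    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 20))             # 200 observations x 20 variables
    data = data - data.mean(axis=0)               # pre-center, as assumed above

    covmat = data.T.dot(data) / (len(data) - 1)   # 20 x 20 covariance matrix
    evs, evmat = scipy.linalg.eigh(covmat)        # symmetric matrix -> real eigenvalues

    order = np.argsort(evs)[::-1]                 # sort eigenvalues in decreasing order
    evs, evmat = evs[order], evmat[:, order]

    d = 2
    projected = data.dot(evmat[:, :d])            # keep the first d components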

        8
  •  7
  •   sunqiang    15 years ago

    I just finished reading the book Machine Learning: An Algorithmic Perspective . All the code examples in the book are written in Python (almost all with numpy). The code snippet in chapter 10.2 Principal Components Analysis may be worth reading; it uses numpy.linalg.eig.
    By the way, I think SVD can handle 460 x 460 dimensions just fine. I have computed a 6500 x 6500 SVD with numpy/scipy.linalg.svd on a very old PC: a Pentium III at 733 MHz. To be honest, the script needed a lot of memory (about 1.x GB) and a lot of time (about 30 minutes) to get the SVD result. But I don't think 460 x 460 will be a big problem on a modern PC, unless you need to do SVDs a huge number of times.

        9
  •  5
  •   Nicolas Barbey    12 years ago

    You might not need the full singular value decomposition (SVD), since it computes all the eigenvalues and eigenvectors and can be prohibitive for large matrices. scipy and its sparse module provide generic linear algebra functions that work on both sparse and dense matrices, among them the eig* family of functions (see the sketch after the timings below):

    http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

    Scikit-learn provides a Python PCA implementation , which only supports dense matrices for now.

    Timings:

    In [1]: A = np.random.randn(1000, 1000)
    
    In [2]: %timeit scipy.sparse.linalg.eigsh(A)
    1 loops, best of 3: 802 ms per loop
    
    In [3]: %timeit np.linalg.svd(A)
    1 loops, best of 3: 5.91 s per loop
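
    A small sketch of computing only the leading components with the sparse solvers (illustrative shapes; svds works directly on the centered data matrix and returns just the k singular triplets you ask for, in no guaranteed order):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 460))
    Xc = X - X.mean(axis=0)                 # center the columns

    k = 10
    U, s, Vt = svds(Xc, k=k)                # only the k largest singular triplets
    order = np.argsort(s)[::-1]             # svds does not guarantee the order
    s, Vt = s[order], Vt[order]

    scores = Xc @ Vt.T                      # data projected onto the first k PCs
    explained = s**2 / np.sum(Xc**2)        # fraction of total variance per component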
    
        10
  •  4
  •   rcs    15 years ago

    Here is another implementation of a PCA module for Python using numpy, scipy and C extensions. The module carries out PCA using either SVD or the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm, which is implemented in C.
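
    For reference, a minimal pure-numpy sketch of the NIPALS iteration (illustrative only; the module above implements it in C and is what you would actually use):

    import numpy as np

    def nipals_pca(X, n_components=2, tol=1e-8, max_iter=500):
        """Tiny NIPALS sketch: X must already be centered (rows = observations)."""
        X = X.copy()
        scores, loadings = [], []
        for _ in range(n_components):
            t = X[:, 0].copy()                  # initial guess for the score vector
            for _ in range(max_iter):
                p = X.T @ t / (t @ t)           # loading vector
                p /= np.linalg.norm(p)
                t_new = X @ p                   # new score vector
                if np.linalg.norm(t_new - t) < tol:
                    t = t_new
                    break
                t = t_new
            X -= np.outer(t, p)                 # deflate: remove this component
            scores.append(t)
            loadings.append(p)
        return np.array(scores).T, np.array(loadings).T

    # usage sketch
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 20))
    A -= A.mean(axis=0)
    T, P = nipals_pca(A)                        # scores (100 x 2) and loadings (20 x 2)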