I applied PySpark's TF-IDF functions and got the following result.
| features |
|----------|
| (35,[7,9,11,12,19,26,33],[1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003,1.6094379124341003,1.6094379124341003,1.6094379124341003]) |
| (35,[0,2,4,5,6,11,22],[0.9162907318741551,0.9162907318741551,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003]) |
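So the DataFrame has a single column (`features`) whose rows are sparse vectors.
Now I want to build an `IndexedRowMatrix` from this DataFrame so that I can run the singular value decomposition (SVD) function described here: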
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=svd#pyspark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD
I tried the following, but it did not work:
```
mat = RowMatrix(tfidfData.rdd.map(lambda x: x.features))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
```
So how can I build an `IndexedRowMatrix` from the output of a TF-IDF DataFrame in PySpark?
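
For reference, the error message suggests that the `features` column holds `pyspark.ml.linalg` vectors, while the distributed matrices in `pyspark.mllib` expect `pyspark.mllib.linalg` vectors. Below is a minimal sketch of one possible conversion, not a definitive solution. It assumes Spark 2.2 or later (where `computeSVD` is available from Python), that the row order given by `zipWithIndex` is acceptable as the row index, and that the column is named `features`. The helper name `to_indexed_row_matrix` is hypothetical.

```python
def to_indexed_row_matrix(df, col="features"):
    """Build an IndexedRowMatrix from a DataFrame column of pyspark.ml vectors.

    Hypothetical helper for illustration; `df` is assumed to be a DataFrame
    such as tfidfData with a vector column named `col`.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    indexed_rows = (
        df.rdd
        .zipWithIndex()  # yields (Row, index) pairs
        # Vectors.fromML converts a pyspark.ml vector to its pyspark.mllib counterpart.
        .map(lambda pair: IndexedRow(pair[1], Vectors.fromML(pair[0][col])))
    )
    return IndexedRowMatrix(indexed_rows)


# Usage, assuming tfidfData is the DataFrame shown above:
# mat = to_indexed_row_matrix(tfidfData)
# svd = mat.computeSVD(5, computeU=True)  # k=5 is an arbitrary choice, at most 35 here
# U, s, V = svd.U, svd.s, svd.V
```

The same `Vectors.fromML` call should also work with a plain `RowMatrix` if row indices are not needed.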