I am new to PySpark and am trying to build an ML model in PySpark.
My goal is to create a TF-IDF vectorizer and pass those features to my SVM model.
I tried this:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("Stream")
sc = SparkContext(conf=conf)
parallelized = sc.parallelize(Dataset.CleanText)
# Dataset is a pandas DataFrame with CleanText as one of its columns
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf = hashingTF.transform(parallelized)
# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
# First to compute the IDF vector and second to scale the term frequencies by IDF.
#tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print("vecs: ", tfidf.glom().collect())
# This prints all the TF-IDF vectors
import numpy as np
labels = np.array(Dataset['LabelNo'])
Now, how should I pass these TF-IDF vectors and label values to my model?
I followed
http://spark.apache.org/docs/2.0.0/api/python/pyspark.mllib.html
and tried to create the labeled points as
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SparkSessionZipsExample").getOrCreate()
dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
df = spark.createDataFrame(sc.parallelize(dd), schema=["label", "features"])
print("df: ", df.glom().collect())
But this gives me an error:
---> 15 dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
     16 df = spark.createDataFrame(sc.parallelize(dd), schema=["label", "features"])
     17

TypeError: 'RDD' object does not support indexing