
Why doesn't Naive Bayes work in a Spark MLlib pipeline the way logistic regression does?

  •  2
  • CJ Sullivan  · Tech Community  · 8 years ago

    I am doing sentiment analysis on tweets using Spark and Scala. I have a working version that uses a logistic regression model, shown below:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
    import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
    import org.apache.spark.sql.functions._
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.Word2Vec
    import org.apache.spark.mllib.evaluation.RegressionMetrics
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // needed for .toDF on RDDs outside the shell
    
    // Sentiment140 training corpus
    val trainFile = "s3://someBucket/training.1600000.processed.noemoticon.csv"
    val swFile = "s3://someBucket/stopwords.txt"
    val tr = sc.textFile(trainFile)
    val stopwords: Array[String] = sc.textFile(swFile).flatMap(_.stripMargin.split("\\s+")).collect ++ Array("rt")
    
    // Parse the CSV: strip quotes, keep polarity (field 0) and tweet text (field 5),
    // drop neutral tweets (polarity 2), and rescale polarities 0/4 to labels 0.0/1.0
    val parsed = tr.filter(_.contains("\",\""))
      .map(_.split("\",\"").map(_.replace("\"", "")))
      .filter(row => row.forall(_.nonEmpty))
      .map(row => (row(0).toDouble, row(5)))
      .filter(row => row._1 != 2)
      .map(row => (row._1 / 4, row._2))
    val pDF = parsed.toDF("label","tweet") 
    val tokenizer = new RegexTokenizer().setGaps(false).setPattern("\\p{L}+").setInputCol("tweet").setOutputCol("words")
    val filterer = new StopWordsRemover().setStopWords(stopwords).setCaseSensitive(false).setInputCol("words").setOutputCol("filtered")
    val countVectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features")
    
    val lr = new LogisticRegression().setMaxIter(50).setRegParam(0.2).setElasticNetParam(0.0) 
    val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, lr))
    
    val lrModel = pipeline.fit(pDF)
    
    // Now the model is built. Let's get some test data...
    
    val testFile = "s3://someBucket/testdata.manual.2009.06.14.csv"
    val te = sc.textFile(testFile)
    // Parse the test set with the same transformations as the training set
    val teparsed = te.filter(_.contains("\",\""))
      .map(_.split("\",\"").map(_.replace("\"", "")))
      .filter(row => row.forall(_.nonEmpty))
      .map(row => (row(0).toDouble, row(5)))
      .filter(row => row._1 != 2)
      .map(row => (row._1 / 4, row._2))
    val teDF = teparsed.toDF("label","tweet")
    
    val res = lrModel.transform(teDF)
    // BinaryClassificationMetrics expects (prediction, label) pairs
    val restup = res.select("label","prediction").rdd.map(r => (r(1).asInstanceOf[Double], r(0).asInstanceOf[Double]))
    val metrics = new BinaryClassificationMetrics(restup)
    
    metrics.areaUnderROC()
    

    With logistic regression this returns a perfectly reasonable AUC. However, when I swap the logistic regression stage for val nb = new NaiveBayes(), I get the following error:

    found   : org.apache.spark.mllib.classification.NaiveBayes
    required: org.apache.spark.ml.PipelineStage
       val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, nb))
    
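    For reference, the swap that triggers this (reconstructed from the error message, with the mllib NaiveBayes import from the top of the question) is simply:

    val nb = new NaiveBayes()   // resolves to org.apache.spark.mllib.classification.NaiveBayes
    val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, nb))
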

    The API documentation for PipelineStage lists both logistic regression and naive Bayes as subclasses. So why does LR work while NB does not?

    1 Answer  |  8 years ago
        1
  •  3
  •   user7333721    8 years ago

    It doesn't work because you are using the wrong class. With pipelines you should use:

    org.apache.spark.ml.classification.NaiveBayes
    

    Consult the documentation for the correct syntax.
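
    To spell out why this matters: Pipeline.setStages only accepts instances of org.apache.spark.ml.PipelineStage. The subclasses listed in the docs are the DataFrame-based classes under org.apache.spark.ml, while org.apache.spark.mllib.classification.NaiveBayes belongs to the older RDD-based API and is not a PipelineStage. The logistic regression version worked only because it imported the ml class. A minimal sketch of the fix, reusing the stages from the question (the setSmoothing and setModelType calls are illustrative and simply mirror the defaults):

    // DataFrame-based classifier: an Estimator, and therefore a valid PipelineStage
    import org.apache.spark.ml.classification.NaiveBayes

    val nb = new NaiveBayes()
      .setSmoothing(1.0)            // additive (Laplace) smoothing; 1.0 is the default
      .setModelType("multinomial")  // suits the term counts produced by CountVectorizer

    val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, nb))
    val nbModel = pipeline.fit(pDF)
    val nbRes = nbModel.transform(teDF)

    The rest of the evaluation code works unchanged, since ml classifiers all emit a "prediction" column that BinaryClassificationMetrics can consume as before.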