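I want to compare two texts in Scala and compute a similarity rate. I have the code below, but I have not managed to compute the average inside the for loop. I am new to Scala and do not know how to do this in the loop.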
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]):Unit = {
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val wordCounts1 = sc.textFile("/chatblanc.txt"). //The white cat is eating a white soup
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCounts1.collect.foreach(println)
//Res : (is,1)
// (eating,1)
// (cat,1)
// (white,2)
// (The,1)
// (soup,1)
// (a,1)
print("======= End first file ========\n")
val wordCounts2 = sc.textFile("/chatnoir.txt").
//The black cat is eating a white sandwich
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCounts2.collect.foreach(println)
// Res2 : (is,1)
// (eating,1)
// (cat,1)
// (white,1)
// (The,1)
// (a,1)
// (sandwich,1)
// (black,1)
print("======= End second file ========\n")
print("======= Display similarity rate ========\n")
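// Keep only the pairs whose words match, then take the min/max ratio of their counts
val result = wordCounts1.cartesian(wordCounts2).filter { case (t1, t2) => t1._1 == t2._1 }.map { case (t1, t2) => Math.min(t1._2, t2._2).toDouble / Math.max(t1._2, t2._2).toDouble }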
result.collect.foreach(println)
//Res :
// 1.0
// 1.0
// 1.0
// 0.5
// 1.0
// 1.0
}
}
In the end, what we want is to store the average of these 6 values in a variable.
Can you help me?
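To make the target value concrete, here is a minimal sketch of the averaging step on plain Scala collections rather than RDDs, so it runs without a Spark installation. The object name `SimilarityMean`, the helpers `wordCounts` and `similarity`, and the choice to return 0.0 when the texts share no words are all assumptions made for illustration, not part of the code above.

```scala
object SimilarityMean {
  // Count word occurrences, splitting on non-word characters as in the question.
  def wordCounts(text: String): Map[String, Int] =
    text.split("\\W+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }

  // Mean of the min/max count ratios over the words that both texts share.
  def similarity(a: Map[String, Int], b: Map[String, Int]): Double = {
    // toSeq before map: mapping over a Set would collapse equal ratios into one.
    val ratios = a.keySet.intersect(b.keySet).toSeq.map { word =>
      math.min(a(word), b(word)).toDouble / math.max(a(word), b(word)).toDouble
    }
    // Assumption: texts with no shared words get a similarity of 0.0.
    if (ratios.isEmpty) 0.0 else ratios.sum / ratios.size
  }

  def main(args: Array[String]): Unit = {
    val counts1 = wordCounts("The white cat is eating a white soup")
    val counts2 = wordCounts("The black cat is eating a white sandwich")
    val average = similarity(counts1, counts2) // (1 + 1 + 1 + 0.5 + 1 + 1) / 6
    println(average) // roughly 0.917
  }
}
```

On the Spark side, an `RDD[Double]` such as `result` exposes `mean()`, so `val average = result.mean()` should store the same value in a variable without writing a loop.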