I have been trying to understand why I see strange behavior when running a Spark job. The job will error out if I perform an action (a .show(1) call) either after caching the DataFrame or right before writing the DataFrame back to HDFS.
There is a very similar post here: Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'.
Basically, that post explains that when you read from the same HDFS directory you are writing to, and your SaveMode is "overwrite", you will get a java.io.FileNotFoundException.
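To make that failure mode concrete, here is a minimal sketch of the pattern the linked post describes (the path is a placeholder, not taken from my job; schema and spark are assumed to be in scope):

val df = spark.read.format("csv")
  .option("delimiter", "\t")
  .schema(schema)
  .load("hdfs:///some/dir")    // hypothetical input directory

// "overwrite" deletes the existing files in the directory before all of df's
// partitions have necessarily been computed, so any task that still needs to
// read the old files afterwards fails with java.io.FileNotFoundException.
df.write
  .mode("overwrite")
  .csv("hdfs:///some/dir")     // same directory we are lazily reading from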
But here I am finding that just moving where the action is placed in the program gives very different results: the program either runs to completion or throws this exception. I am wondering whether anyone can explain why Spark is being inconsistent here?
val myDF = spark.read.format("csv")
.option("header", "false")
.option("delimiter", "\t")
.schema(schema)
.load(myPath)
// If I cache or persist the DataFrame here and then run an action right after,
// the error is thrown only occasionally. Each test run completely restarts the
// SparkSession, so there is no risk of another user interfering on the same JVM.
myDF.cache()
myDF.show(1)
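// (An assumption worth testing, not something I have verified): .show(1) only
// computes as many partitions as it needs to produce one row, so only those
// partitions actually land in the cache. Forcing a full materialization instead,
// e.g.
//   myDF.count()
// would cache every partition before the source directory is overwritten below.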
// Just an example.
// Many different transformations are then applied...
val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)
val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)
// Below is the same .show(1) action as above, but this call ALWAYS completes
// successfully, while the earlier myDF.show(1) sometimes throws
// FileNotFoundException and sometimes completes. The only thing that changes
// between test runs is which of the two calls runs: either fourthDF.show(1)
// or myDF.show(1) is left commented out.
fourthDF.show(1)
fourthDF.write
.mode(writeMode)
.option("header", "false")
.option("delimiter", "\t")
.csv(myPath)
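For what it's worth, a possible workaround (untested, and not part of the job above) would be to break the lineage before writing back to the same path, e.g. with Dataset.checkpoint; the checkpoint directory below is a placeholder:

// Sketch of a workaround: checkpoint() materializes fourthDF to reliable storage
// and truncates its lineage, so the final write no longer depends on the input
// files it is about to replace.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder

val materialized = fourthDF.checkpoint()  // eager by default: computes and saves now

materialized.write
  .mode(writeMode)
  .option("header", "false")
  .option("delimiter", "\t")
  .csv(myPath)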