My PySpark script saves the DataFrame it creates to a directory:
df.write.save(full_path, format=file_format, mode=options['mode'])
If I read the files back within the same run, everything works:
return sqlContext.read.format(file_format).load(full_path)
However, when I try to read from that directory in a separate run of the script, I get an error:
java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json
I know I can get a workaround from the hint that Spark prints:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
However, I would like to know why this fails, and what the proper way to handle this kind of problem is.
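
For what it's worth, the hint in the error message can be pictured with a small plain-Python sketch. This is not Spark code and does not reproduce Spark's internals. It assumes (the post does not confirm this) that the directory was overwritten after a file listing had already been captured, so the listing still names a part file that no longer exists. The helpers `write_dataset`, `list_parts`, `read_parts` and `demo` are hypothetical names made up for illustration.

```python
import os
import tempfile
import uuid


def write_dataset(directory):
    """Overwrite `directory` with one part file whose name carries a fresh UUID."""
    os.makedirs(directory, exist_ok=True)
    # Mimic an overwrite: delete the previous part files first.
    for name in os.listdir(directory):
        os.remove(os.path.join(directory, name))
    part_path = os.path.join(directory, "part-00000-{}.json".format(uuid.uuid4()))
    with open(part_path, "w") as handle:
        handle.write('{"value": 1}\n')
    return part_path


def list_parts(directory):
    """Return the full paths of the part files currently in `directory`."""
    return sorted(os.path.join(directory, name) for name in os.listdir(directory))


def read_parts(paths):
    """Read every listed file; raises FileNotFoundError if a path is gone."""
    contents = []
    for path in paths:
        with open(path) as handle:
            contents.append(handle.read())
    return contents


def demo():
    """Return (stale_listing_failed, rows_read_after_relisting)."""
    with tempfile.TemporaryDirectory() as root:
        directory = os.path.join(root, "some_data.json")
        write_dataset(directory)
        cached_listing = list_parts(directory)  # listing captured once
        write_dataset(directory)  # a later write replaces the part file
        try:
            read_parts(cached_listing)
            stale_failed = False
        except FileNotFoundError:
            stale_failed = True  # the cached listing names a deleted file
        fresh_rows = read_parts(list_parts(directory))  # re-list, then read
        return stale_failed, fresh_rows
```

In this analogy, re-listing the directory before reading plays the role of the two remedies the message names: running `REFRESH TABLE tableName`, or recreating the DataFrame involved.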