You can read the data as a text file, replace `---` with `0`, and remove (or filter out) the special characters. (I have replaced them in the example below.)
Create a case class to represent the data:
case class Data(
name: String, year: String, month: Int, tmax: Double,
tmin: Double, afdays: Int, rainmm: Double, sunhours: Double
)
Read the file:
import spark.implicits._ // needed for the Dataset encoders used by map

val data = spark.read.textFile("file path") // read as a text file
  .map(_.replace("---", "0").replaceAll("-|#|\\*", "")) // replace special characters
  .map(_.split("\\s+"))
  .map(x => // create a Data object for each record
    Data(x(0), x(1), x(2).toInt, x(3).toDouble, x(4).toDouble,
      x(5).toInt, x(6).toDouble, x(7).replace("l", "").toDouble)
  )
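To see what the per-line cleanup does without a Spark session, the same logic can be run on a single sample line. This is just a sketch; the sample values below are hypothetical, not taken from your file:

```scala
// Same record layout as the case class above
case class Data(
  name: String, year: String, month: Int, tmax: Double,
  tmin: Double, afdays: Int, rainmm: Double, sunhours: Double
)

// A hypothetical raw line where afdays is missing ("---")
val line = "aberporth 1942 1 5.8 2.1 --- 114.0 58.0"

// Replace the missing-value marker, strip special characters,
// then split on whitespace
val cleaned = line.replace("---", "0").replaceAll("-|#|\\*", "")
val x = cleaned.split("\\s+")

val record = Data(x(0), x(1), x(2).toInt, x(3).toDouble,
  x(4).toDouble, x(5).toInt, x(6).toDouble,
  x(7).replace("l", "").toDouble)

println(record) // Data(aberporth,1942,1,5.8,2.1,0,114.0,58.0)
```

Note that `replaceAll("-|#|\\*", "")` also strips the minus sign from negative temperatures, so you may want a more targeted regex if your data contains them.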
Now you have a `Dataset[Data]`, which is the dataset parsed from the text.
Output:
+---------+----+-----+----+----+------+------+--------+
|name |year|month|tmax|tmin|afdays|rainmm|sunhours|
+---------+----+-----+----+----+------+------+--------+
|aberporth|1941|10 |0.0 |0.0 |0 |106.2 |0.0 |
|aberporth|1941|11 |0.0 |0.0 |0 |92.3 |0.0 |
|aberporth|1941|12 |0.0 |0.0 |0 |86.5 |0.0 |
|aberporth|1942|1 |5.8 |2.1 |0 |114.0 |58.0 |
|aberporth|1942|2 |4.2 |0.6 |0 |13.8 |80.3 |
|aberporth|1942|3 |9.7 |3.7 |0 |58.0 |117.9 |
|aberporth|1942|4 |13.1|5.3 |0 |42.5 |200.1 |
|aberporth|1942|5 |14.0|6.9 |0 |101.1 |215.1 |
|aberporth|1942|6 |16.2|9.9 |0 |2.3 |269.3 |
|aberporth|1942|7 |17.4|11.3|12 |70.2 |185.0 |
|aberporth|1942|8 |18.7|12.3|5 |78.5 |141.9 |
|aberporth|1942|9 |16.4|10.7|123 |146.8 |129.1 |
|aberporth|1942|10 |13.1|8.2 |125 |131.1 |82.1 |
+---------+----+-----+----+----+------+------+--------+
I hope this helps!