I have a DataFrame with two columns:
+---+-------+
| id|  fruit|
+---+-------+
|  0|  apple|
|  1| banana|
|  2|coconut|
|  1| banana|
|  2|coconut|
+---+-------+
I also have a master list containing all the items:
fruitList: Seq[String] = WrappedArray(apple, coconut, banana)
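Now I want to create a new column in the DataFrame that holds an array of 1s and 0s, where 1 marks the item present in that row and 0 marks every item that is not.
Expected output: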
+---+-----------+
| id|  fruitlist|
+---+-----------+
|  0|    [1,0,0]|
|  1|    [0,1,0]|
|  2|    [0,0,1]|
|  1|    [0,1,0]|
|  2|    [0,0,1]|
+---+-----------+
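One way to get the layout above without any ML transformers is to build one `when`/`otherwise` flag per entry of the list and pack the flags with `array`. This is a minimal sketch, not a definitive solution. It assumes a local `SparkSession`, and it sorts the list (my assumption) because `collect_set` gives no ordering guarantee, while the expected output lists the flags in the order apple, banana, coconut. The names `flags` and `result` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, collect_set, when}

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("fruit-flags").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut"),
  (1, "banana"),
  (2, "coconut")
).toDF("id", "fruit")

// collect_set has no guaranteed order, so sort to pin each fruit to a fixed slot.
val fruitList: List[String] =
  df.select(collect_set("fruit")).first().getSeq[String](0).toList.sorted

// One 0/1 column per list entry: 1 when the row's fruit equals that entry, else 0.
val flags = fruitList.map(f => when(col("fruit") === f, 1).otherwise(0))

// Pack the per-fruit flags into a single array column.
val result = df.select(col("id"), array(flags: _*).as("fruitlist"))
result.show(false)
```

With the sample data and the sorted list, `apple` rows hold 1, 0, 0, `banana` rows hold 0, 1, 0, and `coconut` rows hold 0, 0, 1. If the list keeps the unsorted order `collect_set` returned, the slots shift with it.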
This is what I tried:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = spark.createDataFrame(Seq(
(0, "apple"),
(1, "banana"),
(2, "coconut"),
(1, "banana"),
(2, "coconut")
)).toDF("id", "fruit")
df.show
import org.apache.spark.sql.functions._
val fruitList = df.select(collect_set("fruit")).first().getAs[Seq[String]](0)
print(fruitList)
I tried to solve this with OneHotEncoder, but after converting the result to a dense vector it looks like this, which is not what I need:
+---+-------+----------+-------------+---------+
| id|  fruit|fruitIndex|     fruitVec|       vd|
+---+-------+----------+-------------+---------+
|  0|  apple|       2.0|    (2,[],[])|[0.0,0.0]|
|  1| banana|       1.0|(2,[1],[1.0])|[0.0,1.0]|
|  2|coconut|       0.0|(2,[0],[1.0])|[1.0,0.0]|
|  1| banana|       1.0|(2,[1],[1.0])|[0.0,1.0]|
|  2|coconut|       0.0|(2,[0],[1.0])|[1.0,0.0]|
+---+-------+----------+-------------+---------+
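The two-element vectors above, with `apple` encoded as all zeros, are consistent with `OneHotEncoder`'s `dropLast` parameter, which defaults to `true` and drops the last category. Below is a hedged sketch of a comparable pipeline with `setDropLast(false)`. The original encoder code is not shown in the post, so the stage setup here is an assumption, as are the names `toIntArray` and `withFlags`. The stages are wrapped in a `Pipeline` so the sketch should work whether `OneHotEncoder` is a plain transformer (older Spark releases) or an estimator (Spark 3.x).

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("onehot-droplast").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut"),
  (1, "banana"),
  (2, "coconut")
).toDF("id", "fruit")

// StringIndexer assigns indices by descending label frequency by default.
val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruitIndex")

// dropLast = false keeps a slot for every category instead of dropping the last one.
val encoder = new OneHotEncoder()
  .setInputCol("fruitIndex")
  .setOutputCol("fruitVec")
  .setDropLast(false)

val encoded = new Pipeline()
  .setStages(Array[PipelineStage](indexer, encoder))
  .fit(df)
  .transform(df)

// Convert the ML vector into a plain array of ints (hypothetical helper).
val toIntArray = udf((v: Vector) => v.toArray.map(_.toInt))
val withFlags = encoded.withColumn("fruitlist", toIntArray(col("fruitVec")))
withFlags.select("id", "fruit", "fruitIndex", "fruitlist").show(false)
```

Each row now carries a three-element array with exactly one 1. The slot order follows the `StringIndexer` index (most frequent label first by default), not `fruitList`, so `apple`, the least frequent label here, lands in the last slot.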