Use the split function instead of regexp_extract. Please check the code and execution times below.
scala> df.show(false)
+--------+
|columna |
+--------+
|1000@Cat|
|1001@Dog|
|1000@Cat|
|1001@Dog|
|1001@Dog|
+--------+
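(For reference, a DataFrame like the one above can be built from a few literal rows. This is just a sketch to make the examples reproducible; only the column name columna comes from the output shown.)

import spark.implicits._

// Sample data matching the output above, in a single string column "columna".
val df = Seq("1000@Cat", "1001@Dog", "1000@Cat", "1001@Dog", "1001@Dog").toDF("columna")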
scala> spark.time(df.withColumn("parsed",split($"columna","@")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000@Cat|Cat |
|1001@Dog|Dog |
|1000@Cat|Cat |
|1001@Dog|Dog |
|1001@Dog|Dog |
+--------+------+
Time taken: 14 ms
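Note that split($"columna","@")(1) yields null (or an error if ANSI mode is enabled in newer Spark versions) for rows that contain no @. If you want to keep the original value for such rows, coalesce is one option; a minimal sketch, not part of the timing comparison above:

import org.apache.spark.sql.functions.{coalesce, col, split}

// Fall back to the original value when the element after "@" is missing (null).
df.withColumn("parsed", coalesce(split(col("columna"), "@")(1), col("columna"))).show(false)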
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\@(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\@(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
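If you do want to stick with a regular expression, the when/otherwise fallback can likely be collapsed into a single regexp_replace, since a non-matching pattern leaves the value unchanged. A sketch assuming a greedy prefix pattern; it was not part of the timing run above:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip everything up to and including "@"; rows without "@" stay as-is.
df.withColumn("ColumnA", regexp_replace(col("columna"), "^.*@", "")).show(false)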
The contains function is used to check for @ in the column values:
scala> spark.time(df.withColumn("parsed",when($"columna".contains("@"), lit(split($"columna","@")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000@Cat|Cat |
|1001@Dog|Dog |
|1000@Cat|Cat |
|1001@Dog|Dog |
|1001@Dog|Dog |
+--------+------+
Time taken: 14 ms
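One difference from the regexp_extract version above: this one falls back to an empty string when the column has no @, while the regexp_extract version falls back to the original value. If you prefer the latter behaviour, just change the otherwise branch, for example:

// Keep the original value instead of an empty string when there is no "@".
df.withColumn("parsed", when($"columna".contains("@"), split($"columna", "@")(1)).otherwise($"columna")).show(false)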