
How do I drop multiple columns from a Spark DataFrame?

spark-dataframe apache-spark csv

Shekhar · Tech Community · 7 years ago

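I have a CSV in which some column headers, along with their corresponding values, are null. How can I drop the columns whose name is null? A sample of the CSV: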

"name"|"age"|"city"|"null"|"null"|"null"
"abcd"|"21" |"7yhj"|"null"|"null"|"null"
"qazx"|"31" |"iuhy"|"null"|"null"|"null"
"foob"|"51" |"barx"|"null"|"null"|"null"

I want to drop every column whose header is null, so that the output DataFrame looks like this:

"name"|"age"|"city"
"abcd"|"21" |"7yhj"
"qazx"|"31" |"iuhy"
"foob"|"51" |"barx"

When I load this CSV in Spark, Spark appends a number to each null column name, as shown below:

"name"|"age"|"city"|"null4"|"null5"|"null6"
"abcd"|"21" |"7yhj"|"null"|"null"|"null"
"qazx"|"31" |"iuhy"|"null"|"null"|"null"
"foob"|"51" |"barx"|"null"|"null"|"null"

Thanks to @MaxU for the answer. My final solution is:

val filePath = "C:\\Users\\shekhar\\spark-trials\\null_column_header_test.csv"

val df = spark.read.format("csv")
  .option("inferSchema", "false")
  .option("header", "true")
  .option("delimiter", "|")
  .load(filePath)

// df.columns.filterNot(...) keeps only the column names that do not start with "null"
// and returns them as an Array[String], one element per column name.
// .map(a => df(a)) then converts each remaining name into a Column object.
val q = df.columns.filterNot(c => c.startsWith("null")).map(a => df(a))

// Select the remaining columns and display the result.
df.select(q: _*).show()

1 answer | as of 7 years ago

MaxU - stand with Ukraine · 7 years ago

IIUC, you can do it this way:

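// In Scala, drop takes the column names as varargs (String*), so the array is expanded with : _*
val cleaned = df.drop(df.columns.filter(_.startsWith("null")): _*)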
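
To try the drop-based approach end to end, here is a minimal sketch, assuming Spark 2.x or later is on the classpath. The object name DropNullColumns, the helpers namesToDrop and dropByPrefix, the local master setting, and the in-memory stand-in for the CSV are all illustrative assumptions, not part of the original question or answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropNullColumns {
  // Hypothetical helper: the column names that start with the given prefix.
  def namesToDrop(columns: Array[String], prefix: String): Array[String] =
    columns.filter(_.startsWith(prefix))

  // Hypothetical helper: drop every column whose name starts with the prefix.
  // The array is expanded with `: _*` because drop expects String* varargs.
  def dropByPrefix(df: DataFrame, prefix: String): DataFrame =
    df.drop(namesToDrop(df.columns, prefix): _*)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-null-columns")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the loaded CSV, using the renamed headers from the question.
    val df = Seq(
      ("abcd", "21", "7yhj", "null", "null", "null"),
      ("qazx", "31", "iuhy", "null", "null", "null"),
      ("foob", "51", "barx", "null", "null", "null")
    ).toDF("name", "age", "city", "null4", "null5", "null6")

    // Only name, age and city should remain after the drop.
    val cleaned = dropByPrefix(df, "null")
    cleaned.show()

    spark.stop()
  }
}
```

One thing to watch for: startsWith("null") would also remove a legitimate column whose name merely begins with those letters (say, a hypothetical nullable_flag). If that is a risk in your data, a stricter test such as _.matches("null\\d*") may be safer.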
