
pyspark: add a new column with the row-wise sum when there are more than 255 columns

  Ahmad Senousi  ·  6 years ago

    I need to compute the row-wise sum of roughly 900 columns. I applied the function from this link: Spark - Sum of row values

    from functools import reduce
    from pyspark.sql.functions import udf

    def superSum(*cols):
        return reduce(lambda a, b: a + b, cols)

    add = udf(superSum)

    df.withColumn('total', add(*[df[x] for x in df.columns])).show()
    

    But I get this error:

    Py4JJavaError: An error occurred while calling o1005.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "***********\pyspark\worker.py", line 218, in main
      File "***********\pyspark\worker.py", line 147, in read_udfs
      File "<string>", line 1
    SyntaxError: more than 255 arguments
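
    For context, the SyntaxError above comes from CPython itself, not from Spark: before Python 3.7, a call expression could pass at most 255 explicit arguments, and pyspark's worker generates exactly such a call when it hands one argument per column to the UDF. A minimal sketch that reproduces the limit (assuming Python < 3.7; the restriction was removed in 3.7):

    # Build a call expression with 300 explicit arguments (the function
    # name `f` is hypothetical); on Python < 3.7 this fails to compile.
    src = "f(" + ", ".join(str(i) for i in range(300)) + ")"
    compile(src, "<string>", "eval")  # SyntaxError: more than 255 arguments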
    
    1 Answer  ·  6 years ago

  hamza tuna  ·  6 years ago

    import operator
    from functools import reduce
    import findspark
    findspark.init()  # replace with your spark path if it is not auto-detected
    from pyspark import SparkConf, SparkContext

    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F
    from pyspark.sql import Row

    conf = SparkConf().setMaster("local").setAppName("My App")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # A single-row DataFrame with 300 columns, enough to exceed the
    # 255-argument limit that breaks the UDF approach.
    df = sqlContext.createDataFrame([
        Row(**{str(i): 0 for i in range(300)})
    ])

    # Fold the columns into one native Column expression (col0 + col1 + ...).
    # The addition happens inside the JVM, so no Python call with 300
    # arguments is ever generated.
    df.withColumn('total', reduce(operator.add, map(F.col, df.columns))).show()
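
    Because reduce(operator.add, ...) only builds a single Column expression, the same sum can also be written as a raw SQL expression. A sketch of that alternative, assuming the same df as above (the column names here are digits, so they need backtick quoting in SQL):

    # Hypothetical equivalent: join the column names into one
    # "`0` + `1` + ... + `299`" expression and evaluate it in the JVM.
    total_expr = " + ".join("`{}`".format(c) for c in df.columns)
    df.withColumn('total', F.expr(total_expr)).show()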