
How to convert a list of dictionaries into a Spark DataFrame

  •  3
  • Markus  · Tech Community  · 6 years ago

    I want to convert a list of dictionaries into a DataFrame. Here is the list:

    mylist = 
    [
      {"type_activity_id":1,"type_activity_name":"xxx"},
      {"type_activity_id":2,"type_activity_name":"yyy"},
      {"type_activity_id":3,"type_activity_name":"zzz"}
    ]
    

    Here is my code:

    from pyspark.sql.types import StringType
    
    df = spark.createDataFrame(mylist, StringType())
    
    df.show(2,False)
    
    +-------------------------------------------+
    |                                      value|
    +-------------------------------------------+
    |{type_activity_id=1,type_activity_name=xxx}|
    |{type_activity_id=2,type_activity_name=yyy}|
    |{type_activity_id=3,type_activity_name=zzz}|
    +-------------------------------------------+
    

    I assume I should provide some mapping and types for each column, but I don't know how to do that.

    I also tried this:

    from pyspark.sql.functions import from_json
    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)

    schema = ArrayType(
        StructType([StructField("type_activity_id", IntegerType()),
                    StructField("type_activity_name", StringType())
                    ]))
    df = spark.createDataFrame(mylist, StringType())
    df = df.withColumn("value", from_json(df.value, schema))
    

    But then I get null values:

    +-----+
    |value|
    +-----+
    | null|
    | null|
    +-----+
    
    4 replies  |  as of 5 years ago
        1
  •  8
  •   Arvind    5 years ago

    You can do it like this. You will get a DataFrame with 2 columns.

    mylist = [
      {"type_activity_id":1,"type_activity_name":"xxx"},
      {"type_activity_id":2,"type_activity_name":"yyy"},
      {"type_activity_id":3,"type_activity_name":"zzz"}
    ]
    
    myJson = sc.parallelize(mylist)
    myDf = sqlContext.read.json(myJson)
    

    Output:

    +----------------+------------------+
    |type_activity_id|type_activity_name|
    +----------------+------------------+
    |               1|               xxx|
    |               2|               yyy|
    |               3|               zzz|
    +----------------+------------------+
    
        2
  •  22
  •   pault Tanjin    6 years ago

    You can pass the dictionaries directly to spark.createDataFrame() , but doing so is now deprecated:

    mylist = [
      {"type_activity_id":1,"type_activity_name":"xxx"},
      {"type_activity_id":2,"type_activity_name":"yyy"},
      {"type_activity_id":3,"type_activity_name":"zzz"}
    ]
    df = spark.createDataFrame(mylist)
    #UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
    #  warnings.warn("inferring schema from dict is deprecated,"
    

    As the warning message says, you should use pyspark.sql.Row instead:

    from pyspark.sql import Row
    spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
    #+----------------+------------------+
    #|type_activity_id|type_activity_name|
    #+----------------+------------------+
    #|1               |xxx               |
    #|2               |yyy               |
    #|3               |zzz               |
    #+----------------+------------------+
    

    Here I used ** ( keyword argument unpacking ) to pass the dictionaries to the Row constructor.

        3
  •  1
  •   anvy elizabeth    5 years ago

    In Spark version 2.4, you can use df = spark.createDataFrame(mylist) directly:

    >>> mylist = [
    ...   {"type_activity_id":1,"type_activity_name":"xxx"},
    ...   {"type_activity_id":2,"type_activity_name":"yyy"},
    ...   {"type_activity_id":3,"type_activity_name":"zzz"}
    ... ]
    >>> df1=spark.createDataFrame(mylist)
    >>> df1.show()
    +----------------+------------------+
    |type_activity_id|type_activity_name|
    +----------------+------------------+
    |               1|               xxx|
    |               2|               yyy|
    |               3|               zzz|
    +----------------+------------------+
    
        4
  •  0
  •   Athar    4 years ago

    I also faced the same issue when creating a dataframe from a list of dictionaries. I resolved it using namedtuple .

    Below is my code using the provided data.

    from collections import namedtuple
    final_list = []
    mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
              {"type_activity_id":2,"type_activity_name":"yyy"}, 
              {"type_activity_id":3,"type_activity_name":"zzz"}
             ]
    ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])
    
    for my_dict in mylist:
        namedtupleobj = ExampleTuple(**my_dict)
        final_list.append(namedtupleobj)
    
    sqlContext.createDataFrame(final_list).show(truncate=False)
    

    Output:

    +----------------+------------------+
    |type_activity_id|type_activity_name|
    +----------------+------------------+
    |1               |xxx               |
    |2               |yyy               |
    |3               |zzz               |
    +----------------+------------------+
    

    spark: 2.4.0
    python: 3.6
    

    It is not necessary to have the mylist variable; since it was available, I used it to create the namedtuple objects, but the namedtuple objects can also be created directly.
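
The loop and the final_list accumulator can also be collapsed into a single comprehension; a sketch (pure Python, no Spark needed to build the namedtuples themselves):

```python
from collections import namedtuple

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

ExampleTuple = namedtuple("ExampleTuple", ["type_activity_id", "type_activity_name"])

# A list comprehension replaces the explicit loop and the final_list accumulator;
# the result can be passed to createDataFrame exactly as before.
rows = [ExampleTuple(**d) for d in mylist]
```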