
Reading a CSV with ASCII characters in PySpark and concatenating rows

abhjt · 7 years ago

    I have a CSV file in the following format -

    id1,"When I think about the short time that we live and relate it to á
    the periods of my life when I think that I did not use this á
    short time."
    id2,"[ On days when I feel close to my partner and other friends.  á
    When I feel at peace with myself and also experience a close á
    contact with people whom I regard greatly.]"
    

    I want to read it in PySpark. My code is -

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("Id", StringType()),
        StructField("Sentence", StringType()),
      ])
    
    df = sqlContext.read.format("com.databricks.spark.csv") \
            .option("header", "false") \
            .option("inferSchema", "false") \
            .option("delimiter", "\"") \
            .schema(schema) \
            .load("mycsv.csv")
    

    But the result I am getting is -

    +--------------------------------------------------------------+-------------------------------------------------------------------+
    | Id                                                           | Sentence                                                           |
    +--------------------------------------------------------------+-------------------------------------------------------------------+
    |id1,                                                          |When I think about the short time that we live and relate it to á  |
    |the periods of my life when I think that I did not use this á |null                                                               |
    |short time.                                                   |"                                                                  |
    

    ...

    I want to read this into two columns, the first containing the Id and the other the Sentence. The sentences should be joined on the ASCII character á; as I can see, it is reading the next line, but it is not picking up the delimiter.

    My output should look like this -

        +----+---------------------------------------------------------------------------------------------------------------------------------------+
        |Id  |Sentence                                                                                                                               |
        +----+---------------------------------------------------------------------------------------------------------------------------------------+
        |id1,|When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time.|
        +----+---------------------------------------------------------------------------------------------------------------------------------------+
    

    In the example I have considered only one id. What changes do I need to make to my code?

    1 Answer
Alper t. Turker · 7 years ago

    If you haven't done so already, just update Spark to 2.2 or later and use the multiLine option; the default comma delimiter and double-quote character then handle the quoted fields that span lines, so the delimiter override is no longer needed:

    df = spark.read \
        .option("header", "false") \
        .option("inferSchema", "false") \
        .schema(schema) \
        .csv("mycsv.csv", multiLine=True)
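
    Each quoted block should now come back as a single record; a quick check on the df defined above:

        df.show(truncate=False)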
    

    If you do, you can then remove the á with regexp_replace:

    from pyspark.sql.functions import regexp_replace

    df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
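
    Note that this deletes only the á itself; with multiLine the newlines and extra spaces that surrounded it stay inside the field. A minimal follow-up sketch (assuming the df read above; df_clean is just an illustrative name) that collapses the marker together with its surrounding whitespace into a single space and trims the result:

        from pyspark.sql.functions import regexp_replace, trim

        # "\s*á\s*" matches the marker plus adjacent spaces/newlines,
        # so the sentence pieces are re-joined with exactly one space
        df_clean = df.withColumn(
            "Sentence",
            trim(regexp_replace("Sentence", "\\s*á\\s*", " "))
        )
        df_clean.show(truncate=False)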