
Reading a CSV with ASCII characters in PySpark and concatenating rows

abhjt · 7 years ago

    I have a CSV file in the following format -

    id1,"When I think about the short time that we live and relate it to á
    the periods of my life when I think that I did not use this á
    short time."
    id2,"[ On days when I feel close to my partner and other friends.  á
    When I feel at peace with myself and also experience a close á
    contact with people whom I regard greatly.]"
    

    I want to read it in PySpark. My code is -

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("Id", StringType()),
        StructField("Sentence", StringType()),
      ])
    
    df = sqlContext.read.format("com.databricks.spark.csv") \
            .option("header", "false") \
            .option("inferSchema", "false") \
            .option("delimiter", "\"") \
            .schema(schema) \
            .load("mycsv.csv")
    

    But the result I am getting is -

    +--------------------------------------------------------------+-------------------------------------------------------------------+
    | Id                                                           | Sentence                                                           |
    +--------------------------------------------------------------+-------------------------------------------------------------------+
    |id1,                                                          |When I think about the short time that we live and relate it to á  |
    |the periods of my life when I think that I did not use this á |null                                                               |
    |short time.                                                   |"                                                                  |
    

    ...

    I want to read this into two columns, the first containing the Id and the other the Sentence. The sentences should be joined on the ASCII character á; as I can see, it is reading the next line, but it is not picking up the delimiter.

    My output should look like this -

        +----+---------------------------------------------------------------------------------------------------------------------------------------+
        |Id  |Sentence                                                                                                                               |
        +----+---------------------------------------------------------------------------------------------------------------------------------------+
        |id1,|When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time.|
        +----+---------------------------------------------------------------------------------------------------------------------------------------+
    

    In the example I have considered only one id. What changes do I need to make to my code?

    1 Answer
Alper t. Turker · 7 years ago

    If you haven't done so already, just update Spark to 2.2 or later and use the multiLine option; the default comma delimiter and double-quote character then handle the quoted fields that span lines, so the delimiter override is no longer needed:

    df = spark.read \
        .option("header", "false") \
        .option("inferSchema", "false") \
        .schema(schema) \
        .csv("mycsv.csv", multiLine=True)
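
    Each quoted block should now come back as a single record; a quick check on the df defined above:

        df.show(truncate=False)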
    

    If you do, you can then remove the á with regexp_replace:

    from pyspark.sql.functions import regexp_replace

    df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
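
    Note that this deletes only the á itself; with multiLine the newlines and extra spaces that surrounded it stay inside the field. A minimal follow-up sketch (assuming the df read above; df_clean is just an illustrative name) that collapses the marker together with its surrounding whitespace into a single space and trims the result:

        from pyspark.sql.functions import regexp_replace, trim

        # "\s*á\s*" matches the marker plus adjacent spaces/newlines,
        # so the sentence pieces are re-joined with exactly one space
        df_clean = df.withColumn(
            "Sentence",
            trim(regexp_replace("Sentence", "\\s*á\\s*", " "))
        )
        df_clean.show(truncate=False)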