
Decoding UTF-8 literals in a CSV file

pandas python-3.x python

Nazim Kerimbekov Gusev Slava · Tech Community · 6 years ago

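Question:

Does anyone know how I can turn this b"it\\xe2\\x80\\x99s time to eat" into this it's time to eat?

More details and my code:

Hello everyone,

I'm currently working with a CSV file that is full of rows containing UTF-8 literals, for example:

b"it\xe2\x80\x99s time to eat"

The end goal is to get something like this:

it's time to eat

To achieve this, I tried the following code: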

import pandas as pd


file_open = pd.read_csv("/Users/Downloads/tweets.csv")

file_open["text"]=file_open["text"].str.replace("b\'", "")

file_open["text"]=file_open["text"].str.encode('ascii').astype(str)

file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]

print(file_open["text"])

After running the code, the row I took as an example is printed as:

it\xe2\x80\x99s time to eat

I tried to solve this by opening the CSV file with the following code:

file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")

which printed the example row as follows:

b"it\xe2\x80\x99s time to eat"

I also tried decoding the rows with this:

file_open["text"]=file_open["text"].str.decode('utf-8')

which gave me the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Thank you very much for your help.

1 answer | as of 6 years ago

jedwards · 6 years ago

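b"it\\xe2\\x80\\x99s time to eat" suggests that your file contains an escaped encoding: the backslash sequences are stored as literal text rather than as the bytes they stand for.

In general, you can convert this to a proper Python 3 string with something like: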

x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)     # it’s time to eat

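(.encode('latin1') is used because Latin-1 maps the code points 0 to 255 one-to-one onto byte values, so it recovers the original raw bytes from the unescaped string.)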
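The three-step chain above can be unpacked into a small helper to make each stage visible. This is a minimal sketch, not part of the original answer; the function name `unescape_utf8_literal` is a hypothetical name chosen for illustration.

```python
def unescape_utf8_literal(raw: bytes) -> str:
    """Turn bytes that hold literal backslash-x escapes into the text they encode.

    Hypothetical helper for illustration only.
    """
    # Step 1: interpret the backslash escapes; each \xNN becomes code point U+00NN.
    escaped = raw.decode("unicode-escape")
    # Step 2: Latin-1 maps U+0000..U+00FF onto identical byte values,
    # so this recovers the raw bytes the escapes described.
    utf8_bytes = escaped.encode("latin1")
    # Step 3: decode those bytes as the UTF-8 they represent.
    return utf8_bytes.decode("utf8")


sample = b"it\\xe2\\x80\\x99s time to eat"
result = unescape_utf8_literal(sample)
print(result)
```

Text without any escapes passes through unchanged, so the helper can be applied to mixed data as long as every value is ASCII apart from the escapes.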

So, if you use pd.read_csv(..., encoding="utf8") and still end up with escaped strings, you can do something like this:

pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
#    itâs time to eat   (the â is followed by two invisible control characters)
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val)   # it’s time to eat
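To apply the same round trip to a whole column instead of a single `val`, something like the following might work. It is a rough sketch under assumptions: the column name `text` comes from the question, the DataFrame is built inline to stand in for the result of `read_csv(..., encoding="unicode-escape")`, and every value is assumed to contain only characters in the Latin-1 range.

```python
import pandas as pd

# Stand-in for a frame read with encoding="unicode-escape":
# the UTF-8 bytes E2 80 99 have each become a separate character.
df = pd.DataFrame({"text": ["it\u00e2\u0080\u0099s time to eat", "plain"]})

# Encode every string back to its raw bytes, then decode those bytes as UTF-8.
df["text"] = df["text"].str.encode("latin1").str.decode("utf8")

print(df["text"].tolist())
```

`Series.str.encode` and `Series.str.decode` work element by element, so a value that cannot be encoded as Latin-1 or decoded as UTF-8 would raise and needs handling separately.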

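But I think it's better to do this once for the whole file rather than for each value individually, for example using StringIO (if the file isn't too large):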

from io import StringIO

import pandas as pd

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)    # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")
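The snippets above decode the escapes but leave any surrounding `b"` and `"` in the cell text. If each cell is a complete Python bytes literal, as in the question's example, one alternative (not part of the answer above) is to parse it with `ast.literal_eval` and decode the resulting bytes. This is a hedged sketch; `decode_bytes_literal` is a hypothetical helper name, and cells that are not valid UTF-8 bytes literals are returned unchanged.

```python
import ast


def decode_bytes_literal(cell: str) -> str:
    """Decode text such as b"it\\xe2\\x80\\x99s" (a printed bytes literal) to a str.

    Hypothetical helper for illustration; returns the cell unchanged when it is
    not a bytes literal holding valid UTF-8.
    """
    try:
        value = ast.literal_eval(cell)
        if isinstance(value, bytes):
            return value.decode("utf8")
    except (ValueError, SyntaxError):
        # Not a Python literal, or the bytes are not valid UTF-8.
        pass
    return cell


cell = 'b"it\\xe2\\x80\\x99s time to eat"'
decoded = decode_bytes_literal(cell)
print(decoded)
```

With pandas this could be applied per cell, for example via `df["text"].map(decode_bytes_literal)`, assuming the column holds strings.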
