b"it\\xe2\\x80\\x99s time to eat"
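It sounds like your file contains escaped encodings.
In general, you can convert this into a proper Python 3 string with something like the following: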
x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)  # it’s time to eat
(The use of .encode('latin1') is explained here.)
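To see why the chain works, it may help to look at each intermediate value. This is a minimal sketch using the sample bytes from above; the names step1, step2 and step3 are only illustrative:

```python
raw = b"it\\xe2\\x80\\x99s time to eat"

# Step 1: interpret the backslash escapes.
# Each \xNN sequence becomes the single code point U+00NN.
step1 = raw.decode("unicode-escape")

# Step 2: latin1 maps code points U+0000..U+00FF to the identical byte values,
# so encoding with it recovers the original UTF-8 bytes unchanged.
step2 = step1.encode("latin1")

# Step 3: decode those bytes as the UTF-8 they really are.
step3 = step2.decode("utf8")

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

The key point is that latin1 is only used as a lossless bridge from "string of code points below 256" back to bytes; it says nothing about the real encoding of the data.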
So, if you use pd.read_csv(..., encoding="utf8") and you still have escaped strings, you can do something like this:
pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
# itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val)  # it’s time to eat
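If the fix has to be applied value by value, a small helper can make it safer. The following is a sketch only: fix_mojibake is a hypothetical name, and it assumes that anything that is not a string, or that does not round-trip through latin1 and UTF-8, should be left untouched:

```python
def fix_mojibake(val):
    """Re-decode a string whose UTF-8 bytes were read as latin1 code points."""
    if not isinstance(val, str):
        return val  # leave missing values, numbers, etc. as they are
    try:
        return val.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct), so keep the value
        return val


print(fix_mojibake("it\xe2\x80\x99s time to eat"))
```

With pandas, this could then be applied to a text column with something like df['col'].map(fix_mojibake), where 'col' is a placeholder column name.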
But I think it is better to do this on the whole file rather than on each value individually, for example using StringIO (if the file is not too large):
from io import StringIO

import pandas as pd

# Read the csv file into a StringIO object, decoding the escapes line by line
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        # Recover the original UTF-8 bytes, then decode them properly
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)  # Reset file pointer to the beginning
# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")
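As a variation on the same idea, the file can also be read as raw bytes and decoded in one pass instead of line by line. This is a rough sketch under the same "file is not too large" assumption; load_escaped_csv is a hypothetical helper name, and the temporary file only exists to make the example self-contained:

```python
import os
import tempfile
from io import StringIO


def load_escaped_csv(path):
    """Return a StringIO holding the properly decoded text of an escaped file."""
    with open(path, "rb") as f:
        raw = f.read()
    # Same three-step chain as before, applied to the whole file at once
    text = raw.decode("unicode-escape").encode("latin1").decode("utf8")
    return StringIO(text)


# Build a small sample file containing literal backslash escapes
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,text\n1,it\\xe2\\x80\\x99s time to eat\n")

try:
    sio = load_escaped_csv(tmp.name)
    decoded = sio.getvalue()  # does not move the read position
finally:
    os.remove(tmp.name)

print(decoded)
```

The returned StringIO is positioned at the start, so it could be passed straight to pd.read_csv(sio) in the same way as above.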