
How do I convert a string from cp1251 to UTF-8 in Python 3?

cp1251 utf-8 python-3.x python

Ildar Akhmetov · Tech Community · 6 years ago

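First, my script downloads an HTML file from an old-school server that uses the cp1251 encoding.

Then I need to put the file contents into a UTF-8 encoded string.

Here is what I'm doing: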

import requests
import codecs

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

#checking that it's in cp1251
print(ri.encoding)

#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')

#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')

print(text)

The error is:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte

4 answers | last activity 6 years ago

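Tomalak 6 years ago

I'm not sure what you are trying to do.

.text is the text of the response, a Python string. Encodings play no role in Python strings.

Encodings only matter when you have a stream of bytes that you want to convert to a string (or the other way around). The requests module already does that for you.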

import requests

ri = requests.get('http://old.moluch.ru/_python_test/0.html')
print(ri.text)
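To make the bytes-versus-string point concrete, here is a minimal sketch that needs no network access. The sample text is my own assumption, not content from the page in the question. It shows that the same Python string yields different bytes under cp1251 and UTF-8, and that decoding with the matching codec gives back the same string either way.

```python
# A str holds Unicode code points; an encoding only appears
# when converting to or from bytes.
text = "Привет"  # sample text (assumption, not from the question's page)

cp1251_bytes = text.encode("cp1251")  # str -> bytes, one byte per Cyrillic letter
utf8_bytes = text.encode("utf-8")     # str -> bytes, two bytes per Cyrillic letter

# Decoding with the matching codec recovers the same str in both cases.
from_cp1251 = cp1251_bytes.decode("cp1251")
from_utf8 = utf8_bytes.decode("utf-8")

print(len(text), len(cp1251_bytes), len(utf8_bytes))  # prints: 6 6 12
print(from_cp1251 == from_utf8 == text)               # prints: True
```

In other words, there is no such thing as a "cp1251 string" or a "UTF-8 string" in Python 3, only bytes that were produced with one codec or the other.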

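For example, imagine you have a text file (that is, bytes). You must choose an encoding when you open() the file, and that choice determines how the bytes in the file are converted into characters. This manual step is necessary because open() has no way of knowing how the bytes in the file were encoded.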
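As a rough sketch of the file case, the following writes a small cp1251 file into a temporary directory, reads it back by naming the encoding explicitly, and saves a UTF-8 copy. The file names and sample text are placeholders I made up for illustration.

```python
import os
import tempfile

sample = "Пример текста"  # hypothetical file contents

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "page_cp1251.html")  # hypothetical file names
    dst = os.path.join(tmp, "page_utf8.html")

    # Create a cp1251-encoded file to stand in for the legacy download.
    with open(src, "w", encoding="cp1251") as f:
        f.write(sample)

    # Reading: the encoding argument tells open() how to turn bytes into str.
    with open(src, encoding="cp1251") as f:
        text = f.read()

    # Writing: a different encoding argument turns the same str into UTF-8 bytes.
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes back to confirm the copy really is UTF-8.
    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(text == sample)
print(utf8_bytes == sample.encode("utf-8"))
```

The conversion "from cp1251 to UTF-8" only ever happens at the two boundaries where bytes are read or written; the string in between has no encoding.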

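HTTP, on the other hand, sends the encoding in the response headers (Content-Type), so requests can know this information. As a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you used the lower-level urllib instead, you would have to do the decoding yourself.)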
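If you did have to decode by hand, it might look roughly like the sketch below. It does not touch the network: the body and header value are hard-coded stand-ins for what a server would send. The helper names charset_from_content_type and decode_body are hypothetical, and falling back to UTF-8 when the header has no charset is my assumption, not what requests itself does.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper; the UTF-8 fallback is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))


# Stand-ins for a real response: raw cp1251 bytes plus the header value.
body = "Тест".encode("cp1251")
header = "text/html; charset=windows-1251"

text = decode_body(body, header)
print(charset_from_content_type(header))
print(text)
```

Python accepts windows-1251 as an alias for cp1251, so the charset name from the header can be passed to decode() directly.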

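.encoding tells you which encoding is applied when you use the .text of your response. If you use .raw instead, you get the undecoded bytes and have to handle the encoding yourself.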

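NoorJafri 6 years ago

Since several people have already pointed out that requests.get gives you an already decoded message, I will address the error you are facing right now.

This line:

text = codecs.encode(text,'cp1251')

encodes the text as cp1251 bytes, and this line then tries to decode those bytes as UTF-8, which is where the error comes from:

text = codecs.decode(text,'utf-8')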

To detect which encoding a byte string is in, you can use chardet:

import codecs
import chardet

# text is the str taken from ri.text
encoded = codecs.encode(text, 'cp1251')
chardet.detect(encoded)  # output: {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}

# OR
encoded = codecs.encode(text, 'utf-8')
chardet.detect(encoded)  # output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So encoding with one format and then decoding with another is what causes the error.
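The mismatch can be reproduced without downloading anything. This is a minimal sketch with a sample string of my own choosing: it encodes to cp1251, shows that decoding those bytes as UTF-8 raises UnicodeDecodeError, and then shows that decoding with the same codec that produced the bytes works.

```python
text = "Как дела"  # sample text (assumption)

# str -> bytes using cp1251
cp1251_bytes = text.encode("cp1251")

# Decoding cp1251 bytes as UTF-8 fails, because these byte values
# do not form valid UTF-8 sequences.
try:
    cp1251_bytes.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError as exc:
    mismatch_failed = True
    print("mismatched codecs:", exc.reason)

# Decoding with the codec that produced the bytes succeeds.
round_trip = cp1251_bytes.decode("cp1251")
print(round_trip == text)
```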

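PythonSherpa 6 years ago

“When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.”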

import requests

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

text = ri.text

print(text)

For non-text requests, you can also access the response body as bytes:

ri.content

See the requests documentation.

-1

Hadi Rahjoo 6 years ago

You can simply add a setting to the decode function so that it ignores errors:

text = codecs.decode(text,'utf-8',errors='ignore')
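It may be worth seeing what errors='ignore' does to the data before relying on it. The sketch below uses a sample string of my own (an assumption) and compares the lossy decode with decoding by the matching codec. The exception goes away, but the bytes that are not valid UTF-8 are dropped rather than converted.

```python
import codecs

original = "Код 42"  # sample text (assumption)
cp1251_bytes = codecs.encode(original, "cp1251")

# errors='ignore' suppresses the exception by discarding undecodable bytes,
# so the Cyrillic letters disappear from the result.
lossy = codecs.decode(cp1251_bytes, "utf-8", errors="ignore")

# Decoding with the codec that produced the bytes keeps everything.
correct = codecs.decode(cp1251_bytes, "cp1251")

print(repr(lossy))
print(repr(correct))
```

So this setting hides the mismatch instead of fixing it; only the ASCII part of the text is likely to survive.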