
How to use a list of Python objects whose representation is Unicode

unicode python

Anurag Uniyal · 15 years ago

I have an object that contains Unicode data, and I want to use that data in its representation, e.g.:

# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")

    __str__ = __repr__ 

a = A()


s1 = u"%s"%a # works
#s2 = u"%s"%[a] # gives unicode decode error
#s3 = u"%s"%unicode([a])  # gives unicode decode error

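Now, even if I return unicode from __repr__, it still gives an error. So the question is: how can I use a list of such objects and create another Unicode string from it?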

Platform details:

"""
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
'Linux-2.6.24-19-generic-i686-with-debian-lenny-sid'
"""

Also, I'm not sure why:

print a # works
print unicode(a) # works
print [a] # works
print unicode([a]) # doesn't work

Answer from the Python group: http://groups.google.com/group/comp.lang.python/browse_thread/thread/bd7ced9e4017d8de/2e0b07c761604137?lnk=gst&q=unicode#2e0b07c761604137

7 replies | up to 15 years ago

Nico 15 years ago

s1 = u"%s"%a # works

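This works because, when handling 'a', it uses its Unicode representation (i.e. the __unicode__ method).

However, when you wrap it in a list, such as '[a]', and then try to put that list into the string, what gets called is unicode([a]) (which for a list is the same as repr), the string representation of the list, and that uses 'repr(a)' to represent your item in its output. This causes a problem because you are passing a 'str' object (a byte string) containing the UTF-8 encoded version of 'a', and when the string formatting tries to embed it in your Unicode string, it tries to convert it back to a Unicode object using the default encoding, i.e. ASCII. Since ASCII has no mapping for the bytes it is trying to convert, it fails.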
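The failure described above can be reproduced without the class. This is only a minimal sketch: it encodes the text to UTF-8 and then decodes the bytes explicitly with the ASCII codec, which stands in for the implicit conversion Python 2 performs when the default encoding is ascii. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u"\xa9au"                  # the copyright sign followed by "au"
encoded = text.encode("utf-8")    # the kind of byte string __repr__ returns in the question

try:
    # Stand-in for the implicit str -> unicode conversion with the ascii default.
    encoded.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the same codec that was used for encoding succeeds.
roundtrip = encoded.decode("utf-8")
print(failed, roundtrip == text)
```

The point is that the bytes themselves are fine; only the choice of codec on the way back is wrong.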

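What you are trying to do has to be done like this: u"%s" % repr([a]).decode('utf-8'), assuming all your elements encode to UTF-8 (or ASCII, which from Unicode's point of view is a subset of UTF-8).

For a better solution (if you still want the string to look like the str of a list), you have to use what was suggested earlier, together with join, like this:

u'[%s]' % u','.join(unicode(x) for x in [a,a])

This won't handle lists that contain lists of A objects, though.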
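One possible way around the nested-list limitation is a small recursive helper. The sketch below is an assumption, not something from the thread: to_text and text_type are hypothetical names, and the class defines both __unicode__ and __str__ only so that the same snippet behaves the same under Python 2 and Python 3.

```python
text_type = type(u"")   # unicode on Python 2, str on Python 3

def to_text(obj):
    """Hypothetical helper: format nested lists using the text form of each item."""
    if isinstance(obj, list):
        return u"[%s]" % u", ".join(to_text(item) for item in obj)
    return text_type(obj)

class A(object):
    def __unicode__(self):      # picked up by unicode() on Python 2
        return u"\xa9au"
    __str__ = __unicode__       # picked up by str() on Python 3

# A list that itself contains a list of A objects.
result = to_text([A(), [A(), A()]])
print(len(result))
```

Like the join-based one-liner, this produces text for display only; empty strings and strings containing commas make the output ambiguous.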

My explanation sounds rather unclear, but I hope you can make sense of it.

saffsd 15 years ago

Try:

s2 = u"%s"%[unicode(a)]

Your main problem is that you are doing more conversions than you expect. Consider the following:

s2 = u"%s"%[a] # gives unicode decode error

From the Python documentation,

    's'     String (converts any python object using str()).
    If the object or format provided is a unicode string, 
    the resulting string will also be unicode.

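When the %s format string is processed, str([a]) is applied. What you have at this point is a byte string (str) containing a sequence of UTF-8 encoded bytes. If you try to print it there is no problem, because the bytes are passed straight through to your terminal and rendered by the terminal.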

>>> x = "%s" % [a]
>>> print x
[©au]

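The problem arises when you try to convert it back to Unicode. Essentially, the function unicode is being called on a string that contains a UTF-8 encoded byte sequence, and that is what causes the ascii codec to fail.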

    >>> u"%s" % x
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
    >>> unicode(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
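The details in that error message can be checked programmatically. As a rough sketch, assuming the byte string is the UTF-8 encoding of the list text shown above, the exception object reports where decoding stopped; data, error and offending are illustrative names.

```python
data = u"[\xa9au]".encode("utf-8")   # bytes equivalent to str([a]) in the answer above

try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# error.start is the index of the first byte the ascii codec rejected.
offending = data[error.start:error.start + 1]
print(error.start, repr(offending))
```

Position 0 is the opening bracket, which is plain ASCII; the first byte of the two-byte UTF-8 sequence for the copyright sign follows it.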

itsadok 15 years ago

First, ask yourself what you are trying to achieve. If all you need is a round-trippable representation of the list, simply do the following:

class A(object):
    def __unicode__(self):
        return u"©au"
    def __repr__(self):
        return repr(unicode(self))
    __str__ = __repr__

>>> A()
u'\xa9au'
>>> [A()]
[u'\xa9au']
>>> u"%s" % [A()]
u"[u'\\xa9au']"
>>> "%s" % [A()]
"[u'\\xa9au']"
>>> print u"%s" % [A()]
[u'\xa9au']

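This is how it is supposed to work. The string representation of a Python list is not something a user should see, so it makes sense for it to contain escaped characters.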
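To illustrate what "round-trippable" means here, the sketch below parses a repr string back into a list with ast.literal_eval. Two assumptions: it uses plain Unicode strings rather than A instances (parsing the output above yields strings, not A objects), and ast.literal_eval requires Python 2.6 or later, which is newer than the 2.5.2 mentioned in the question.

```python
import ast

original = [u"\xa9au", u"", u"a, b"]
as_text = repr(original)                # programmer-facing form, with quotes
restored = ast.literal_eval(as_text)    # parses Python literals only, no arbitrary code
print(restored == original)
```

The quotes in the repr output are what keep the empty string and the string containing a comma distinguishable after parsing.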

Alan Rowarth 15 years ago

If you want to use a list of unicode()-able objects to create a Unicode string, try something like this:

u''.join([unicode(v) for v in [a,a]])

itsadok 15 年前

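Since this question involves a lot of confusing Unicode stuff, I thought I'd offer an analysis of what is going on here.

It all comes down to the implementation of __unicode__ and __repr__ in the builtin list class. Basically, it is equivalent to: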

class list(object):
    def __repr__(self):
        return "[%s]" % ", ".join(repr(e) for e in self.elements)
    def __str__(self):
        return repr(self)
    def __unicode__(self):
        return str(self).decode()

Actually, list doesn't even define the __unicode__ and __str__ methods, which makes sense when you think about it.

When you write:

u"%s" % [a]                          # it expands to
u"%s" % unicode([a])                 # which expands to
u"%s" % repr([a]).decode()           # which expands to
u"%s" % ("[%s]" % repr(a)).decode()  # (simplified a little bit)
u"%s" % ("[%s]" % unicode(a).encode('utf-8')).decode()

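The last line is the expansion of repr(a), using the implementation of __repr__ from the question.

So, as you can see, the object is first encoded in UTF-8, only to be decoded later with the system default encoding, which usually doesn't support all characters.

As some of the other answers mentioned, you can write your own function, or even subclass list, like so: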

class mylist(list):
    def __unicode__(self):
        return u"[%s]" % u", ".join(map(unicode, self))

Note that this format is not round-trippable. It can even be misleading:

>>> unicode(mylist([]))
u'[]'
>>> unicode(mylist(['']))
u'[]'

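Of course, you can write a quote_unicode function to make it round-trippable, but this is the moment to ask yourself what the point is. The unicode and str functions are meant to create a representation of an object that makes sense to a user. For programmers, there is the repr function. Raw lists are not something a user should ever see. That is why the list class does not implement the __unicode__ method.

To get a better idea of what happens when, play with this little class: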

class B(object):
    def __unicode__(self):
        return u"unicode"
    def __repr__(self):
        return "repr"
    def __str__(self):
        return "str"


>>> b = B()
>>> b
repr
>>> [b]
[repr]
>>> unicode(b)
u'unicode'
>>> unicode([b])
u'[repr]'

>>> print b
str
>>> print [b]
[repr]
>>> print unicode(b)
unicode
>>> print unicode([b])
[repr]

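Laurence Gonsalves 15 years ago

__repr__ and __str__ are both supposed to return str objects, at least up to Python 2.6.x. You get the decoding error because repr() tries to convert your result into a str, and that fails.

I believe this has changed in Python 3.x.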
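For comparison, here is a sketch of how a similar class might look on Python 3, where str is the Unicode text type and there is no __unicode__ hook. It assumes Python 3 only; the A(...) repr format is a hypothetical choice made for this example, not something from the thread.

```python
class A:
    def __str__(self):
        return "\xa9au"                # str is Unicode text on Python 3

    def __repr__(self):
        return "A(%r)" % str(self)     # hypothetical repr format for illustration

a = A()
single = "%s" % a       # uses __str__
in_list = "%s" % [a]    # list formatting still uses repr() of each element

# ascii() escapes non-ASCII characters so the output is safe on any terminal.
print(ascii(single))
print(ascii(in_list))
```

No encode or decode step is involved, so there is no codec to fail; the list form still shows the repr of each element rather than its str.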

Unknown 15 years ago

# -*- coding: utf-8 -*-

class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode('ascii', 'replace')

    __str__ = __repr__

a = A()

>>> u"%s" % a
u'\xa9au'
>>> u"%s" % [a]
u'[?au]'
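The 'replace' handler used above is lossy: the copyright sign becomes a question mark. As a brief sketch of the trade-off, the standard codec error handlers can be compared side by side; the variable names are illustrative.

```python
text = u"\xa9au"

replaced = text.encode("ascii", "replace")           # lossy: the sign becomes "?"
escaped = text.encode("ascii", "backslashreplace")   # keeps the code point as an escape
ignored = text.encode("ascii", "ignore")             # drops the character entirely

print(replaced, escaped, ignored)
```

If the output is meant for debugging, 'backslashreplace' may be the more useful choice, since the original character can still be identified from the escape.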