Inconsistency with `sys.getsizeof`

pandas python-3.x string python

Brad Solomon · 6 years ago

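Why does sys.getsizeof() report a larger size for a Python str of length 1 than for a string of length 2? (For lengths > 2, the size seems to increase monotonically, as expected.)

Example: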

>>> from string import ascii_lowercase
>>> import sys

>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> strings
['a',
 'ab',
 'abc',
 'abcd',
 'abcde',
 'abcdef',
 'abcdefg',
 # ...

>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58,   # <--- ??
 2: 51,
 3: 52,
 4: 53,
 5: 54,
 6: 55,
 7: 56,
 8: 57,
 9: 58,
 10: 59,
 11: 60,
 12: 61,
 13: 62,
 14: 63,
 15: 64,
 16: 65,
 # ...

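It seems this has something to do with str.__sizeof__, but I don't know C nearly well enough to dig into what is going on in this case.

3 answers | most recent 6 years ago

Brad Solomon 5 years ago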

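When you import pandas, it does a whole lot of work, including calling UNICODE_setitem on all of the single-ASCII-letter strings, and presumably doing something similar elsewhere on the single-ASCII-digit strings.

That NumPy function calls the deprecated C API PyUnicode_AsUnicode.

When you call that in CPython 3.3+, it caches a wchar_t * representation on the string's internal struct, in its wstr member, as the two wchar_t values w'a' and '\0', which takes 8 bytes on a build of Python with a 32-bit wchar_t. And str.__sizeof__ takes that into account.

So all of the interned single-character strings for ASCII letters and digits, but nothing else, end up 8 bytes larger.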

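First, we know that it apparently happens on import pandas (per Brad Solomon's answer). It may also happen on np.set_printoptions(precision=4, threshold=625, edgeitems=10) (miradulo posted, but then deleted, a comment to that effect on ShadowRanger's answer), but definitely not on import numpy.

Second, we know that it happens to 'a', but what about other single-character strings?

To verify the former and test the latter, I ran this code: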

import sys

strings = [chr(i) for i in (0, 10, 17, 32, 34, 47, 48, 57, 58, 64, 65, 90, 91, 96, 97, 102, 103, 122, 123, 130, 0x0222, 0x12345)]

sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import numpy as np
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

np.set_printoptions(precision=4, threshold=625, edgeitems=10)
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import pandas
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

On multiple CPython installations (all of them 64-bit CPython 3.4 or later, on Linux or macOS), I got the same results:

{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 58, '9': 58, ':': 50, '@': 50, 'A': 58, 'Z': 58, '[': 50, '`': 50, 'a': 58, 'f': 58, 'g': 58, 'z': 58, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}

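So import numpy changes nothing, and neither does set_printoptions (presumably why miradulo deleted the comment), but import pandas does.

And it apparently affects the ASCII digits and letters, but nothing else.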
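The same check can be run over the whole ASCII range instead of a hand-picked sample. The following is only a sketch: the helper name inflated_single_chars is made up for illustration, and it assumes CPython, where a compact ASCII string costs one byte per character on top of a fixed header.

```python
import sys


def inflated_single_chars():
    """Return the ASCII single-character strings whose reported size is
    larger than expected for a plain compact-ASCII string of length 1."""
    # Baseline derived from a longer string that is not one of the cached
    # single-character objects: 10 characters vs. 1 character = 9 bytes.
    baseline = sys.getsizeof("abcdefghij") - 9
    return [chr(i) for i in range(128) if sys.getsizeof(chr(i)) > baseline]


inflated = inflated_single_chars()
print("inflated single-character strings:", inflated)
```

Calling the helper once before and once after import pandas in the same session should show which characters picked up the extra bytes.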

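Also, if you change all of the prints to print(sizes.values()), so the strings never get encoded for output, you get the same results. That implies either that this is not about caching the UTF-8, or that it is but the caching always happens even if we don't force it.

The obvious possibility is that whatever Pandas is calling uses one of the legacy PyUnicode APIs to generate single-character strings for all of the ASCII digits and letters. So these strings end up not in compact-ASCII format but in legacy-ready format, right? (For details on what that means, see the comments in the source.)

Nope. Using my code from superhackyinternals, we can see that it is still in compact-ASCII format: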

import ctypes
import sys
from internals import PyUnicodeObject

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))  # from_address() needs the integer address, not the str
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

import pandas
print(sys.getsizeof(s))
s = 'a'
ps = PyUnicodeObject.from_address(id(s))  # from_address() needs the integer address, not the str
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

We can see that Pandas changes the size from 50 to 58, but the fields are still:

<__main__.PyUnicodeObject object at 0x101bbae18> 1 1 1 1 1 1

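In other words, it is 1BYTE_KIND, length 1, mortal-interned, ASCII, compact, and ready.

But if you look at ps.wstr, before Pandas it is a null pointer, while after Pandas it is a pointer to the wchar_t string w"a\0". And str.__sizeof__ takes that wstr size into account.

So the question is, how do you end up with a compact-ASCII string that has a wstr value?

Simple: you call PyUnicode_AsUnicode on it (or one of the other deprecated functions or macros that access the 3.2-style native wchar_t* internal storage). That native internal storage doesn't actually exist in 3.3+. So, for backward compatibility, those calls are handled by creating the storage on the fly, sticking it on the wstr member, and calling the appropriate PyUnicode_AsUCS[24] function to decode into that storage. (Unless you're dealing with a compact string whose kind happens to match the wchar_t width, in which case wstr is just a pointer to the native storage after all.)

You'd expect str.__sizeof__ to include that extra storage, ideally, and from the source, you can see that it does.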
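As a rough illustration of that accounting, here is a small Python model of the size of a compact ASCII string. It is a sketch, not the CPython source: model_sizeof and the two constants are hypothetical names, and the header size is measured from the running interpreter rather than hard-coded.

```python
import ctypes
import sys

WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)  # typically 4 on Linux/macOS, 2 on Windows

# Fixed overhead of a compact ASCII str on this interpreter, derived from a
# 10-character string (10 one-byte characters plus the trailing NUL).
ASCII_HEADER = sys.getsizeof("abcdefghij") - 11


def model_sizeof(length, has_wstr=False):
    """Approximate str.__sizeof__ for a compact ASCII string."""
    size = ASCII_HEADER + length + 1       # characters + NUL terminator
    if has_wstr:
        size += (length + 1) * WCHAR_SIZE  # cached wchar_t copy + its NUL
    return size


print(model_sizeof(1), model_sizeof(1, has_wstr=True))
```

On a build like the one discussed here (48-byte header, 4-byte wchar_t) the two numbers would be 50 and 58; other versions and platforms will differ.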

Let's verify that:

import ctypes
import sys
s = 'a'
print(sys.getsizeof(s))
ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p
print(ctypes.pythonapi.PyUnicode_AsUnicode(s))
print(sys.getsizeof(s))

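Tada, our 50 goes to 58.

So, how do you work out where this gets called?

There are actually a ton of calls to PyUnicode_AsUnicode, to the PyUnicode_AS_UNICODE macro, and to other functions that call them, throughout Pandas and NumPy. So I ran Python in LLDB and attached a breakpoint to PyUnicode_AsUnicode, with a script that skips the hit if the calling stack frame is the same as last time.

The first few calls involve datetime formats. Then there is one with a single letter. And the stack frame is: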

multiarray.cpython-36m-darwin.so`UNICODE_setitem + 296

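And above multiarray it is pure Python all the way up to the import pandas. So if you want to know exactly where Pandas calls this function, you would need to debug in pdb, which I haven't done yet. But I think we have enough information now.

ShadowRanger 6 years ago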

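Python 3.3+'s str is quite a complicated structure, and can end up storing the underlying data in up to three different ways, depending on which APIs have been used with the string and on the code points the string represents. The most common alternate representation is a cached UTF-8 representation, but that only applies to non-ASCII strings, so it doesn't apply here.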
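One part of that dependence on code points is easy to observe from Python: the native storage uses 1, 2, or 4 bytes per character depending on the widest code point in the string. A minimal sketch, assuming CPython 3.6 or later (bytes_per_char is a name made up for illustration):

```python
import sys


def bytes_per_char(ch):
    """Marginal cost of one more copy of `ch` in a freshly built string."""
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


# ASCII, Latin-1, BMP, and astral code points respectively.
for ch in ("a", "\x82", "\u0222", "\U00012345"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

None of these widths explains an 8-byte jump on a one-character ASCII string, which is why the cached wchar_t* representation is the more likely explanation here.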

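In this case, I suspect that the single-character string (which, as an implementation detail, is a singleton) was used in a way that triggered the creation of the legacy wchar_t* representation (extensions using the legacy Py_UNICODE APIs can cause this), and that your Python build uses a four-byte wchar_t, making the string eight bytes bigger than it otherwise would be (four for the a itself, four more for the NUL terminator). The fact that it is a singleton means that even though you might never have triggered such a legacy API call yourself, any extension that retrieved a reference to the singleton would affect everyone by using it with the legacy API.

Personally, I can't reproduce this at all on my Linux 3.6.5 install (the sizes increase smoothly), indicating that no wchar_t representation was created, and on my Windows 3.6.3 install 'a' is only 54 bytes, not 58 (which matches Windows's native two-byte wchar_t). In both cases I'm running under ipython; it's possible that different versions of ipython's dependencies are responsible for your observations and mine being inconsistent.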
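The 54-versus-58 difference follows directly from the platform's wchar_t width, which ctypes can report. This is a quick sketch of that arithmetic only; it does not inspect any string object.

```python
import ctypes

wchar_size = ctypes.sizeof(ctypes.c_wchar)
# One wchar_t for the character itself plus one for the NUL terminator.
extra = 2 * wchar_size
print(f"wchar_t is {wchar_size} bytes; a cached copy of 'a' would add {extra} bytes")
```

With a two-byte wchar_t that is 4 extra bytes (50 to 54), and with a four-byte wchar_t it is 8 (50 to 58).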

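To be clear, this extra cost is fairly immaterial; since the single-character string is a singleton, the incremental cost of using it is really just 4-8 bytes (depending on pointer width). You won't break the bank on memory if a handful of strings end up being used with the legacy APIs.

Brad Solomon 6 years ago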

This seems to be related to a single pandas import in an ipython startup file.

I can also reproduce the behavior in a plain python session:

 ~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from string import ascii_lowercase
>>> import sys
>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 50, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> import pandas as pd
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> pd.__version__
'0.23.2'