
Getting a range by row in numpy

numpy python

Ofer Sadan · 6 years ago

I have a function that produces an array like this:

my_array = np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)

Which outputs:

array([[0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 2],
       ...,
       [9, 9, 9, 7],
       [9, 9, 9, 8],
       [9, 9, 9, 9]])

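As you can see, converting ints to strings and lists and then back to int is very inefficient, and what I really need are much bigger arrays (bigger ranges). I searched numpy for a more efficient way to generate this array / list, but couldn't find one. The best I have so far is arange, which gives a range from 1…9999 but does not split it into lists.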

Any ideas?

5 answers

Divakar 6 years ago

Here is one based on cartesian_product_broadcasted -

import functools
import numpy as np

def cartesian_product_ranges(shape, out_dtype='int'):
    arrays = [np.arange(s, dtype=out_dtype) for s in shape]
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = functools.reduce(np.multiply, broadcasted[0].shape), \
                                                  len(broadcasted)
    out = np.empty(rows * cols, dtype=out_dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    N = len(shape)
    return np.moveaxis(out.reshape((-1,) + tuple(shape)),0,-1).reshape(-1,N)

Sample run -

In [116]: cartesian_product_ranges([3,2,4])
Out[116]: 
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 2],
       [0, 0, 3],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 2],
       [0, 1, 3],
       [1, 0, 0],
       [1, 0, 1],
       [1, 0, 2],
       [1, 0, 3],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [2, 0, 0],
       [2, 0, 1],
       [2, 0, 2],
       [2, 0, 3],
       [2, 1, 0],
       [2, 1, 1],
       [2, 1, 2],
       [2, 1, 3]])

Run and timings for a 10-ranged array with 4 cols -

In [119]: cartesian_product_ranges([10]*4)
Out[119]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 2],
       ...,
       [9, 9, 9, 7],
       [9, 9, 9, 8],
       [9, 9, 9, 9]])

In [120]: cartesian_product_ranges([10]*4).shape
Out[120]: (10000, 4)

In [121]: %timeit cartesian_product_ranges([10]*4)
10000 loops, best of 3: 105 µs per loop

In [122]: %timeit np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)
100 loops, best of 3: 16.7 ms per loop

In [123]: 16700.0/105
Out[123]: 159.04761904761904

Around 160x speedup!

For a 10-ranged array with 9 columns, we can use the lower-precision uint8 dtype -

In [7]: %timeit cartesian_product_ranges([10]*9, out_dtype=np.uint8)
1 loop, best of 3: 3.36 s per loop
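
For comparison, the same table of index combinations can be sketched with np.indices, which builds one index grid per axis. This is only a compact way to cross-check results, not the function timed above: the helper name index_rows is hypothetical and no speed claim is made for it.

```python
import numpy as np

def index_rows(shape, out_dtype=int):
    # Hypothetical helper, not part of the answer above.
    # np.indices returns an array of shape (len(shape), *shape):
    # one grid per axis, holding that axis's index at every position.
    grids = np.indices(tuple(shape), dtype=out_dtype)
    # Flatten every grid in C order, then put one grid in each column.
    return grids.reshape(len(shape), -1).T

rows = index_rows([3, 2, 4])
print(rows.shape)  # (24, 3)
print(rows[5])     # [0 1 1]
```

Because the grids are flattened in C order, the last column varies fastest, which matches the row order shown in the sample run.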

dennlinger 6 years ago

You can use itertools.product for this. Simply pass range(10) as the argument, and the number of digits you want as the repeat argument.

Conveniently, the itertools iterator returns the elements in sorted order, so you do not have to perform a second sorting step yourself.
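
A minimal sketch of that call on a smaller case, assuming 3 digits instead of the 4 used in the question (the variable names are illustrative):

```python
import itertools
import numpy as np

n_digits = 3  # illustrative value; the question uses 4

# product(range(10), repeat=n_digits) yields tuples in lexicographic order,
# so row i of the resulting array holds the digits of i.
combos = np.array(list(itertools.product(range(10), repeat=n_digits)))

print(combos.shape)  # (1000, 3)
print(combos[123])   # [1 2 3]
```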

Here is an evaluation of my code:

import timeit


if __name__ == "__main__":
    # time run: 14.20635
    print(timeit.timeit("np.array([list(str(i).zfill(4)) for i in range(10000)], dtype=int)",
                  "import numpy as np",
                  number=1000))

    # time run: 5.00319
    print(timeit.timeit("np.array(list(itertools.product(range(10), repeat=4)))",
                        "import itertools; import numpy as np",
                        number=1000))

Joe 6 years ago

I would solve this with a combination of np.tile and np.repeat, and then assemble the columns with np.column_stack.

This pure numpy solution then becomes almost a one-liner:

n = 10000

x = np.arange(10)

a = [np.tile(np.repeat(x, 10 ** k), n // (10 ** (k+1))) for k in range(int(np.log10(n)))]

y = np.column_stack(a[::-1]) # flip the list, first entry is rightmost column

A more verbose version can be written like this to see what is happening.

n = 10000

x = np.arange(10)

x0 = np.tile(np.repeat(x, 1), n // 10)
x1 = np.tile(np.repeat(x, 10), n // 100)
x2 = np.tile(np.repeat(x, 100), n // 1000)

Now replace the numbers with exponents and use log10 to get the number of columns.
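
The generalization described above could be wrapped in a function along these lines. This is a sketch under assumptions: the name digit_rows and the base parameter are hypothetical, and the number of digits is passed directly instead of being derived with log10, which sidesteps any floating-point rounding.

```python
import numpy as np

def digit_rows(n_digits, base=10):
    # Hypothetical helper illustrating the tile/repeat idea.
    x = np.arange(base)
    total = base ** n_digits
    # Column k (counted from the right) repeats each digit base**k times,
    # and that block is tiled until the column has `total` entries.
    # Integer division keeps the tile count an int under Python 3.
    cols = [np.tile(np.repeat(x, base ** k), total // base ** (k + 1))
            for k in range(n_digits)]
    # cols[0] is the rightmost column, so reverse before stacking.
    return np.column_stack(cols[::-1])

y = digit_rows(4)
print(y.shape)   # (10000, 4)
print(y[1234])   # [1 2 3 4]
```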

Speed test:

import timeit

s = """
n = 10000
x = np.arange(10)
a = [np.tile(np.repeat(x, 10 ** k), n // (10 ** (k+1))) for k in range(int(np.log10(n)))]
y = np.column_stack(a[::-1])
"""
n_runs = 100000
t = timeit.timeit(s,
                  "import numpy as np",
                  number=n_runs)

print(t, t/n_runs)

Around 260 µs on my slow machine (7 years old).

kuppern87 6 years ago

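A fast solution is to use np.meshgrid to create all the columns. Then sort the columns on, for instance, element 123 or 1234 so that they are in the right order. Then just make an array out of them.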

n_digits = 4
digits = np.arange(10)
columns = [c.ravel() for c in np.meshgrid(*[digits]*n_digits)]
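# list.sort() works in place and returns None, so its result is not assigned;
# the key is each column's entry at flat index int("0123") == 123
columns.sort(key=lambda x: x[int("".join(str(d) for d in range(n_digits)))])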
out_array = np.array(columns).T
np.all(out_array==my_array)
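
The sort step is needed because np.meshgrid defaults to indexing='xy', which swaps the first two axes. As a hedged variant rather than the code above, passing indexing='ij' keeps the axes in argument order, so the grids can be stacked directly:

```python
import numpy as np

n_digits = 4
digits = np.arange(10)

# With indexing="ij", grid k varies along axis k, so no reordering is needed.
grids = np.meshgrid(*[digits] * n_digits, indexing="ij")

# Stack the grids along a new last axis, then flatten to one row per combination.
out = np.stack(grids, axis=-1).reshape(-1, n_digits)

print(out.shape)   # (10000, 4)
print(out[1234])   # [1 2 3 4]
```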

Joe 6 years ago

There are other one-liners that solve this as well

import numpy as np
y = np.array([index for index in np.ndindex(10, 10, 10, 10)])

This seems to be much slower.

Or

import numpy as np
from sklearn.utils.extmath import cartesian

x = np.arange(10)
y = cartesian((x, x, x, x))

This seems to be slightly slower than the accepted answer.