
Fastest way to check whether a value or a list of values is a subset of a list in Python

cython numpy list python

1

West · Tech Community · 6 years ago

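I have a very large list called main_list that holds about 13 million lists, each with 6 numbers. I am looking for a way to filter out any list that does not contain certain values. For example, to build a new list containing only the lists that include both 4 and 5, my code works like this: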

and_include = []
temp_list=[4,5]
for sett in main_list:
    if set(temp_list).issubset(sett):
        and_include.append(sett)

I'm not very familiar with Cython, but I tried implementing it this way and compiling it, and I got an error.

def andinclude(list main_list,list temp_list):
    and_include=[]
    for sett in main_list:
        if set(temp_list).issubset(sett):
            and_include.append(sett)
    return and_include

Is there a faster way to do this?

1 reply | as of 6 years ago

1

2

sjw 6 years ago

Here is a numpy solution:

import numpy as np

# Randomly generate 2d array of integers
np.random.seed(1)
a = np.random.randint(low=0, high=9, size=(13000000, 6))

# Use numpy indexing to filter rows
results = a[(a == 4).any(axis=1) & (a == 5).any(axis=1)]

Results:

In [35]: print(results.shape)
(3053198, 6)

In [36]: print(results[:5])
[[5 5 4 5 5 1]
 [5 5 4 3 8 6]
 [2 5 8 1 1 4]
 [0 5 4 1 1 5]
 [3 2 5 2 4 6]]

Timings:

In [37]: %timeit results = a[(a == 4).any(axis=1) & (a == 5).any(axis=1)]
923 ms ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you need to convert the result back to a list of lists instead of a 2d numpy array, you can use:

l = results.tolist()

This added roughly 50% to the run time on my machine, but it should still be faster than any solution that loops over Python lists.
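The filter above hard-codes the values 4 and 5. As a rough sketch, the same idea can be extended to an arbitrary list of required values by combining one boolean mask per value. The helper name filter_rows_containing_all and the small demo array are made up for illustration and are not part of the original answer:

```python
import numpy as np

def filter_rows_containing_all(a, values):
    """Return the rows of 2d array `a` that contain every value in `values`.

    Hypothetical helper: it generalizes the two-value mask shown above.
    """
    # Start with every row selected, then require each value in turn.
    mask = np.ones(a.shape[0], dtype=bool)
    for v in values:
        mask &= (a == v).any(axis=1)
    return a[mask]

# Small illustrative array (not the 13-million-row data from the question)
demo = np.array([[1, 4, 5, 0, 0, 0],
                 [4, 4, 4, 4, 4, 4],
                 [5, 2, 3, 4, 8, 8],
                 [1, 2, 3, 6, 7, 8]])
filtered = filter_rows_containing_all(demo, [4, 5])
```

Each pass over the array costs about as much as one (a == v).any(axis=1) term, so this should scale roughly linearly with the number of required values.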

2

0

absolutelydevastated 6 years ago

Store set(temp_list) in a local variable so that set is not called on every iteration of the loop.
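A minimal sketch of that suggestion in plain Python, assuming main_list is a list of lists as in the question. The function name and_include_rows and the sample data are illustrative only:

```python
def and_include_rows(main_list, temp_list):
    """Keep only the rows that contain every value in temp_list."""
    # Build the set once, outside the loop, instead of once per row.
    required = set(temp_list)
    # set.issubset accepts any iterable, so rows can stay plain lists.
    return [row for row in main_list if required.issubset(row)]

# Tiny sample standing in for the real 13-million-row main_list
main_list = [[1, 4, 5, 0, 0, 0],
             [4, 4, 4, 4, 4, 4],
             [5, 2, 3, 4, 8, 8]]
and_include = and_include_rows(main_list, [4, 5])
```

This still loops in Python, so it is unlikely to match the numpy approach on 13 million rows, but it avoids rebuilding the same set for every row.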