代码之家 › 专栏 › 技术社区 › drjz

在python中使用list和dictionary通过数值比较优化循环的效率

python

drjz · 技术社区 · 6 年前

我有一个整数列表: candidates = [1, 2 ,3, 4 , 5, 16, 20] . 此列表可以包含100万个项目。

我有一本字典 number_ranges 它的键是一个整数,列表是一个包含最小和最大范围的对象的值。这本词典现在大约有50万个键。

{
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
}

我现在正在循环浏览列表:

for candidate in candidates:
    number = search_in_range(candidate, number_ranges)

在这里我检查是否有一个匹配的数字 candidates 范围 数字范围 ,如果是这样,我将返回将在以后使用的密钥。

def search_in_range(candidate, number_ranges):
    for number_range_key in number_ranges:
        for number in number_ranges[number_range_key]:
            if int(number['start']) <= candidate <= int(number['end']):
                return {"key": number_range_key, "candidate": candidate}

当我运行这个时,我看到处理列表中的1000个数字大约需要40秒。这意味着如果我有100万个数字,我需要11个多小时来处理。

('2018-12-19 16:22:47', 'Read', 1000)
('2018-12-19 16:23:30', 'Read', 2000)
('2018-12-19 16:24:10', 'Read', 3000)
('2018-12-19 16:24:46', 'Read', 4000)
('2018-12-19 16:25:26', 'Read', 5000)
('2018-12-19 16:25:59', 'Read', 6000)
('2018-12-19 16:26:39', 'Read', 7000)
('2018-12-19 16:27:28', 'Read', 8000)
('2018-12-19 16:28:15', 'Read', 9000)
('2018-12-19 16:28:57', 'Read', 10000)

预期的输出正在从返回键 数字范围 在范围和 candidate 用于查找该键的数字,即 return {"key": number_range_key, "candidate": candidate} 在功能中 search_in_range .

在python中,推荐什么方法来优化该算法?

3 回复 | 直到 6 年前

tobias_k 6 年前

你的清单 candidates 是排序的,反之亦然:循环字典 number_ranges 和使用 bisect 二元搜索匹配的候选人。这将从以下方面降低复杂性: O(n*m) 到 O(n*logm*k) 对于 n 词典, m 候选人,以及 k 平均匹配候选人。

(注:我更改了您的格式 数字范围 从A set 属于 dict 每个元素只有一个元素 口述 ,这更有意义。)

candidates = [1, 2, 3, 4, 5, 16, 20]
number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]
}

import bisect

for key, values in number_ranges.items():
    for value in values:
        start, end = value["start"], value["end"]
        lower = bisect.bisect_left(candidates, start)
        upper = bisect.bisect_right(candidates, end)
        for cand in range(lower, upper):
            res = {"key": key, "candidate": candidates[cand]}
            print(value, res)

输出:

{'start': 0, 'end': 9} {'key': 5, 'candidate': 1}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 2}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 3}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 4}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 5}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 16}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 20}
{'start': 16, 'end': 18} {'key': 16, 'candidate': 16}

如果 候选人 实际中没有排序,或者如果您希望结果按候选项而不是按字典排序,则可以作为预处理或后处理步骤进行排序。

Stuart Buckingham 6 年前

通过一点重组,代码就变成了一个经典的间隔树问题。

看看这个包裹 https://pypi.org/project/intervaltree/

与正常间隔树的唯一区别是,您有一些覆盖多个间隔的项,但是很容易将它们划分为单个间隔,例如16.1:“Start”:15,“End”:20,16.2:“Start”:16,“End”:18

通过使用intervaltree包,创建了一个比嵌套for循环更有效的平衡二叉搜索树。这个解决方案是搜索每个候选对象的O(logn),而for循环是O(n)。如果有1毫米以上的候选者,intervaltree包将比公认的嵌套for循环答案快得多。

Paritosh Singh 6 年前

尽管这个问题有一个公认的答案,但为了其他人的利益,我想补充一点,这种类型的场景确实有理由创建反向查找。这是一个一次性的头痛,将节省大量的实际时间,因为候选人名单增长更长。字典查找是O(1),如果需要执行多个查找,还应考虑创建反向映射。

number_ranges = [
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
]

from collections import defaultdict

reversed_number_ranges = defaultdict(set) #returns an empty set, avoiding key errors.


for number in number_ranges:
    for k,v in number.items(): 
        ranges = set() #create a set of values which fall within range
        for range_dict in v:
            ranges.update(range(range_dict["start"], range_dict["end"] + 1)) #assuming "end" is included. remove the +1 for right exclusive.
        for i in ranges:
            reversed_number_ranges[i].add(k) #add the key for each location in a range.


candidates = [1, 2 ,3, 4 , 5, 16, 20]

for candidate in candidates:
    print(candidate, reversed_number_ranges[candidate])

输出:

1 {5}
2 {5}
3 {5}
4 {5}
5 {5}
16 {16}
20 {16}