代码之家  ›  专栏  ›  技术社区  ›  Rick SilentGhost

对python 3.7+字典进行排序的最快方法

  •  9
  • Rick SilentGhost  · 技术社区  · 6 年前

    现在 insertion order of Python dictionaries is guaranteed 从python 3.7开始(和 in CPython 3.6 )对字典进行排序的最佳/最快方法是什么(按值和键)?

    最明显的方法可能是:

    by_key = {k: dct[k] for k in sorted(dct.keys())}
    by_value = {k: dct[k] for k in sorted(dct.keys(), key=dct.__getitem__)}
    

    有没有其他更快的方法可以做到这一点?

    注意,这个问题不是重复的,因为以前关于如何分类字典的问题已经过时了(答案基本上是, 你不能;用 collections.OrderedDict 相反 )

    1 回复  |  直到 6 年前
        1
  •  8
  •   wim    6 年前

    tl;dr:CPython 3.7中按键或值(分别)排序的最佳方法:

    {k: d[k] for k in sorted(d)}
    {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
    

    在MacBook上测试 sys.version :

    3.7.0b4 (v3.7.0b4:eb96c37699, May  2 2018, 04:13:13)
    [Clang 6.0 (clang-600.0.57)]
    

    一次设置1000个浮点数:

    >>> import random
    >>> random.seed(123)
    >>> d = {random.random(): random.random() for i in range(1000)}
    

    按键对数字排序(从最佳到最差):

    >>> %timeit {k: d[k] for k in sorted(d)}
    # 296 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: d[k] for k in sorted(d.keys())}
    # 306 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
    # 345 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
    # 359 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
    # 391 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit dict(sorted(d.items()))
    # 409 µs ± 9.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: v for k,v in sorted(d.items())}
    # 420 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
    # 432 µs ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    按值排序数字(从最佳到最差):

    >>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
    # 355 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
    # 375 µs ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
    # 393 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
    # 402 µs ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
    # 404 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
    # 404 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    >>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
    # 480 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    一次设置一个大的字符串听写:

    >>> import random
    >>> from pathlib import Path
    >>> from operator import itemgetter
    >>> random.seed(456)
    >>> words = Path('/usr/share/dict/words').read_text().splitlines()
    >>> random.shuffle(words)
    >>> keys = words.copy()
    >>> random.shuffle(words)
    >>> values = words.copy()
    >>> d = dict(zip(keys, values))
    >>> list(d.items())[:5]
    [('ragman', 'polemoscope'),
     ('fenite', 'anaesthetically'),
     ('pycnidiophore', 'Colubridae'),
     ('propagate', 'premiss'),
     ('postponable', 'Eriglossa')]
    >>> len(d)
    235886
    

    按键对字符串的dict排序:

    >>> %timeit {k: d[k] for k in sorted(d)}
    # 387 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: d[k] for k in sorted(d.keys())}
    # 387 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
    # 461 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
    # 466 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
    # 488 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
    # 536 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit dict(sorted(d.items()))
    # 661 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: v for k,v in sorted(d.items())}
    # 687 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    按值对字符串的dict排序:

    >>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
    # 468 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
    # 473 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
    # 492 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
    # 496 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
    # 533 ms ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
    # 544 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
    # 566 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    注释 :现实数据通常包含已排序序列的长时间运行,Timsort算法可以利用这些序列。如果对dict进行排序取决于您的快速路径,那么建议在得出关于最佳方法的任何结论之前,在您自己的平台上使用您自己的典型数据进行基准测试。我准备了一个评论角色( # )每次都会产生这样的结果:IPython用户可以复制/粘贴整个代码块,以便在自己的平台上重新运行所有测试。