How do I "explicitly specify the categories order by passing in a categories argument" when using tuples as index keys in pandas?

  •  3
  • O.rka  · Tech Community  · 6 years ago

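    I've been trying to figure out how to use these tuples as index keys in pandas, but I get an error.

    How can I apply the suggestion from the error message, using pd.Categorical, to fix the error below?

    I know I could convert the keys to strings, but I'm curious what the suggestion in the error message actually means.

    This works perfectly well when I run it with 0.22.0. I've opened a GitHub issue if anyone wants to see how it behaves under 0.22.0.

    I'd like to update my pandas and handle this properly.

    Running with the current pandas, 0.23.4: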
    import sys; sys.version
    # '3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
    import pandas as pd; pd.__version__
    # '0.23.4'
    index = [(('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'gini'), ('max_features', 0.382), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'log2'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 1)), 
(('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 'sqrt'), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', None), ('min_samples_leaf', 8)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 1)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 2)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 3)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 5)), (('criterion', 'entropy'), ('max_features', 0.382), ('min_samples_leaf', 8))]
    len(index)
    # 40
    pd.Index(index)
    Traceback (most recent call last):
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 635, in factorize
        order = uniques.argsort()
    TypeError: '<' not supported between instances of 'NoneType' and 'str'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 451, in safe_sort
        sorter = values.argsort()
    TypeError: '<' not supported between instances of 'NoneType' and 'str'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 345, in __init__
        codes, categories = factorize(values, sort=True)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/util/_decorators.py", line 178, in wrapper
        return func(*args, **kwargs)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/algorithms.py", line 643, in factorize
        assume_unique=True)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 455, in safe_sort
        ordered = sort_mixed(values)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/sorting.py", line 441, in sort_mixed
        nums = np.sort(values[~str_pos])
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 847, in sort
        a.sort(axis=axis, kind=kind, order=order)
    TypeError: '<' not supported between instances of 'NoneType' and 'str'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 449, in __new__
        data, names=name or kwargs.get('names'))
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1330, in from_tuples
        return MultiIndex.from_arrays(arrays, sortorder=sortorder, names=names)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1274, in from_arrays
        labels, levels = _factorize_from_iterables(arrays)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in _factorize_from_iterables
        return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2543, in <listcomp>
        return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
        cat = Categorical(values, ordered=True)
      File "/Users/jespinoz/anaconda/envs/py3_testing/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 351, in __init__
        raise TypeError("'values' is not ordered, please "
    TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument
    
    3 answers  |  as of 6 years ago
        1
  •  3
  •   PMende    6 years ago

    The closest I could get to what you're trying to do is: pd.DataFrame(index, dtype='category').set_index([0, 1, 2]).index

    This returns the following:

    MultiIndex(levels=[[('criterion', 'entropy'), ('criterion', 'gini')], [('max_features', 'log2'), ('max_features', 'sqrt'), ('max_features', None), ('max_features', 0.382)], [('min_samples_leaf', 1), ('min_samples_leaf', 2), ('min_samples_leaf', 3), ('min_samples_leaf', 5), ('min_samples_leaf', 8)]],
           labels=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=[0, 1, 2])
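
    The error message itself points at a different route: the last frame of the traceback shows Categorical(values, ordered=True) refusing to guess an order for values it cannot sort, and the categories argument is how that order can be supplied by hand. A minimal sketch, assuming a shortened, hypothetical set of keys and an arbitrary order chosen purely for illustration (the values are wrapped in pd.Index(..., tupleize_cols=False) so they stay a flat sequence of tuples):

```python
import pandas as pd

# Tuples that Python cannot sort against each other:
# comparing None with a str raises TypeError.
values = pd.Index(
    [
        ("max_features", "log2"),
        ("max_features", None),
        ("max_features", 0.382),
        ("max_features", "log2"),
    ],
    tupleize_cols=False,  # keep a flat Index of tuples, not a MultiIndex
)

# Hypothetical order, chosen only for illustration. Because the order is
# given explicitly, pandas does not need to sort the values itself.
categories = [
    ("max_features", "log2"),
    ("max_features", None),
    ("max_features", 0.382),
]

cat = pd.Categorical(values, categories=categories, ordered=True)

# Each code is the position of the value within `categories`.
codes = cat.codes.tolist()
print(codes)  # [0, 1, 2, 0]
```

    Whether this is preferable to restructuring the keys depends on the use case; it only illustrates what "passing in a categories argument" refers to.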
    
        2
  •  1
  •   JohnE    6 years ago

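    I may be missing the point of what you're trying to do, but you seem to have nested tuples in which the first part of each inner tuple is a column heading. So I think the more obvious approach is to use (a,b,c) as the multi-index values and (x,y,z) as the multi-index names, rather than ((x,a),(y,b),(z,c)) as plain index values.

    In general, pandas can get a bit confused if you put complex data types (tuples, nested tuples, arrays, etc.) into a single column (whether an index column or a regular one) instead of simple data types (float, int, string, etc.). So 99.9% of the time (or more!) it is best not to do things like putting nested tuples into a single index column. In any case, here is how I would do it for your specific example: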

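    # Level names are the first element of each (name, value) pair.
    names = [index[0][j][0] for j in range(3)]
    # Keep only the second element of each pair as the MultiIndex values.
    pd.DataFrame({'x': range(40)},
        index=pd.MultiIndex.from_tuples([(i[0][1], i[1][1], i[2][1]) for i in index],
                                        names=names))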
    

    The first 10 rows of the dataframe (as you can see, it has a 3-level multi-index rather than a simple tuple or string index):

                                              x
    criterion max_features min_samples_leaf    
    gini      log2         1                  0
                           2                  1
                           3                  2
                           5                  3
                           8                  4
              sqrt         1                  5
                           2                  6
                           3                  7
                           5                  8
                           8                  9
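
    A self-contained sketch of the same reshaping, for readers who want to run it without pasting the full list from the question. It assumes the keys were generated from a parameter grid; the grid dict and the use of itertools.product are assumptions made only so the example stands on its own:

```python
import itertools

import pandas as pd

# Hypothetical parameter grid that reproduces the 40 keys from the question.
grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["log2", "sqrt", None, 0.382],
    "min_samples_leaf": [1, 2, 3, 5, 8],
}

# Nested keys of the form ((name, value), (name, value), (name, value)).
index = [tuple(zip(grid, combo)) for combo in itertools.product(*grid.values())]

# Split each key: names become the level names, values the index entries.
names = [name for name, _ in index[0]]
values = [tuple(value for _, value in key) for key in index]

mi = pd.MultiIndex.from_tuples(values, names=names)
df = pd.DataFrame({"x": range(len(index))}, index=mi)
print(df.head(10))
```

    The original nested key can be rebuilt from a row label with tuple(zip(df.index.names, label)) if it is needed later.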
    

    If I try to use the whole tuples instead of just the second element of each pair, I get the same error as you...

    pd.DataFrame({'x':range(40)},  
        pd.MultiIndex.from_tuples( [ (i[0], i[1], i[2])  for i in index ],
                                   names = names ) )
    

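    I think pd.Index() automatically uses from_tuples() if the input is a list of tuples (?). I wrote it this way simply because it's what I'm used to, not because I think it is better.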
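
    The traceback in the question does show pd.Index handing the list of tuples to MultiIndex.from_tuples. A small sketch of that behaviour, using short hypothetical keys rather than the nested ones from the question; tupleize_cols is a pd.Index parameter that switches the conversion off:

```python
import pandas as pd

# Short, hypothetical keys used only to show the two behaviours.
keys = [("criterion", "gini"), ("criterion", "entropy")]

# Default: a list of tuples is turned into a MultiIndex.
as_multi = pd.Index(keys)

# With tupleize_cols=False each tuple stays a single label
# in a flat, object-dtype Index.
as_flat = pd.Index(keys, tupleize_cols=False)

print(type(as_multi).__name__, as_multi.nlevels)
print(type(as_flat).__name__, as_flat.nlevels)
```

    This only shows where the MultiIndex comes from; it is not a claim about how the nested keys in the question behave under every pandas version.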

        3
  •  1
  •   O.rka    6 years ago

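    I wish the error message were more informative. Thanks to the answers above, I worked out what the problem was. I ended up with the following, which works under both versions:

    pandas v0.23.4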

    >>> pd.__version__
    '0.23.4'
    >>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
    >>> dict(index_categorical[0])
    {'criterion': 'gini', 'max_features': 'log2', 'min_samples_leaf': 1}
    

    pandas v0.22.0

    >>> pd.__version__
    '0.22.0'
    >>> index_categorical = pd.Index([*map(frozenset, index)], dtype="category")
    >>> dict(index_categorical[0])
    {'min_samples_leaf': 1, 'criterion': 'gini', 'max_features': 'log2'}