
What causes the pandas "indexing past lexsort depth" warning?

pandas python

Josh Friedlander · Tech Community · 6 years ago

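I'm indexing a large multi-index pandas df using df.loc[(key1, key2)]. Sometimes I get a Series back (as expected), but other times I get a DataFrame. I'm trying to isolate the cases that cause the latter, but so far all I can see is that it is correlated with getting a PerformanceWarning: indexing past lexsort depth may impact performance warning.

I'd like to reproduce it to post here, but I can't generate another case that gives me the same warning. Here's my attempt: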

import numpy as np
import pandas as pd

def random_dates(start, end, n=10):
    # Timestamp.value is in nanoseconds; convert to whole seconds
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

np.random.seed(0)
df = pd.DataFrame(np.random.random(3255000).reshape(465000,7))  # same shape as my data
df['date'] = random_dates(pd.to_datetime('1990-01-01'), pd.to_datetime('2018-01-01'), 465000)
df = df.set_index([0, 'date'])
df = df.sort_values(by=[3])  # unsort indices, just in case
df.index.lexsort_depth
# 0
df.index.is_monotonic_increasing
# False
df.loc[(0.9987185534991936, pd.to_datetime('2012-04-16 07:04:34'))]
# no warning

So my question is: what causes this warning? How do I artificially induce it?

2 answers | up to 6 years ago

cs95 abhishek58g 6 years ago

I've actually written about this in detail in my post: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.
  # encoding: utf-8

         col
one two     
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.
  self.interact()

         col
one two     
c   u      9
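If you'd rather check this programmatically than watch the console, here is a minimal sketch that records the warning and the type of the result. It rebuilds the same frame as above; `caught` and `perf_warnings` are names made up for this illustration. Note that the index is non-unique (('b', 't') occurs twice), which I believe is part of why a full-key lookup warns here and hands back a DataFrame rather than a Series.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

# Same index as in the example above: ('b', 't') occurs twice, so the
# index is non-unique, and the second level is not sorted within 'b'.
mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is suppressed
    result = df.loc[("c", "u")]

# Keep only the warnings we care about (hypothetical helper name).
perf_warnings = [w for w in caught if issubclass(w.category, PerformanceWarning)]
print(len(perf_warnings), type(result).__name__)
```

On the versions I have looked at, a full key on a unique index seems to take a different lookup path and does not warn, which might be why the attempt in the question stayed silent. Treat that as an assumption to verify on your own pandas version.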

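The docs, backed by this timing test I once did, seem to suggest that handling unsorted indexes imposes a slowdown: indexing is O(N) time when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference: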

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

         col
one two     
c   u      9


%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]

802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True
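One caveat, as far as I know: is_lexsorted() was deprecated as a public method in later pandas releases (1.3, I believe) and is no longer available in 2.x, with is_monotonic_increasing suggested as the replacement. A small sketch of the same check using that property, on the same example frame (rebuilt here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)
df2 = df.sort_index()

# True only if the index tuples are in non-decreasing order.
print(df.index.is_monotonic_increasing)   # False
print(df2.index.is_monotonic_increasing)  # True
```

Duplicate keys are fine for this check, since it only requires a non-decreasing order.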

As for your question on how to induce this behaviour, simply permuting the index should suffice. This works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]

If your index is not unique, add a cumcounted level first,

# set_index returns a new frame, so the result has to be assigned
df2 = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
df2 = df2.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
df2 = df2.reset_index(level=-1, drop=True)
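A possibly simpler alternative, sketched under the assumption that you only need the rows in a random order: shuffle by position with iloc instead of by label. That sidesteps the uniqueness question entirely, so no helper level is needed. The names `rng`, `df_sorted` and `shuffled` are made up for the example.

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

# Start from a properly sorted frame, then scramble the row order.
df_sorted = df.sort_index()
rng = np.random.default_rng(0)  # seeded so the shuffle is repeatable
shuffled = df_sorted.iloc[rng.permutation(len(df_sorted))]

print(df_sorted.index.is_monotonic_increasing)
print(shuffled.index.is_monotonic_increasing)
```

Because the permutation is over integer positions, duplicate index entries such as ('b', 't') are carried along as they are.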

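ndrwnaguib Nikkolai Fernandez 6 years ago

According to pandas advanced indexing (Sorting a MultiIndex):

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex.

And also:

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

Based on that, you may need to make sure your index is sorted properly.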
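To make that concrete, here is a rough sketch that escalates the PerformanceWarning to an exception, which may also help with tracking down which lookups trigger it, and then shows that the same lookup goes through after sort_index(). The frame is the small example from the first answer; `unsorted_warned` and `sorted_result` are illustrative names, not anything from pandas.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

mux = pd.MultiIndex.from_arrays(
    [list("aaaabbbbbccddddd"), list("tuvwtuvwtuvwtuvw")],
    names=["one", "two"],
)
df = pd.DataFrame({"col": np.arange(len(mux))}, index=mux)

with warnings.catch_warnings():
    # Turn only PerformanceWarning into an exception inside this block.
    warnings.simplefilter("error", category=PerformanceWarning)

    try:
        df.loc[("c", "u")]
        unsorted_warned = False
    except PerformanceWarning:
        unsorted_warned = True

    # After sorting, the same lookup should not raise.
    sorted_result = df.sort_index().loc[("c", "u")]

print(unsorted_warned)
print(sorted_result)
```

The filter is scoped to the with block, so the rest of the program keeps its normal warning behaviour.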