Binned statistics with irregular and alternating bins

binning statistics scipy numpy python

MyCarta · Tech community · 6 years ago

This is a short, complete example of a more complex real-world application.

Libraries used:

import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest

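Data:

I have arrays containing irregular bins defined by their starts and ends, for example (in the real case this format is a given, because it is the output of another process):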

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

I combine them with:

# np.stack needs a sequence, not an iterator, so wrap the zip in list()
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([  0,  89,  93, 178, 184, 272, 277, 363, 368, 458])
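As a side note, the same interleaved edge array can be built without `itertools`. The following is a minimal sketch, assuming the two arrays always have equal length (which holds for the example data); the names `bns_alt` and `is_increasing` are only illustrative. It also checks that the edges are strictly increasing, which `scipy.stats.binned_statistic` needs for an explicit sequence of bin edges.

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Pair each start with its end (shape (5, 2)), then flatten row by row
# so the edges alternate: start, end, start, end, ...
bns_alt = np.column_stack((bin_starts, bin_ends)).ravel()

# Explicit bin edges must increase monotonically
is_increasing = bool(np.all(np.diff(bns_alt) > 0))

print(bns_alt)
print(is_increasing)
```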

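This gives a regularly alternating sequence of long and short intervals, all of irregular length. Here is a sketch of the given long and short intervals:

[Figure: sketch of the alternating long (yellow) and short intervals]

I have a bunch of time-series data, similar to the random data created below: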

# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);

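Goal:

I want to use the sequence of intervals to collect statistics (mean, percentiles, etc.) on the data, but using only the longer intervals, i.e. the yellow ones in the sketch.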

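Assumptions and clarifications:

The edges of the long intervals never overlap; in other words, there is always a short interval between two long ones. Also, the first interval is always a long one.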
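These assumptions can be checked before binning. This is a small sketch that is not part of the original workflow; the names `gaps` and `no_overlap` are made up for illustration.

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# Every long interval should have positive length
positive_length = bool(np.all(bin_ends > bin_starts))

# Width of each short interval: next start minus previous end
gaps = bin_starts[1:] - bin_ends[:-1]

# Long intervals do not overlap if every gap is strictly positive
no_overlap = bool(np.all(gaps > 0))

print(gaps)
print(positive_length, no_overlap)
```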

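Current solution:

One way is to use scipy.stats.binned_statistic over all the intervals, then slice the result to keep only every other interval (i.e. [::2]), like this (reading this SO answer by @ali_m was a great help for some statistics, such as np.percentile):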

ave = scist.binned_statistic(x, y, 
                         statistic = np.nanmean, 
                         bins=bns)[0][::2]

This gives me the desired result:

plt.plot(np.arange(0,5), ave);
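The same slicing pattern extends to percentiles, because `binned_statistic` accepts a callable as well as named statistics such as `'median'`. The sketch below recreates the example data so it runs on its own; the helper `p90` and the choice of the 90th percentile are arbitrary illustrations, not something taken from the linked answer.

```python
import numpy as np
import scipy.stats as scist

bns = np.array([0, 89, 93, 178, 184, 272, 277, 363, 368, 458])

# Same random example data as above
np.random.seed(45)
x = np.arange(0, 460)
y = 5 + np.random.randn(460).cumsum()

def p90(values):
    """90th percentile of the values that fall in one bin."""
    return np.percentile(values, 90)

# Compute over all intervals, then keep only the long ones with [::2]
p90_long = scist.binned_statistic(x, y, statistic=p90, bins=bns)[0][::2]
median_long = scist.binned_statistic(x, y, statistic='median', bins=bns)[0][::2]

print(p90_long)
print(median_long)
```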

Question: is there a better way to do this (using NumPy, SciPy or Pandas)?

1 answer | as of 6 years ago

MyCarta 6 years ago

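I think using IntervalIndex, pd.cut, groupby and agg is a relatively straightforward way to get what you want.

First I would make the DataFrame (not sure if this is the best way to get the data from the NumPy arrays):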

import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y
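An equivalent and somewhat more common way to build the frame is to pass a dict of arrays to the constructor. This is a sketch that recreates the question's example data so it runs on its own.

```python
import numpy as np
import pandas as pd

# Same random example data as in the question
np.random.seed(45)
x = np.arange(0, 460)
y = 5 + np.random.randn(460).cumsum()

# Each dict key becomes a column; the arrays must have equal length
df = pd.DataFrame({'x': x, 'y': y})

print(df.head())
```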

Then you can define the bins as a list of tuples:

bins = list(zip(bin_starts, bin_ends))

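Use the pandas IntervalIndex, which has a from_tuples() method, to create the bins to use later in cut. This is useful because you don't have to rely on slicing the bns array to disentangle the "regularly alternating sequence of long and short intervals"; instead, you explicitly define the bins you are interested in: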

ii = pd.IntervalIndex.from_tuples(bins, closed='both')

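The closed kwarg specifies whether the end points are included in the interval. For example, for the tuple (0, 89) with closed='both', the interval includes both 0 and 89 (as opposed to left, right or neither).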
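To make the four options concrete, here is a small sketch using scalar `pd.Interval` objects, which accept the same `closed` values; membership is tested with the `in` operator.

```python
import pandas as pd

both = pd.Interval(0, 89, closed='both')
left = pd.Interval(0, 89, closed='left')
right = pd.Interval(0, 89, closed='right')
neither = pd.Interval(0, 89, closed='neither')

# For each variant, show whether the two end points belong to the interval
for iv in (both, left, right, neither):
    print(iv, 0 in iv, 89 in iv)
```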

Then use pd.cut() on the DataFrame, which is a way to bin values into intervals. An IntervalIndex object can be passed via the bins kwarg:

df['bin'] = pd.cut(df.x, bins=ii)
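Values of x that fall in one of the short intervals (or beyond the last end point) match no bin, so `pd.cut` labels them NaN, and `groupby` then leaves them out by default. The sketch below rebuilds the bins so it runs on its own; `binned` and `in_gap` are illustrative names.

```python
import numpy as np
import pandas as pd

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
ii = pd.IntervalIndex.from_tuples(list(zip(bin_starts, bin_ends)), closed='both')

x = np.arange(0, 460)
binned = pd.cut(pd.Series(x), bins=ii)

# x values that belong to no long interval get a missing bin label
in_gap = x[binned.isna().to_numpy()]

print(in_gap)
```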

Finally, use df.groupby() and .agg() to get the statistics you want:

df.groupby('bin')['y'].agg(['mean', 'std'])

Output:

                 mean       std
bin                            
[0, 89]     -4.814449  3.915259
[93, 178]   -7.019151  3.912347
[184, 272]   7.223992  5.957779
[277, 363]  15.060402  3.979746
[368, 458]  -0.644127  3.361927
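One thing to keep in mind when comparing the two approaches is that they treat bin edges differently. `scipy.stats.binned_statistic` uses half-open bins (only the last bin includes its right edge), while `closed='both'` includes both end points of every interval. The sketch below counts the points per long interval under each convention. It is only an illustration of the edge handling, and the variable names are made up.

```python
import numpy as np
import pandas as pd
import scipy.stats as scist

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
bns = np.column_stack((bin_starts, bin_ends)).ravel()
x = np.arange(0, 460)

# SciPy: bins are [start, end), except the very last bin, which is [start, end]
scipy_counts = scist.binned_statistic(
    x, x, statistic='count', bins=bns)[0][::2].astype(int)

# pandas: closed='both' makes every bin [start, end]
ii = pd.IntervalIndex.from_tuples(list(zip(bin_starts, bin_ends)), closed='both')
codes = pd.cut(x, bins=ii).codes  # -1 marks values outside every bin
pandas_counts = np.bincount(codes[codes >= 0], minlength=len(ii))

print(scipy_counts)
print(pandas_counts)
```

With these bins, each of the first four long intervals gains one point (its right end) under the pandas convention, so statistics from the two approaches can differ slightly.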