
How do I isolate data points that deviate 2 and 3 sigma from the mean, and then mark them in a plot with Python?

standard-deviation curve-fitting scipy python

bhjghjh · Tech Community · 6 years ago

I read data from a data set, plot it in matplotlib, and then use linear regression to obtain a best-fit curve. A sample of the data looks like this:

# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13

My code is as follows:

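import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    # np.polyval is used because the top-level SciPy alias (sp.polyval)
    # is no longer available in recent SciPy releases
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting z
# M and Y_z are columns read from the data set (the loading code is not shown)
xz, yz = M, Y_z                              # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))          # fit against the log-transformed y

plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transform the prediction back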

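However, I can see a lot of upward scatter in the data, and the best-fit curve is affected by it. So first, I want to separate the data points that lie 2 and 3 sigma away from my mean data and mark them with circles. Then I want to compute the best-fit curve using only the points that fall within 1 sigma of my mean data.

Is there a good function in Python that can do this for me?

Beyond that, can I also isolate those points from the actual data set? For example, if the third row of the sample input is a 2-sigma deviation, can I get that row as output as well, so that I can save it and investigate it further later?

Thank you very much for your help.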

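1 answer | 6 years ago

user115215 6 years ago

Here is some code that steps through the data in a series of windows, computes statistics within each window, and separates the data into a list of well-behaved points and a list of misbehaving points. Hope this helps.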

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

num_data = 10000
fake_data_x = np.sort(12.8+np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0,scale=50000,size=num_data)

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    # np.polyval replaces "from scipy import polyval", which fails on recent SciPy releases
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = fake_data_x, fake_data_y                            # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))      # change here           # transformed input             

plt.figure()
plt.semilogy(xz, yz, marker='o',color ='b', markersize=4,linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label = 'best fit')  # transformed output
plt.show()

num_bin_intervals = 10 # approx. number of data points per averaging window
# the number of boundaries is len(x)/num_bin_intervals, i.e. about 1000 here
window_boundaries = np.linspace(min(fake_data_x),max(fake_data_x),int(len(fake_data_x)/num_bin_intervals)) # window boundaries
y_good = [] # list to collect the "well-behaved" y-axis data
x_good = [] # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []

for i in range(len(window_boundaries)-1):

    # create a boolean mask to select the data within the averaging window
    # (the interval is open on the left, so the smallest x value is never selected)
    window_indices = (fake_data_x<=window_boundaries[i+1]) & (fake_data_x>window_boundaries[i])
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]

    # calculate the mean and standard deviation of y in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)

    # flag points that lie 2 or more standard deviations from the window mean
    is_outlier = np.abs(fake_data_y_slice-y_mean)>=2*y_std

    # choose and select the outliers
    y_outliers = fake_data_y_slice[is_outlier]
    x_outliers = fake_data_x_slice[is_outlier]

    # choose and select the good ones
    y_goodies = fake_data_y_slice[~is_outlier]
    x_goodies = fake_data_x_slice[~is_outlier]

    # extend the lists with all the good and the bad
    y_good.extend(list(y_goodies))
    y_outlier.extend(list(y_outliers))
    x_good.extend(list(x_goodies))
    x_outlier.extend(list(x_outliers))

plt.figure()
plt.semilogy(x_good,y_good,'o')
plt.semilogy(x_outlier,y_outlier,'r*')
plt.show()
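
As a complement to the windowed approach above, here is a minimal sketch of another reading of the question: measure each point's distance from the fitted line in units of the residual standard deviation, flag the 2- and 3-sigma points, refit using only the points within 1 sigma, and pull the flagged rows out of the table. It assumes that "sigma" means the standard deviation of the residuals in log space. The synthetic M and Y arrays, the injected outliers, and the names sigma_classify and plot_flags are illustrative and do not come from the original post.

```python
import numpy as np

def sigma_classify(x, y):
    """Fit log(y) = b1*x + b0 and return the fit plus |residual| in sigma units."""
    log_y = np.log(y)
    b1, b0 = np.polyfit(x, log_y, 1)           # straight-line fit in log space
    resid = log_y - (b1 * x + b0)
    n_sigma = np.abs(resid) / np.std(resid)    # distance from the fit, in sigma
    return (b1, b0), n_sigma

def plot_flags(x, y, n_sigma, fit):
    """Plot the data, circle the 2-3 sigma and >=3 sigma points, draw the fit."""
    import matplotlib.pyplot as plt            # only needed for plotting
    b1, b0 = fit
    two = (n_sigma >= 2) & (n_sigma < 3)
    three = n_sigma >= 3
    plt.semilogy(x, y, "b.", markersize=4, label="data")
    plt.semilogy(x[two], y[two], "o", markerfacecolor="none",
                 markeredgecolor="orange", markersize=10, label="2-3 sigma")
    plt.semilogy(x[three], y[three], "o", markerfacecolor="none",
                 markeredgecolor="red", markersize=14, label=">= 3 sigma")
    plt.semilogy(x, np.exp(b1 * x + b0), "k-", label="fit")
    plt.legend()
    plt.show()

# Hypothetical stand-in for the real table: one x column (M) and one y column (Y)
rng = np.random.default_rng(0)
M = np.linspace(380.0, 480.0, 200)
Y = np.exp(0.01 * M - 15.0 + rng.normal(0.0, 0.05, M.size))
Y[[20, 90, 150]] *= 5.0                        # inject three upward outliers
table = np.column_stack([M, Y])                # rows stay aligned with the masks

fit, n_sigma = sigma_classify(M, Y)
within_1 = n_sigma < 1
between_2_3 = (n_sigma >= 2) & (n_sigma < 3)
beyond_3 = n_sigma >= 3

# Refit using only the points within 1 sigma of the first fit
b1_clean, b0_clean = np.polyfit(M[within_1], np.log(Y[within_1]), 1)

# Full rows of every point at 2 sigma or more, ready to be saved or inspected
flagged_rows = table[n_sigma >= 2]
```

Calling plot_flags(M, Y, n_sigma, (b1_clean, b0_clean)) draws the data with open circles around the flagged points and the refitted line on top. Because flagged_rows keeps whole rows of the table, it can be written to a file with np.savetxt for later investigation. Outliers inflate the residual standard deviation, so repeating the classify-and-refit step a few times may give a tighter cut.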