
Reaching a target slope value by eliminating outliers with Python

statsmodels scipy pandas python-3.x python


excelislife · 6 years ago

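I have a dataset from which I may remove at most two data points in order to reach a target slope of 10. My outlier-rejection criterion is this: if the slope is within +/-5% of the target value (10), everything is fine. Anything beyond that, however, should be removed.

The trial dataset is as follows:

As the left-hand side of the image shows, the three slopes are 11.6, 10.5, and 9.4, whereas the target slope is 10.

On the right-hand side of the data, I removed the data points that were skewing the slope upward, i.e., keeping it from reaching the target slope of 10.

This is only a constructed dataset, but the concept is similar to the final dataset I need.

How can I do this in Python? Any help would be greatly appreciated.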

1 Answer

Frayal · 6 years ago

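First, if you already know the slope you want, this can be done in Python, but you need to be careful if you have a lot of data. Second, with a 5% criterion, a slope of 10.5 would not be corrected.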
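To make the 5% criterion concrete, here is a minimal sketch of the acceptance check applied to the three slopes from the question. The helper name `within_tolerance` and the inclusive comparison are assumptions for illustration; the question only says "within +/-5%". Note that 10.5 sits exactly on the boundary, so it passes only because the comparison is inclusive.

```python
def within_tolerance(slope, target=10.0, tol=0.05):
    """Return True if slope deviates from target by at most tol (a fraction)."""
    return abs(slope - target) / abs(target) <= tol


# Slopes quoted in the question; only 10.5 is within 5% of the target.
for s in (11.6, 10.5, 9.4):
    print(s, within_tolerance(s))
```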

The solution you asked for

    # Imports
    import pandas as pd
    from scipy import stats
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    df = pd.read_csv('your_file.csv')
    state = 'USA'
    desire_slope = 10
    # 'x' and 'y' are placeholder column names; replace them with your own.
    # Convert to plain lists so points can be deleted by position below.
    x = df[df['Country'] == state]['x'].tolist()
    y = df[df['Country'] == state]['y'].tolist()

    '''to use for test
    x = [4 + (i / 10) for i in range(100)]
    y = [c * 11 + norm.rvs() * 4 for c in x]
    '''

    # Distance of each point from the line y = desire_slope * x (no intercept)
    z = [abs(v - desire_slope * c) for v, c in zip(y, x)]

    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    print(slope)
    # Relative deviation is measured against the target slope
    if abs(slope - desire_slope) / desire_slope < 0.05:
        print("slope is fine")
    else:
        # Positions of the two points farthest from the target line
        sorted_index_pos = [index for index, num in sorted(enumerate(z), key=lambda p: p[-1])][-2:]
        print(sorted_index_pos)
        # Delete the higher position first so the other position stays valid
        for i in sorted(sorted_index_pos, reverse=True):
            del x[i]
            del y[i]
        new_slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
        print(new_slope)

Output:

    11.08066739990693
    [78, 85]
    11.026005655263733

Why you need to be careful

First, we do not take the intercept into account, which could be a problem. Also, if I run the following:

    x = [4 + (i / 100) for i in range(1000)]
    y = [c * 10 + norm.rvs() * 4 for c in x]
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    print("the slope here is: " + str(slope))
    z = [c * slope for c in x]
    print("average of values: " + str(sum(x) / len(x)))
    plt.plot(x, y, 'b', x, z, 'r-')

I get the following output:

    the slope here is: 10.04367376783041
    average of values: 8.995

which shows that the points are not evenly distributed on either side of the slope line. Removing the point that lies farthest away may make the dataset even more unbalanced and so fail to improve the slope. So be careful when doing this.
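One way to sidestep the intercept caveat is to skip the distance heuristic and simply refit after every possible removal. The sketch below is not the answer's method but a brute-force alternative: it tries dropping 0, 1, then 2 points and keeps the fit whose slope is closest to the target. The function names `ols_slope` and `best_removal` are hypothetical, and the slope is computed in plain Python with the usual least-squares formula (the same slope `stats.linregress` returns) so the example has no dependencies. The cost grows roughly with the cube of the number of points, so this only suits small datasets.

```python
from itertools import combinations


def ols_slope(x, y):
    """Least-squares slope of y on x, with the intercept included in the fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def best_removal(x, y, target, tol=0.05, max_remove=2):
    """Return (removed_positions, slope).

    Tries removing 0, 1, ..., max_remove points and stops at the smallest
    number of removals whose best slope is within tol of target. If no
    combination reaches the tolerance, the closest slope found is returned.
    """
    best = ((), ols_slope(x, y))
    for k in range(max_remove + 1):
        for drop in combinations(range(len(x)), k):
            keep = [i for i in range(len(x)) if i not in drop]
            s = ols_slope([x[i] for i in keep], [y[i] for i in keep])
            if abs(s - target) < abs(best[1] - target):
                best = (drop, s)
        if abs(best[1] - target) / abs(target) <= tol:
            return best
    return best


# Constructed data: y = 10 * x except for the last point, which is too high.
x = [1, 2, 3, 4, 5, 6]
y = [10, 20, 30, 40, 50, 90]
removed, slope = best_removal(x, y, target=10)
print(removed, slope)  # -> (5,) 10.0
```

Because the search stops at the smallest number of removals that meets the tolerance, it never discards more points than the criterion requires.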