
Combining an sklearn Pipeline with nested cross-validation for KNN regression

  •  4
  • Austin  · 7 years ago

    I want to build a regression model with sklearn.neighbors.KNeighborsRegressor that includes:

    • cross-validating the hyperparameter K over the range 1 to 20
    • using RMSE as the error metric

    There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.

    In addition to sklearn.neighbors.KNeighborsRegressor, I think I need:

    sklearn.pipeline.Pipeline  
    sklearn.preprocessing.Normalizer
    sklearn.model_selection.GridSearchCV
    sklearn.model_selection.cross_val_score
    
    sklearn.feature_selection.SelectKBest
    OR
    sklearn.feature_selection.SelectFromModel
    

    Can someone show me what the definition of this pipeline/workflow could look like? I think it should be something like this:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score, GridSearchCV
    
    # build regression pipeline
    pipeline = Pipeline([('normalize', Normalizer()),
                         ('kbest', SelectKBest(f_classif)),
                         ('regressor', KNeighborsRegressor())])
    
    # try regressor__n_neighbors from 1 to 20, and feature count from 1 to the number of features
    parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
                  'regressor__n_neighbors': list(range(1,21))}
    
    # outer cross-validation on model, inner cross-validation on hyperparameters
    scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10), 
                             X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)
    
    rmses = np.sqrt(np.abs(scores))
    avg_rmse = np.mean(rmses)
    print(avg_rmse)
    

    It doesn't seem to raise any errors, but some of my concerns are:

    • If I want my final model to be selected according to the best RMSE, should I use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
    • Is SelectKBest with f_classif an appropriate way to select features for a KNeighborsRegressor model?
    • How can I find out which feature subset was selected as the best?
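    A note on the f_classif concern above: f_classif computes an ANOVA F-statistic that assumes a categorical target, so for a continuous regression target f_regression is the usual score function. A minimal sketch on synthetic data (the dataset, shapes, and k here are assumptions for illustration, not part of the original question):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# toy regression data: 5 features, 2 of them informative (assumed for illustration)
X, y = make_regression(n_samples=100, n_features=5, n_informative=2, random_state=0)

# f_regression scores each feature by its F-statistic against the continuous target
selector = SelectKBest(f_regression, k=2).fit(X, y)
print("F-scores:", np.round(selector.scores_, 2))
print("selected columns:", selector.get_support(indices=True))
```

    Swapping f_regression in for f_classif in the pipeline above only changes the score function; the kbest__k grid search stays the same.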

    Thanks so much for any help!

    1 Answer  |  7 years ago
        1
  •  6
  •   seralouk    7 years ago

    Your code looks fine to me.

    Regarding using scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV, I would do the same to make sure things run properly; the only way to verify this is to remove one of the two and see whether the results change.
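    As a side note (not in the original answer): scikit-learn 0.22+ also ships a "neg_root_mean_squared_error" scorer, which yields RMSE directly and avoids the manual abs/sqrt conversion. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# toy data, assumed for illustration only
X, y = make_regression(n_samples=100, n_features=4, random_state=0)

# scorer available since scikit-learn 0.22; values are negative RMSE,
# so negate the mean to get an average RMSE
scores = cross_val_score(KNeighborsRegressor(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
avg_rmse = -scores.mean()
print(avg_rmse)
```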

    SelectKBest is a good approach, but you could also use SelectFromModel or other methods that you can find here.
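    For completeness, a minimal sketch of the SelectFromModel alternative inside a pipeline, using a Lasso as the selecting estimator (the Lasso, alpha, and synthetic data are my assumptions, not from the original answer):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# toy data: 10 features, 3 informative (assumed for illustration)
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

# SelectFromModel keeps the features whose Lasso coefficients pass its threshold,
# then the KNN regressor is fit on that reduced feature set
pipe = Pipeline([('select', SelectFromModel(Lasso(alpha=1.0))),
                 ('regressor', KNeighborsRegressor())])
pipe.fit(X, y)
n_kept = pipe.named_steps['select'].get_support().sum()
print("features kept:", n_kept)
```

    Unlike SelectKBest, the number of kept features is not fixed up front; it follows from the selector's coefficient threshold.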

    Finally, to get the best parameters and the feature scores, I would modify your code as follows:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import GridSearchCV
    
    
    pipeline = Pipeline([('normalize', Normalizer()),
                         ('kbest', SelectKBest(f_classif)),
                         ('regressor', KNeighborsRegressor())])
    
    # try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
    parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
                  'regressor__n_neighbors': list(range(1,21))}
    
    # changes here
    
    grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
    
    grid.fit(X, y)
    
    # get the best parameters and the best estimator
    print("the best estimator is \n {} ".format(grid.best_estimator_))
    print("the best parameters are \n {}".format(grid.best_params_))
    
    # get the features scores rounded in 2 decimals
    pip_steps = grid.best_estimator_.named_steps['kbest']
    
    features_scores = ['%.2f' % elem for elem in pip_steps.scores_ ]
    print("the features scores are \n {}".format(features_scores))
    
    feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
    print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))
    
    # create a tuple of feature names, scores and pvalues, name it "features_selected_tuple"
    
    featurelist = ['age', 'weight']
    
    features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                               for i in pip_steps.get_support(indices=True)]
    
    # Sort the tuple by score, in reverse order
    
    features_selected_tuple = sorted(features_selected_tuple,
                                     key=lambda feature: float(feature[1]), reverse=True)
    
    # Print
    print('Selected Features, Scores, P-Values')
    print(features_selected_tuple)
    

    Results using my data:

    the best estimator is
    Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=18, p=2,
             weights='uniform'))])
    
    the best parameters are
    {'kbest__k': 2, 'regressor__n_neighbors': 18}
    
    the features scores are
    ['8.98', '8.80']
    
    the feature_pvalues is
    ['0.000', '0.000']
    
    Selected Features, Scores, P-Values
    [('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]