
GridSearchCV with separate training and validation sets mistakenly also takes the training results into account when finally choosing the best model

  • Outcast  ·  6 years ago

    I am using Xgboost and my objective is to achieve the highest precision at a classification threshold = 0.5.

    import numpy as np
    import pandas as pd
    import xgboost
    
    # Import datasets from edge node
    data_train = pd.read_csv('data.csv')
    data_valid = pd.read_csv('data_valid.csv')
    
    # Specify 'data_valid' as the validation set for the grid search below
    from sklearn.model_selection import PredefinedSplit
    X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
    train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)
    
    # Define my own scoring function to see
    # if it is called for both the training and the validation sets
    from sklearn.metrics import make_scorer
    custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True, needs_proba=False)
    
    # Instantiate xgboost
    from xgboost.sklearn import XGBClassifier
    classifier = XGBClassifier(random_state=0)
    
    # Small parameters' grid ONLY FOR START
    # I plan to use way bigger parameters' grids 
    parameters = {'n_estimators': [150, 175, 200]}
    
    # Execute grid search and retrieve the best classifier
    from sklearn.model_selection import GridSearchCV
    classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
                                       cv=train_valid_merge_indices, refit=True, n_jobs=-1)
    classifiers_grid.fit(X, y)
    

    ............................................................................

    train_valid_merge

    I want to train each model on my training set (data_train) and evaluate it on my distinct validation set (data_valid). train_valid_merge concatenates my training and validation sets so that they can be fed to GridSearchCV, and I also use PredefinedSplit to specify which observations belong to the training set and which to the validation set within this merged set:

    def train_valid_merge(data_train, data_valid):
    
        # Set test_fold values to -1 for training observations
        train_indices = [-1]*len(data_train)
    
        # Set test_fold values to 0 for validation observations
        valid_indices = [0]*len(data_valid)
    
        # Concatenate the indices for the training and validation sets
        train_valid_indices = train_indices + valid_indices
    
        # Concatenate data_train & data_valid
        import pandas as pd
        data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
        X = data.iloc[:, :-1].values
        y = data.iloc[:, -1].values
        return X, y, train_valid_indices
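    As a sanity check on the test_fold convention used above, here is a minimal sketch with toy data (not the question's actual datasets): PredefinedSplit with entries of -1 and 0 should yield exactly one split in which the -1 samples are used only for training.

```python
from sklearn.model_selection import PredefinedSplit

# Toy data: 4 "training" rows (test_fold = -1) and 2 "validation" rows (test_fold = 0)
ps = PredefinedSplit(test_fold=[-1, -1, -1, -1, 0, 0])

# Samples with test_fold == -1 are never placed on the test (validation) side
print(ps.get_n_splits())  # 1
for train_idx, valid_idx in ps.split():
    print(train_idx)  # [0 1 2 3]
    print(valid_idx)  # [4 5]
```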
    

    ............................................................................

    custom_scorer

    I defined my own scoring function, which simply returns the precision, to see whether it is called for both the training and the validation sets:

    def my_precision(y_true, y_predict):
    
        # Check length of 'y_true' to see if it is the training or the validation set
        print(len(y_true))
    
        # Calculate precision
        from sklearn.metrics import precision_score
        precision = precision_score(y_true, y_predict, average='binary')
    
        return precision
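    For reference, a scorer produced by make_scorer wraps the metric so that it can be called as scorer(estimator, X, y). A minimal sketch with toy data (the data here is illustrative, not from the question):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, precision_score

def my_precision(y_true, y_predict):
    return precision_score(y_true, y_predict, average='binary')

# A scorer wraps the metric so it can be called as scorer(estimator, X, y)
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True)

# Toy data: every label is 1, so an all-ones prediction has precision 1.0
X = [[0], [1], [0], [1]]
y = [1, 1, 1, 1]

# A classifier that always predicts the most frequent class (here: 1)
clf = DummyClassifier(strategy='most_frequent').fit(X, y)

print(custom_scorer(clf, X, y))  # 1.0
```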
    

    ............................................................................

    When I run the whole thing with parameters = {'n_estimators': [150, 175, 200]}, the following is printed by the print(len(y_true)) statement inside the my_precision function:

    600
    600
    3500
    600
    3500
    3500
    

    That is, the scoring function is called for both the training set (3500 observations) and the validation set (600 observations). But I have tested further: not only is the scoring function called for both, but the results from both the training and the validation sets are used to determine the best model from the grid search (even though I have specified that it should use only the validation set results).

    For example, for my three parameter values ('n_estimators': [150, 175, 200]) it takes into account the scores on both the training and the validation set (2 sets), and hence produces (3 parameter values) x (2 sets) = 6 different grid results. So it picks the best set of hyperparameters out of all of these grid results, which means it may end up picking one coming from the training set results, while I want only the validation set to be taken into account (3 results).
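    One way to check what the grid search actually ranks on is to inspect cv_results_ after fitting: in scikit-learn, best_index_ is the position of the best mean_test_score, and training-set scores (when computed at all) live under separate *_train_score keys. A minimal sketch with toy data and default scoring (DecisionTreeClassifier is a stand-in so the sketch runs without xgboost; it is not the question's setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = rng.randint(0, 2, 100)

# First 70 rows are training-only (-1), last 30 rows form the validation fold (0)
ps = PredefinedSplit(test_fold=[-1] * 70 + [0] * 30)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [1, 2, 3]},
                    cv=ps, return_train_score=True)
grid.fit(X, y)

# Model selection is based on the test (here: validation) scores only
assert grid.best_index_ == np.argmax(grid.cv_results_['mean_test_score'])

# Training scores are reported under separate keys and are not used for ranking
print(sorted(k for k in grid.cv_results_ if 'score' in k))
```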

    However, if I put inside my_precision something like the following to bypass the training set (by setting all of its precision values to 0):

    # Remember that the training set has 3500 observations
    # and the validation set 600 observations
    if len(y_true) > 600:
        return 0
    

    My questions are the following:

    Why does the custom scoring function take both the training and the validation set into account for choosing the best model, when I have specified with my train_valid_merge_indices that the best model of the grid search should be selected only according to the validation set?

    How can I make GridSearchCV take into account only the validation set and the models' scores on it when the selection and ranking of the models is done?

    1 Answer
  •  desertnaut  ·  6 years ago

    I have a distinct training set and a distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on the distinct validation set.

    Then you need neither PredefinedSplit nor GridSearchCV:

    import pandas as pd
    from xgboost.sklearn import XGBClassifier
    from sklearn.metrics import precision_score
    
    # Import datasets from edge node
    data_train = pd.read_csv('data.csv')
    data_valid = pd.read_csv('data_valid.csv')
    
    # training data & labels:
    X = data_train.iloc[:, :-1].values
    y = data_train.iloc[:, -1].values   
    
    # validation data & labels:
    X_valid = data_valid.iloc[:, :-1].values
    y_true = data_valid.iloc[:, -1].values 
    
    n_estimators = [150, 175, 200]
    perf = []
    
    for k_estimators in n_estimators:
        clf = XGBClassifier(n_estimators=k_estimators, random_state=0)
        clf.fit(X, y)
    
        y_predict = clf.predict(X_valid)
        precision = precision_score(y_true, y_predict, average='binary')
        perf.append(precision)
    

    perf will contain the performance of the respective classifiers on the validation set...
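    Since the question mentions much bigger parameter grids, the same manual loop can be extended to several hyperparameters with sklearn.model_selection.ParameterGrid. A sketch with toy data and illustrative parameter values (DecisionTreeClassifier stands in for XGBClassifier so the sketch runs without xgboost):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for XGBClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(80, 3), rng.randint(0, 2, 80)
X_valid, y_true = rng.rand(20, 3), rng.randint(0, 2, 20)

# ParameterGrid iterates over all combinations, like GridSearchCV's param_grid
param_grid = ParameterGrid({'max_depth': [2, 3], 'min_samples_leaf': [1, 5]})

best_params, best_precision = None, -1.0
for params in param_grid:
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    precision = precision_score(y_true, clf.predict(X_valid),
                                average='binary', zero_division=0)
    if precision > best_precision:
        best_params, best_precision = params, precision

# Only validation-set performance is used to pick the winner
print(best_params, best_precision)
```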