I am using XGBoost, and my goal is to achieve the highest precision at a classification threshold of 0.5.
import numpy as np
import pandas as pd
import xgboost
# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')
# Specify 'data_valid' as the validation set for the Grid Search below
from sklearn.model_selection import PredefinedSplit
X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)
# Define my own scoring function to see
# if it is called for both the training and the validation sets
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True, needs_proba=False)
# Instantiate xgboost
from xgboost.sklearn import XGBClassifier
classifier = XGBClassifier(random_state=0)
# Small parameters' grid ONLY FOR START
# I plan to use way bigger parameters' grids
parameters = {'n_estimators': [150, 175, 200]}
# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
cv=train_valid_merge_indices, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)
............................................................................
train_valid_merge

I want to train every model on my training set (data_train) and evaluate it on my validation set (data_valid). train_valid_merge concatenates my training and validation sets so that they can be fed to GridSearchCV, and I also used PredefinedSplit to specify which part of this merged set is the training set and which is the validation set:
def train_valid_merge(data_train, data_valid):
    # Set test_fold values to -1 for training observations
    train_indices = [-1] * len(data_train)
    # Set test_fold values to 0 for validation observations
    valid_indices = [0] * len(data_valid)
    # Concatenate the indices for the training and validation sets
    train_valid_indices = train_indices + valid_indices
    # Concatenate data_train & data_valid
    import pandas as pd
    data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y, train_valid_indices
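
For context, this is how I understand PredefinedSplit to behave with these indices: observations marked -1 are never placed in a test fold, so the split should yield exactly one training/validation pair. A minimal sketch with toy sizes (not my real data):

import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy example: 5 "training" rows marked -1, followed by 3 "validation" rows in fold 0
test_fold = [-1] * 5 + [0] * 3
ps = PredefinedSplit(test_fold=test_fold)

print(ps.get_n_splits())            # 1 -> a single train/validation split
for train_idx, valid_idx in ps.split():
    print(train_idx)                # [0 1 2 3 4] -> the rows coming from data_train
    print(valid_idx)                # [5 6 7]     -> the rows coming from data_valid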
............................................................................
custom_scorer
I defined my own scoring function, which simply returns the precision, to see whether it is called for both the training and the validation sets:
def my_precision(y_true, y_predict):
    # Check length of 'y_true' to see if it is the training or the validation set
    print(len(y_true))
    # Calculate precision
    from sklearn.metrics import precision_score
    precision = precision_score(y_true, y_predict, average='binary')
    return precision
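
Just to show the call chain I expect (a toy sketch with made-up data, not my real sets): the object returned by make_scorer is called as custom_scorer(estimator, X, y) and should internally call my_precision(y, estimator.predict(X)).

import numpy as np
from sklearn.dummy import DummyClassifier

# Tiny made-up data only to illustrate the call chain
X_toy = np.array([[0.], [1.], [2.], [3.]])
y_toy = np.array([0, 1, 1, 0])

clf = DummyClassifier(strategy='constant', constant=1).fit(X_toy, y_toy)

# Prints 4 (the length check inside my_precision) and then 0.5 (2 TP / 4 predicted positives)
print(custom_scorer(clf, X_toy, y_toy))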
............................................................................
When I run the whole thing with parameters = {'n_estimators': [150, 175, 200]}, I get the following output from print(len(y_true)) in the my_precision function:
600
600
3500
600
3500
3500
That is, the scoring function is called for both the training and the validation sets. But I have also tested that not only is the scoring function called for both sets, its results from both the training and the validation sets are used to determine the best model from the grid search (even though I have specified that only the validation set results should be used). For the three values in 'n_estimators': [150, 175, 200], it takes into account the scores on both the training and the validation sets (2 sets), which yields (3 parameter values) x (2 sets) = 6 different grid results. It then picks the best hyperparameter set from all of these grid results, so it may end up picking one based on the training set results, whereas I want only the validation set to be taken into account (3 results).
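
One way to see which scores are available to the grid search is to look at cv_results_ after the fit; a minimal sketch (assuming classifiers_grid has been fitted as above):

import pandas as pd

results = pd.DataFrame(classifiers_grid.cv_results_)

# 'mean_test_score' / 'rank_test_score' come from the validation fold;
# 'mean_train_score' columns are present only when training scores are computed
score_cols = [c for c in results.columns if 'test_score' in c or 'train_score' in c]
print(results[['params'] + score_cols])
print(classifiers_grid.best_params_)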
If I put something like the following in my_precision to bypass the training set (by setting all of its precision values to 0):

# Remember that the training set has 3500 observations
# and the validation set 600 observations
if len(y_true) > 600:
    return 0
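
For completeness, the modified scorer would look roughly like this; note that it relies on the assumption that the validation set always has exactly 600 observations, so it is a hack rather than a proper solution:

from sklearn.metrics import precision_score

def my_precision(y_true, y_predict):
    print(len(y_true))
    # Hack: the training set has 3500 observations and the validation set 600,
    # so anything longer than 600 must be the training set -> give it a score of 0
    if len(y_true) > 600:
        return 0
    return precision_score(y_true, y_predict, average='binary')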
My questions are the following:

Why does the custom scoring function take into account both the training and the validation sets to choose the best model, when I have already specified with my train_valid_merge_indices that the best model for the grid search should be selected based only on the validation set?

How can I make GridSearchCV take into account only the validation set and the models' scores on it when the selection and ranking of the models is done?