
K-fold cross-validation accuracy ranking disagrees with the ranking from individually run models

  • Stanleyrr  ·  6 years ago

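    This is my first time running k-fold cross-validation, and I am puzzled by something in the output. 5-fold cross-validation consistently gives model 8 (AdaBoost classifier) and model 9 (gradient boosting classifier) the highest accuracy scores, as shown below. However, when I fit the same models individually and hold out 20% of the dataset as test data, model 7 (random forest classifier) always comes out best of the five models according to the confusion matrix and AUC. I had expected that a model with high k-fold cross-validation accuracy would also score highly when run on its own. That does not seem to be the case here. Can someone explain why I am seeing this discrepancy?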

    Here are the ML models I trained on the data:

    model6 = DecisionTreeClassifier()
    model7 = RandomForestClassifier(n_estimators=300)
    model8 = AdaBoostClassifier(n_estimators=300)
    model9 = GradientBoostingClassifier(n_estimators=300, learning_rate=1.0, max_depth=1, random_state=0)
    model10 = KNeighborsClassifier(n_neighbors=5)
    

    Here is my full code for the 5-fold CV and the individual ML models:

    import warnings
    from sklearn import metrics
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    
    X_train, X_test, Y_train, Y_test = train_test_split(whole_data_input, whole_data_output, test_size=0.2)
    # Reset each index to 0..n-1; drop=True discards the old index instead of adding it as an 'index' column:
    X_train.reset_index(drop=True, inplace=True)
    X_test.reset_index(drop=True, inplace=True)
    Y_train.reset_index(drop=True, inplace=True)
    Y_test.reset_index(drop=True, inplace=True)
    
    warnings.filterwarnings('ignore')
    
    model6 = DecisionTreeClassifier()
    model7 = RandomForestClassifier(n_estimators=300)
    model8 = AdaBoostClassifier(n_estimators=300)
    model9 = GradientBoostingClassifier(n_estimators=300, learning_rate=1.0, max_depth=1, random_state=0)
    model10 = KNeighborsClassifier(n_neighbors=5)
    
    model6.fit(X_train, Y_train)
    model7.fit(X_train, Y_train)
    model8.fit(X_train, Y_train)
    model9.fit(X_train, Y_train)
    model10.fit(X_train, Y_train)
    
    # Perform 5-fold cross validation across different models:
    
    # Use whole_data['label'] (a 1-D Series) instead of the whole_data[['label']] DataFrame created earlier, because cross-validation only works with this data shape:
    whole_data_output=whole_data['label']    
    
    print('THE FOLLOWING OUTPUT REPRESENT ACCURACIES OF 5-FOLD VALIDATIONS FROM VARIOUS ML MODELS:')
    print()
    scores = cross_val_score(model6, whole_data_input, whole_data_output, cv=5)
    print('Cross-validated scores for model6, Decision Tree Classifier, is:' + str(scores))
    
    print()
    scores = cross_val_score(model7, whole_data_input, whole_data_output, cv=5)
    print('Cross-validated scores for model7, Random Forest Classifier, is:' + str(scores))
    
    print()
    scores = cross_val_score(model8, whole_data_input, whole_data_output, cv=5)
    print('Cross-validated scores for model8, Adaboost Classifier, is:' + str(scores))
    
    print()
    scores = cross_val_score(model9, whole_data_input, whole_data_output, cv=5)
    print('Cross-validated scores for model9, Gradient Boosting Classifier, is:' + str(scores))
    
    print()
    scores = cross_val_score(model10, whole_data_input, whole_data_output, cv=5)
    print('Cross-validated scores for model10, K Neighbors Classifier, is:' + str(scores))
    
    print('THE FOLLOWING OUTPUT REPRESENT RESULTS FROM VARIOUS ML MODELS:')
    print()
    
    result6 = model6.predict(X_test)
    result7 = model7.predict(X_test)
    result8 = model8.predict(X_test)
    result9 = model9.predict(X_test)
    result10 = model10.predict(X_test)
    
    from sklearn.metrics import classification_report
    
    print('Classification report for model 6, decision tree classifier, is: ')
    print(confusion_matrix(Y_test,result6))
    print()
    print(classification_report(Y_test,result6))
    print()
    print("Area under curve (auc) of model6 is: ", metrics.roc_auc_score(Y_test, result6)) 
    print()
    
    print('Classification report for model 7, random forest classifier, is: ')
    print(confusion_matrix(Y_test,result7))
    print()
    print(classification_report(Y_test,result7))
    print()
    print("Area under curve (auc) of model7 is: ", metrics.roc_auc_score(Y_test, result7)) 
    print()
    
    print('Classification report for model 8, adaboost classifier, is: ')
    print(confusion_matrix(Y_test,result8))
    print()
    print(classification_report(Y_test,result8))
    print()
    print("Area under curve (auc) of model8 is: ", metrics.roc_auc_score(Y_test, result8)) 
    print()
    
    print('Classification report for model 9, gradient boosting classifier, is: ')
    print(confusion_matrix(Y_test,result9))
    print()
    print(classification_report(Y_test,result9))
    print()
    print("Area under curve (auc) of model9 is: ", metrics.roc_auc_score(Y_test, result9)) 
    print()
    
    print('Classification report for model 10, K neighbors classifier, is: ')
    print(confusion_matrix(Y_test,result10))
    print()
    print(classification_report(Y_test,result10))
    print()
    print("Area under curve (auc) of model10 is: ", metrics.roc_auc_score(Y_test, result10)) 
    print()
    

    The following output shows the 5-fold cross-validation accuracies of the various ML models:

    Cross-validated scores for model6, Decision Tree Classifier, is:[ 0.61364665  0.75754735  0.77046902]
    
    Cross-validated scores for model7, Random Forest Classifier, is:[ 0.62463637  0.79326395  0.8073181 ]
    
    Cross-validated scores for model8, Adaboost Classifier, is:[ 0.64916931  0.81960696  0.84196916]
    
    Cross-validated scores for model9, Gradient Boosting Classifier, is:[ 0.64910466  0.82177258  0.83909235]
    
    Cross-validated scores for model10, K Neighbors Classifier, is:[ 0.61180425  0.75412115  0.73012897]
    

    The following output shows the results from the individually run ML models:

    Classification report for model 6, decision tree classifier, is: 
    [[6975 1804]
    [1893 7891]]
    
             precision    recall  f1-score   support
    
         -1       0.79      0.79      0.79      8779
          1       0.81      0.81      0.81      9784
    avg / total       0.80      0.80      0.80     18563
    
    Area under curve (auc) of model6 is:  0.800515237805
    
    Classification report for model 7, random forest classifier, is: 
    [[6883 1896]
    [1216 8568]]
    
             precision    recall  f1-score   support
    
         -1       0.85      0.78      0.82      8779
          1       0.82      0.88      0.85      9784
    avg / total       0.83      0.83      0.83     18563
    
    Area under curve (auc) of model7 is:  0.829872762782
    
    Classification report for model 8, adaboost classifier, is: 
    [[5851 2928]
    [ 891 8893]]
    
             precision    recall  f1-score   support
    
         -1       0.87      0.67      0.75      8779
          1       0.75      0.91      0.82      9784
    avg / total       0.81      0.79      0.79     18563
    
    Area under curve (auc) of model8 is:  0.787704885721
    
    Classification report for model 9, gradient boosting classifier, is: 
    [[5905 2874]
    [ 918 8866]]
    
             precision    recall  f1-score   support
    
         -1       0.87      0.67      0.76      8779
          1       0.76      0.91      0.82      9784
    avg / total       0.81      0.80      0.79     18563
    
    Area under curve (auc) of model9 is:  0.789400603089
    
    Classification report for model 10, K neighbors classifier, is: 
    [[6467 2312]
    [1666 8118]]
    
             precision    recall  f1-score   support
    
         -1       0.80      0.74      0.76      8779
          1       0.78      0.83      0.80      9784
    
    avg / total       0.79      0.79      0.79     18563
    
    Area under curve (auc) of model10 is:  0.783183129908
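
One detail about the AUC figures above: `roc_auc_score` was given the hard labels returned by `predict`, so each ROC curve has a single operating point and the value works out to balanced accuracy rather than a ranking-based AUC. Below is a minimal sketch of the alternative, passing the positive-class probability from `predict_proba`. The synthetic dataset from `make_classification`, the model size, and the `random_state` values are assumptions made for illustration, not the data or settings from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset standing in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Hard 0/1 predictions, as used in the question's code.
hard_labels = model.predict(X_test)
# Probability of the positive class (column 1 of predict_proba).
positive_proba = model.predict_proba(X_test)[:, 1]

# AUC from hard labels coincides with balanced accuracy for a binary problem.
auc_from_labels = roc_auc_score(y_test, hard_labels)
balanced_acc = balanced_accuracy_score(y_test, hard_labels)

# AUC from probabilities measures how well the model ranks positives above negatives.
auc_from_proba = roc_auc_score(y_test, positive_proba)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", balanced_acc)
print("AUC from probabilities:", auc_from_proba)
```

The two kinds of AUC can rank models differently, so it may be worth checking which one is being compared against the cross-validated accuracy.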
    
    1 answer  |  as of 6 years ago
  • Stev  ·  6 years ago

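    Try setting cv=StratifiedKFold(n_splits=5, shuffle=True) in your cross_val_score calls and see whether it makes a difference. My understanding is that train_test_split shuffles the data before splitting, whereas cross_val_score (by default) does not.

    You can import it with from sklearn.model_selection import StratifiedKFold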
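
The suggestion above can be sketched as follows. This is a minimal illustration on synthetic data, not the asker's dataset: `make_classification`, the sorting step (which mimics data stored in a non-random order), the smaller forest, and the `random_state` values are all assumptions made for the example. It first checks which splitter an integer `cv` resolves to for a classifier, then compares unshuffled and shuffled 5-fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, check_cv, cross_val_score

# Synthetic stand-in for whole_data_input / whole_data_output (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Sort rows by the first feature to mimic a dataset stored in a non-random order.
order = np.argsort(X[:, 0])
X, y = X[order], y[order]

# For a classifier, an integer cv resolves to StratifiedKFold with shuffle=False,
# so every fold is a contiguous slice of each class in storage order.
default_cv = check_cv(cv=5, y=y, classifier=True)
print(type(default_cv).__name__, "shuffle =", default_cv.shuffle)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Default behaviour: no shuffling before the folds are cut.
unshuffled_scores = cross_val_score(model, X, y, cv=5)

# Suggested change: shuffle within each class before cutting the folds.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)

print("Unshuffled folds:", unshuffled_scores, "mean =", unshuffled_scores.mean())
print("Shuffled folds:  ", shuffled_scores, "mean =", shuffled_scores.mean())
```

If the shuffled fold scores are much more uniform than the unshuffled ones, the row order of the data is a likely contributor to the mismatch with the `train_test_split` results, since that function shuffles by default.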