Confusion matrix for random forest in R caret

  •  5
  • Mike  · 7 years ago

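    I have data with a binary YES/NO class response. I ran an RF model using the following code, and I am having trouble getting the confusion matrix results.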

    library(readxl)
    library(caret)

    dataR    <- read_excel("*:/*.xlsx")
    # 70/30 split, stratified on the response
    Train    <- createDataPartition(dataR$Class, p = 0.7, list = FALSE)
    training <- dataR[Train, ]
    testing  <- dataR[-Train, ]

    model_rf <- train(Class ~ ., data = training, method = "rf",
                      tuneLength = 3, importance = TRUE,
                      trControl = trainControl(method = "cv", number = 5))
    

    Results:

    Random Forest 
    
    3006 samples
    82 predictor
    2 classes: 'NO', 'YES' 
    
    No pre-processing
    Resampling: Cross-Validated (5 fold) 
    Summary of sample sizes: 2405, 2406, 2405, 2404, 2404 
    Addtional sampling using SMOTE
    
    Resampling results across tuning parameters:
    
      mtry  Accuracy   Kappa    
       2    0.7870921  0.2750655
      44    0.7787721  0.2419762
      87    0.7767760  0.2524898
    
    Accuracy was used to select the optimal model using  the largest value.
    The final value used for the model was mtry = 2.
    

    So far so good, but when I run this code:

    # Apply threshold of 0.50: p_class
    class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO")
    
    # Create confusion matrix
    p <-confusionMatrix(class_log, testing[["Class"]])
    
    ##gives the accuracy
    p$overall[1]
    

     Error in model_rf[, 1] : incorrect number of dimensions
    

    I would appreciate any help getting the confusion matrix results.

    4 Answers
        1
  •  3
  •   missuse    5 years ago

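    To do this, you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. Specifying classProbs = T also saves the probabilities for each class.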

    data(iris)
    iris_2 <- iris[iris$Species != "setosa",] #make a two class problem
    iris_2$Species <- factor(iris_2$Species) #drop levels
    
    library(caret)
    model_rf  <- train(Species~., tuneLength = 3, data = iris_2, method = 
                           "rf", importance = TRUE,
                       trControl = trainControl(method = "cv",
                                                number = 5,
                                                savePredictions = "final",
                                                classProbs = T))
    

    The predictions are stored in:

    model_rf$pred
    

    They are sorted by CV fold. To sort them in the order of the original data frame:

    model_rf$pred[order(model_rf$pred$rowIndex),2]
    

    To obtain the confusion matrix:

    confusionMatrix(model_rf$pred[order(model_rf$pred$rowIndex),2], iris_2$Species)
    #output
    Confusion Matrix and Statistics
    
                Reference
    Prediction   versicolor virginica
      versicolor         46         6
      virginica           4        44
    
                   Accuracy : 0.9            
                     95% CI : (0.8238, 0.951)
        No Information Rate : 0.5            
        P-Value [Acc > NIR] : <2e-16         
    
                      Kappa : 0.8            
     Mcnemar's Test P-Value : 0.7518         
    
                Sensitivity : 0.9200         
                Specificity : 0.8800         
             Pos Pred Value : 0.8846         
             Neg Pred Value : 0.9167         
                 Prevalence : 0.5000         
             Detection Rate : 0.4600         
       Detection Prevalence : 0.5200         
          Balanced Accuracy : 0.9000         
    
           'Positive' Class : versicolor 
    

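    In a two-class setting, a probability threshold of 0.5 is not necessarily optimal. Kappa can be computed over a range of thresholds:

    probs <- model_rf$pred[order(model_rf$pred$rowIndex), ]
    res <- lapply(1:40 / 40, function(x) {
      # label a row versicolor when its predicted probability reaches the threshold x
      class <- factor(ifelse(probs$versicolor >= x, "versicolor", "virginica"),
                      levels = levels(iris_2$Species))
      mat <- confusionMatrix(class, iris_2$Species)
      data.frame(prob = x, kappa = unname(mat$overall["Kappa"]))
    })
    do.call(rbind, res)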
    

    Here the highest kappa is obtained not at threshold == 0.5 but at 0.1. Use this with care, because it can lead to overfitting.
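
    One way to guard against that overfitting is to choose the threshold on one set of rows and check it on rows that played no part in the choice. The following is a minimal base-R sketch of that idea, not the answer's own method: cohen_kappa and scan_thresholds are hypothetical helper names, and the data are synthetic.

```r
# Cohen's kappa for two factors that share the same levels
cohen_kappa <- function(pred, obs) {
  tab   <- table(pred, obs)
  n     <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  if (isTRUE(all.equal(p_exp, 1))) return(0)        # degenerate case: a single class
  (p_obs - p_exp) / (1 - p_exp)
}

# Kappa at each threshold; levels(obs)[1] is treated as the positive class
scan_thresholds <- function(prob_pos, obs, thresholds = seq(0.05, 0.95, by = 0.05)) {
  lv <- levels(obs)
  kappa <- sapply(thresholds, function(t) {
    pred <- factor(ifelse(prob_pos >= t, lv[1], lv[2]), levels = lv)
    cohen_kappa(pred, obs)
  })
  data.frame(threshold = thresholds, kappa = kappa)
}

# Synthetic data (hypothetical): noisy probabilities for a two-class outcome
set.seed(1)
n    <- 400
obs  <- factor(sample(c("versicolor", "virginica"), n, replace = TRUE))
prob <- pmin(pmax(ifelse(obs == "versicolor", 0.65, 0.35) + rnorm(n, sd = 0.2), 0), 1)

# First half of the rows: choose the threshold with the highest kappa
tune_idx <- seq_len(n) <= n / 2
scan     <- scan_thresholds(prob[tune_idx], obs[tune_idx])
best_t   <- scan$threshold[which.max(scan$kappa)]

# Second half: check how the chosen threshold holds up on unseen rows
lv        <- levels(obs)
pred_new  <- factor(ifelse(prob[!tune_idx] >= best_t, lv[1], lv[2]), levels = lv)
kappa_new <- cohen_kappa(pred_new, obs[!tune_idx])
print(best_t)
print(kappa_new)
```

    Applied to the counts in the confusion matrix above (46, 6, 4, 44), cohen_kappa gives 0.8, which matches the Kappa that confusionMatrix reports.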

        2
  •  1
  •   Hardik Gupta    7 years ago

    You can try this to create the confusion matrix and check the accuracy:

    m <- table(class_log, testing[["Class"]])
    m   #confusion table
    
    #Accuracy
    (sum(diag(m)))/nrow(testing)
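
    Note that sum(diag(m)) only counts correct predictions when the rows and columns of the table list the same classes in the same order. If the predictions happen to contain a single class, table() on plain character vectors returns a non-square table. A small sketch with made-up labels (the vectors obs and pred are hypothetical) that fixes the levels first and also derives sensitivity and specificity:

```r
lv <- c("NO", "YES")

# Hypothetical true and predicted labels; fixing the levels keeps the table 2 x 2
obs  <- factor(c("NO", "NO", "NO",  "YES", "YES", "NO", "YES", "NO"), levels = lv)
pred <- factor(c("NO", "NO", "YES", "YES", "NO",  "NO", "YES", "NO"), levels = lv)

m <- table(Prediction = pred, Reference = obs)
print(m)

accuracy    <- sum(diag(m)) / sum(m)
sensitivity <- m["YES", "YES"] / sum(m[, "YES"])  # true positives / all actual YES
specificity <- m["NO", "NO"]   / sum(m[, "NO"])   # true negatives / all actual NO
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```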
    
        3
  •  0
  •   Samuel    7 years ago

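    class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is an if-else statement that performs the following test:

    In the first column of model_rf, if the number is greater than 0.50, return "YES", otherwise return "NO", and save the result in the object class_log.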
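
    To illustrate with a plain numeric vector (prob_yes is a made-up example), and to show why the same indexing fails on a model object, which is a list rather than a matrix (fake_model is a hypothetical stand-in for the fitted model):

```r
# Hypothetical probabilities of the "YES" class for five observations
prob_yes <- c(0.10, 0.49, 0.50, 0.51, 0.90)

# ifelse() is vectorised: it returns one label per element of prob_yes
class_log <- ifelse(prob_yes > 0.50, "YES", "NO")
print(class_log)

# A fitted model object is a list, so matrix-style indexing raises an error
fake_model <- list(method = "rf")
err <- tryCatch(fake_model[, 1], error = function(e) conditionMessage(e))
print(err)
```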

        4
  •  0
  •   Victor Kostyuk    7 years ago

    You need to apply the model to the test set.

    prediction.rf <- predict(model_rf, testing, type = "prob")

    # type = "prob" returns one column per class, so threshold the "YES" column
    class_log <- ifelse(prediction.rf[["YES"]] > 0.50, "YES", "NO")
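
    From there, the labels can be passed to confusionMatrix once both vectors are factors with the same levels. A minimal sketch, assuming the caret package is installed; the probabilities and true labels below are made-up stand-ins for predict(model_rf, testing, type = "prob") and testing[["Class"]]:

```r
library(caret)

lv <- c("NO", "YES")

# Hypothetical stand-in for predict(model_rf, testing, type = "prob")
prediction.rf <- data.frame(NO  = c(0.9, 0.2, 0.6, 0.3, 0.8, 0.4),
                            YES = c(0.1, 0.8, 0.4, 0.7, 0.2, 0.6))

# Hypothetical stand-in for testing[["Class"]]
truth <- factor(c("NO", "YES", "YES", "YES", "NO", "NO"), levels = lv)

# Threshold the "YES" column and convert the labels to a factor with the same levels
class_log <- factor(ifelse(prediction.rf[["YES"]] > 0.50, "YES", "NO"), levels = lv)

p <- confusionMatrix(data = class_log, reference = truth, positive = "YES")
print(p$table)
print(p$overall["Accuracy"])
```

    Setting positive = "YES" makes sensitivity and specificity refer to the YES class; otherwise the first factor level, here "NO", is taken as the positive class.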