
Inconsistent results between predict() and predict_proba() in multiclass text classification with scikit-learn

Asked by Statmonger · 5 years ago

    I am working on a multiclass text classification problem that must return the top 5 matches rather than the single best match. Success is therefore defined as at least one of the top 5 matches being the correct classification, and the algorithm must achieve a success rate of at least 95% under this definition. Naturally, we train the model on a subset of the data and test on the remaining subset to validate the model's success.
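
    To make that definition concrete, here is a minimal sketch of the top-k metric (the helper name top_k_accuracy and the toy arrays are illustrative only, not part of the actual pipeline):

    import numpy as np

    def top_k_accuracy(classes, probas, y_true, k=5):
        # argsort is ascending, so the last k columns hold the k most probable classes
        top_k = classes[np.argsort(probas, axis=1)[:, -k:]]
        return np.mean([y in row for y, row in zip(y_true, top_k)])

    # toy illustration: two rows, three classes
    classes = np.array(['A', 'B', 'C'])
    probas = np.array([[0.2, 0.5, 0.3],
                       [0.6, 0.1, 0.3]])
    print(top_k_accuracy(classes, probas, ['C', 'B'], k=2))  # 0.5 -- row 1 hits, row 2 misses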

    I have been using Python scikit-learn's predict_proba() to select the top 5 matches, and a custom script to compute the success rate below; this appears to work fine on my sample data. However, I noticed that on my own custom data the top-5 success rate comes out lower than the top-1 success rate from .predict(), which is mathematically impossible: the top-1 result is automatically included in the top-5 results, so the top-5 success rate must be at least as high as the top-1 rate, if not higher. To troubleshoot, I compare the top-1 success rates from predict() and predict_proba() to make sure they are equal, and check that the top-5 success rate is greater than the top-1 rate.
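
    To see why the two top-1 rates must agree: for LogisticRegression, predict() returns the class with the highest predicted probability, so the top-1 result can be reconstructed from predict_proba(). A minimal self-contained check (the tiny demo arrays are my own, purely for illustration):

    from sklearn.linear_model import LogisticRegression
    import numpy as np

    X_demo = np.array([[0.], [1.], [2.], [3.]])
    y_demo = np.array(['A', 'A', 'B', 'B'])
    clf_demo = LogisticRegression().fit(X_demo, y_demo)

    # the class with the largest probability should match predict() row for row
    probas = clf_demo.predict_proba(X_demo)
    top1_from_probas = clf_demo.classes_[np.argmax(probas, axis=1)]
    print((top1_from_probas == clf_demo.predict(X_demo)).all())  # expect True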

    I have set up the script below to walk you through my logic, to see whether I have made a bad assumption somewhere or whether there is something in my data that needs fixing. I am testing many classifiers and feature sets, but for simplicity you will see that I just use count vectors as features and logistic regression as the classifier, since I don't believe these are part of the problem (as far as I can tell). I would greatly appreciate it if someone could explain why I am seeing this discrepancy.

    Code:

    # Set up environment
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics, model_selection
    from sklearn.feature_extraction.text import CountVectorizer
    
    import pandas as pd
    import numpy as np
    
    #Read in data and do just a bit of preprocessing
    
    # User's Location of git repository
    Git_Location = 'C:/Documents'
    
    # Set Data Location:
    data = Git_Location + '/Data.csv'
    
    # load the data
    df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
    df = df[['CODE','Description']] #select only these columns
    df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})
    
    #Convert label to float so you don't need to encode for processing later on
    df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
    df['label'] = df['label'].astype('float64')
    
    # keep only labels with more than 500 rows, to build a strong model and make our testing run faster -- we will get more data later
    df = df.groupby('label').filter(lambda x : len(x)>500)
    
    #split data into testing and training
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)
    
    # Other examples online use the following data types... we will do the same to remain consistent
    train_y_npar = pd.Series(train_y).values
    train_x_list = pd.Series.tolist(train_x)
    valid_x_list = pd.Series.tolist(valid_x)
    
    # cast validation datasets to dataframes to allow merging later on
    valid_x_df = pd.DataFrame(valid_x)
    valid_y_df = pd.DataFrame(valid_y)
    
    
    # Extracting features from data
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(train_x_list)
    X_test_counts = count_vect.transform(valid_x_list)
    
    # Define the model training and validation function
    def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    
        # fit the training dataset on the classifier
        classifier.fit(feature_vector_train, label)
    
        # predict the top n labels on validation dataset
        n = 5
        #classifier.probability = True
        probas = classifier.predict_proba(feature_vector_valid)
        predictions = classifier.predict(feature_vector_valid)
    
        #Identify the indexes of the top predictions
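        # note: np.argsort sorts ascending, so the last n columns hold the n most
        # probable classes, and the final column (index -1) is the single best class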
        top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]
    
        #then find the associated SOC code for each prediction
        top_class = classifier.classes_[top_n_predictions]
    
        #cast to a new dataframe
        top_class_df = pd.DataFrame(data=top_class)
    
        #merge it up with the validation labels and descriptions
        results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
        results = pd.merge(results, top_class_df, left_index=True, right_index=True)
    
    
        top5_conditions = [
            (results.iloc[:,0] == results[0]),
            (results.iloc[:,0] == results[1]),
            (results.iloc[:,0] == results[2]),
            (results.iloc[:,0] == results[3]),
            (results.iloc[:,0] == results[4])]
        top5_choices = [1, 1, 1, 1, 1]
    
        #Top 1 Result
        #top1_conditions = [(results['0_x'] == results[4])]
        top1_conditions = [(results.iloc[:,0] == results[4])]
        top1_choices = [1]
    
        # Create the success columns
        results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
        results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)
    
        print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
       print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))
    
        print(" ")
        print("Details: ")
        print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
        print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
        print("Top 1 Accuracy Rate = (predict)=", metrics.accuracy_score(valid_y, predictions))
    

    Example output using scikit-learn's built-in 20 newsgroups dataset (this is the behavior I'm aiming for). Note: I ran this exact code on another dataset and was able to produce these results, which tells me that the function and its dependencies work, so the problem must be in the data.

    Are Top 5 Results greater than Top 1 Result?:  True 
    Are Top 1 Results equal from predict() and predict_proba()?:  True  
    

    Details:

    Top 5 Accuracy Rate (predict_proba)=  0.9583112055231015 
    Top 1 Accuracy Rate (predict_proba)=  0.8069569835369091 
    Top 1 Accuracy Rate = (predict)= 0.8069569835369091
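
    For reference, here is a minimal sketch of how the same function can be driven by the built-in 20 newsgroups data, reusing the imports and TV_model defined above (the loading arguments, split parameters, and variable names are my own illustration; note the reset_index calls, which matter for the index-based merges inside TV_model, as it turns out below):

    news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    news_df = pd.DataFrame({'text': news.data, 'label': news.target})
    
    ng_train_x, ng_valid_x, ng_train_y, ng_valid_y = model_selection.train_test_split(
        news_df.text, news_df.label, test_size=0.33, random_state=6, stratify=news_df.label)
    ng_valid_x = ng_valid_x.reset_index(drop=True)
    ng_valid_y = ng_valid_y.reset_index(drop=True)
    
    ng_vect = CountVectorizer()
    ng_train_counts = ng_vect.fit_transform(ng_train_x.tolist())
    ng_valid_counts = ng_vect.transform(ng_valid_x.tolist())
    
    TV_model(LogisticRegression(), ng_train_counts, ng_train_y.values,
             ng_valid_counts, pd.DataFrame(ng_valid_y), pd.DataFrame(ng_valid_x))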
    

    Now running it on my data:

    TV_model(LogisticRegression(), X_train_counts, train_y_npar, X_test_counts, valid_y_df, valid_x_df)
    

    Output:

    Are Top 5 Results greater than Top 1 Result?:  False 
    Are Top 1 Results equal from predict() and predict_proba()?:  False   
    

    Details:

    Top 5 Accuracy Rate (predict_proba)=  0.6581632653061225
    Top 1 Accuracy Rate (predict_proba)=  0.2010204081632653
    Top 1 Accuracy Rate = (predict)= 0.8091187478734263
1 Answer

Statmonger · answered 5 years ago:

    Update: found the solution! The root cause was index alignment: train_test_split preserves the original row indexes, while the DataFrame built from the predict_proba() output gets a fresh 0..n index, so the index-based merges inside the function were pairing predictions with the wrong validation rows. All I needed to do was reset the validation dataset indexes right after the train/test split.
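
    A toy demonstration of the misalignment (the data is made up purely to illustrate the mechanics):

    import pandas as pd
    from sklearn import model_selection

    s = pd.Series(['a', 'b', 'c', 'd'])
    _, valid = model_selection.train_test_split(s, test_size=0.5, random_state=0)
    print(valid.index)  # keeps the original row positions (e.g. [3, 0]), not [0, 1]
    
    preds = pd.DataFrame(['d', 'a'])  # a freshly built DataFrame gets a 0..n RangeIndex
    # an index-based merge now drops non-overlapping rows and pairs the rest incorrectly
    print(pd.merge(pd.DataFrame(valid), preds, left_index=True, right_index=True))
    
    valid = valid.reset_index(drop=True)  # resetting first restores row alignment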

    Updated code:

    # Set up environment
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics, model_selection
    from sklearn.feature_extraction.text import CountVectorizer
    
    import pandas as pd
    import numpy as np
    
    #Read in data and do just a bit of preprocessing
    
    # User's Location of git repository
    Git_Location = 'C:/Documents'
    
    # Set Data Location:
    data = Git_Location + '/Data.csv'
    
    # load the data
    df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
    df = df[['CODE','Description']] #select only these columns
    df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})
    
    #Convert label to float so you don't need to encode for processing later on
    df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
    df['label'] = df['label'].astype('float64')
    
    # keep only labels with more than 500 rows, to build a strong model and make our testing run faster -- we will get more data later
    df = df.groupby('label').filter(lambda x : len(x)>500)
    
    #split data into testing and training
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)
    
    #reset the index so the validation sets align with the fresh 0..n index of the predictions
    valid_y = valid_y.reset_index(drop=True)
    valid_x = valid_x.reset_index(drop=True)
    
    # Other examples online use the following data types... we will do the same to remain consistent
    train_y_npar = pd.Series(train_y).values
    train_x_list = pd.Series.tolist(train_x)
    valid_x_list = pd.Series.tolist(valid_x)
    
    # cast validation datasets to dataframes to allow merging later on
    valid_x_df = pd.DataFrame(valid_x)
    valid_y_df = pd.DataFrame(valid_y)
    
    
    # Extracting features from data
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(train_x_list)
    X_test_counts = count_vect.transform(valid_x_list)
    
    # Define the model training and validation function
    def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    
        # fit the training dataset on the classifier
        classifier.fit(feature_vector_train, label)
    
        # predict the top n labels on validation dataset
        n = 5
        #classifier.probability = True
        probas = classifier.predict_proba(feature_vector_valid)
        predictions = classifier.predict(feature_vector_valid)
    
        #Identify the indexes of the top predictions
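        # note: np.argsort sorts ascending, so the last n columns hold the n most
        # probable classes, and the final column (index -1) is the single best class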
        top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]
    
        #then find the associated SOC code for each prediction
        top_class = classifier.classes_[top_n_predictions]
    
        #cast to a new dataframe
        top_class_df = pd.DataFrame(data=top_class)
    
        #merge it up with the validation labels and descriptions
        results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
        results = pd.merge(results, top_class_df, left_index=True, right_index=True)
    
    
        top5_conditions = [
            (results.iloc[:,0] == results[0]),
            (results.iloc[:,0] == results[1]),
            (results.iloc[:,0] == results[2]),
            (results.iloc[:,0] == results[3]),
            (results.iloc[:,0] == results[4])]
        top5_choices = [1, 1, 1, 1, 1]
    
        #Top 1 Result
        #top1_conditions = [(results['0_x'] == results[4])]
        top1_conditions = [(results.iloc[:,0] == results[4])]
        top1_choices = [1]
    
        # Create the success columns
        results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
        results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)
    
        print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
       print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))
    
        print(" ")
        print("Details: ")
        print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
        print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
        print("Top 1 Accuracy Rate = (predict)=", metrics.accuracy_score(valid_y, predictions))