
XGBoost showing the same prediction for all test data

  Nikhil Utane · asked 6 years ago

    I am working on a problem of predicting an output label from a set of input values. Since I don't have real data yet, I am creating some dummy data so that the code is ready by the time the data arrives. Below is what the sample data looks like: there is a set of input values, and the last column, 'output', is the label to be predicted.

    input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
    0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
    96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27
    

    Since this is dummy data, I set the output label to the input that has the maximum value. My expectation is that the XGBoost algorithm should learn this on its own and predict the output label correctly.
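
    Since the label is just the position of the row's maximum, a minimal sketch of such a generator could look like the following (the value ranges, row count, and the one-label-per-column loc<i> naming here are just for illustration; the real label-to-column mapping may differ):

    import numpy as np
    import pandas as pd

    n_rows, n_cols = 10000, 32
    rng = np.random.RandomState(42)
    # baseline values around 100, similar to the sample rows above
    X = rng.normal(100, 10, size=(n_rows, n_cols))
    # spike one column per row so it is the unambiguous maximum
    winner = rng.randint(0, n_cols, size=n_rows)
    X[np.arange(n_rows), winner] += 60
    df = pd.DataFrame(X.round(1), columns=['input_%d' % (i + 1) for i in range(n_cols)])
    df['output'] = ['loc%d' % (i + 1) for i in winner]
    df.to_csv('data.txt', index=False)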

    I wrote the code below to train and test XGBoost.

    from __future__ import division
    import numpy as np
    import pandas as pd
    import scipy.sparse
    import pickle
    import xgboost as xgb
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
    
    df=pd.read_csv("data.txt", sep=',')
    
    # Create training and validation sets
    sz = df.shape
    train = df.iloc[:int(sz[0] * 0.7), :]
    test = df.iloc[int(sz[0] * 0.7):, :]
    
    # Separate X & Y for training
    train_X = train.iloc[:, :32].values
    train_Y = train.iloc[:, 32].values
    
    # Separate X & Y for test
    test_X = test.iloc[:, :32].values
    test_Y = test.iloc[:, 32].values
    
    # Get the count of unique output labels
    num_classes = df.output.nunique()
    
    lb = LabelBinarizer()
    train_Y = lb.fit_transform(train_Y.tolist())
    # Reuse the binarizer fitted on the training labels; re-fitting on the
    # test labels could silently reorder the classes
    test_Y = lb.transform(test_Y.tolist())
    
    # Normalize the training data
    #train_X -= np.mean(train_X, axis=0)
    #train_X /= np.std(train_X, axis=0)
    #train_X /= 255
    
    # Normalize the test data
    #test_X -= np.mean(test_X, axis=0)
    #test_X /= np.std(test_X, axis=0)
    #test_X /= 255
    
    xg_train = xgb.DMatrix(train_X, label=train_Y)
    xg_test = xgb.DMatrix(test_X, label=test_Y)
    
    # setup parameters for xgboost
    param = {}
    # use softmax multi-class classification
    param['objective'] = 'multi:softmax'
    # step size shrinkage (learning rate)
    param['eta'] = 0.1
    param['max_depth'] = 6
    param['silent'] = 1
    param['nthread'] = 4
    param['num_class'] = num_classes
    
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    num_round = 5
    bst = xgb.train(param, xg_train, num_round, watchlist)
    #bst.dump_model('dump.raw.txt')
    # get prediction
    pred = bst.predict(xg_test)
    actual = np.argmax(test_Y, axis=1)
    error_rate = np.sum(pred != actual) / test_Y.shape[0]
    print('Test error using softmax = {}'.format(error_rate))
    
    # do the same thing again, but output probabilities
    param['objective'] = 'multi:softprob'
    bst = xgb.train(param, xg_train, num_round, watchlist)
    # Note: this convention has been changed since xgboost-unity
    # get prediction, this is in 1D array, need reshape to (ndata, nclass)
    pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
    pred_label = np.argmax(pred_prob, axis=1)
    actual_label = np.argmax(test_Y, axis=1)
    error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
    print('Test error using softprob = {}'.format(error_rate))
    

    However, I observe that it always predicts label 0, i.e. the first index in the one-hot encoded output.
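
    A quick check of which classes actually appear in pred (from the softmax run above) confirms this, e.g.:

    import numpy as np
    # multi:softmax returns the predicted class index per row;
    # count how often each class is predicted
    print(np.unique(pred, return_counts=True))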

    Output:

    [0] train-merror:0.11081    test-merror:0.111076
    [1] train-merror:0.11081    test-merror:0.111076
    [2] train-merror:0.11081    test-merror:0.111076
    [3] train-merror:0.111216   test-merror:0.111076
    [4] train-merror:0.11081    test-merror:0.111076
    Test error using softmax = 0.64846954875355
    [0] train-merror:0.11081    test-merror:0.111076
    [1] train-merror:0.11081    test-merror:0.111076
    [2] train-merror:0.11081    test-merror:0.111076
    [3] train-merror:0.111216   test-merror:0.111076
    [4] train-merror:0.11081    test-merror:0.111076
    Test error using softprob = 0.64846954875355
    

    pred_prob[0:10]
    array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
            0.07965304, 0.07965304, 0.07965304, 0.07965304],
           [0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
            0.07961877, 0.07961877, 0.07961877, 0.07961877],
           [0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
            0.08058234, 0.08058234, 0.08058234, 0.08058234],
           [0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
            0.07947975, 0.07947975, 0.07947975, 0.07947975],
           [0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
            0.08021881, 0.08021881, 0.08021881, 0.08021881],
           [0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
            0.07970817, 0.07970817, 0.07970817, 0.07970817],
           [0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
            0.07897293, 0.07897293, 0.07897293, 0.07897293],
           [0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
            0.07948799, 0.07948799, 0.07948799, 0.07948799],
           [0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
            0.07956778, 0.07956778, 0.07956778, 0.07956778],
           [0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
            0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)
    

    Whatever accuracy I am getting comes purely from predicting label 0, which accounts for roughly 35% of the data.
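
    That 35% figure can be checked against the majority-class baseline, e.g.:

    import numpy as np
    # distribution of the true labels in the test split
    labels, counts = np.unique(actual_label, return_counts=True)
    print(dict(zip(labels, counts)))
    # accuracy of a constant predictor that always outputs the most common class
    print('Majority-class accuracy = {:.3f}'.format(counts.max() / float(counts.sum())))

    This matches 1 minus the reported test error (1 - 0.648 ≈ 0.352), i.e. the model is doing no better than a constant predictor.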

    Is my expectation right? Are there too many input features and too little data for the model to learn properly?

    Full code: Here


1 Answer

Misan · answered 5 years ago

    For others who hit this problem like I did: check the xgb.train parameter "num_boost_round" and make sure it is equal to, or roughly the same as, the value used with xgb.cv. I think the problem is that the model has not been trained enough, i.e. boosting stopped too early.
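
    A minimal sketch of that check, reusing param, xg_train and xg_test from the question (num_boost_round=200, nfold=5 and the early-stopping window are arbitrary choices):

    # let cross-validation decide how many boosting rounds are actually needed
    cv_results = xgb.cv(param, xg_train, num_boost_round=200, nfold=5,
                        early_stopping_rounds=10, seed=0)
    best_rounds = len(cv_results)  # one row per boosting round that was kept
    print('Best number of rounds: %d' % best_rounds)

    # retrain with that many rounds instead of the question's num_round = 5
    bst = xgb.train(param, xg_train, num_boost_round=best_rounds,
                    evals=[(xg_train, 'train'), (xg_test, 'test')])

    With eta = 0.1 and only num_round = 5 as in the question, the trees barely move the scores away from the initial prior, which would be consistent with every row being assigned class 0.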