  Nikhil Utane  · 6 years ago

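    I am working on a problem that involves predicting an output label from a given set of input values. Since I don't have the real data yet, I am creating some dummy data so that the code is ready when the data arrives. Below is what the sample data looks like. There is a set of input values, and the last column, "output", is the label to predict.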


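    Because this is dummy data, I set the output label to the input that has the maximum value. My expectation is that the XGBoost algorithm should learn this on its own and predict the output label correctly.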
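
    For reference, dummy data of this kind could be generated roughly as follows. This is only a sketch: the column names in0..in31, the 0-255 value range, and the choice of the index of the largest input as the label are assumptions, since the original data file is not shown.

```python
import csv
import random


def make_rows(n_rows, n_inputs=32, seed=0):
    """Return rows of n_inputs random integers plus a trailing label.

    The label is the index of the largest input (the first one on ties).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        inputs = [rng.randint(0, 255) for _ in range(n_inputs)]
        label = inputs.index(max(inputs))
        rows.append(inputs + [label])
    return rows


def write_csv(path, rows):
    """Write rows with a header whose last column is named 'output'."""
    n_inputs = len(rows[0]) - 1
    # Hypothetical input column names; only 'output' appears in the post.
    header = ["in{}".format(i) for i in range(n_inputs)] + ["output"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

    Calling write_csv("data.txt", make_rows(10000)) would produce a comma-separated file that pd.read_csv("data.txt", sep=',') can load.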


    from __future__ import division
    import numpy as np
    import pandas as pd
    import scipy.sparse
    import pickle
    import xgboost as xgb
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
    df=pd.read_csv("data.txt", sep=',')
    # Create training and validation sets
    sz = df.shape
    train = df.iloc[:int(sz[0] * 0.7), :]
    test = df.iloc[int(sz[0] * 0.7):, :]
    # Separate X & Y for training
    train_X = train.iloc[:, :32].values
    train_Y = train.iloc[:, 32].values
    # Separate X & Y for test
    test_X = test.iloc[:, :32].values
    test_Y = test.iloc[:, 32].values
    # Get the count of  unique output labels
    num_classes = df.output.nunique()
    lb = LabelBinarizer()
    train_Y = lb.fit_transform(train_Y.tolist())
    test_Y = lb.fit_transform(test_Y.tolist())
    # Normalize the training data
    #train_X -= np.mean(train_X, axis=0)
    #train_X /= np.std(train_X, axis=0)
    #train_X /= 255
    # Normalize the test data
    #test_X -= np.mean(test_X, axis=0)
    #test_X /= np.std(test_X, axis=0)
    #test_X /= 255
    xg_train = xgb.DMatrix(train_X, label=train_Y)
    xg_test = xgb.DMatrix(test_X, label=test_Y)
    # setup parameters for xgboost
    param = {}
    # use softmax multi-class classification
    param['objective'] = 'multi:softmax'
    # learning rate (eta) and maximum tree depth
    param['eta'] = 0.1
    param['max_depth'] = 6
    param['silent'] = 1
    param['nthread'] = 4
    param['num_class'] = num_classes
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    num_round = 5
    bst = xgb.train(param, xg_train, num_round, watchlist)
    # get prediction
    pred = bst.predict(xg_test)
    actual = np.argmax(test_Y, axis=1)
    error_rate = np.sum(pred != actual) / test_Y.shape[0]
    print('Test error using softmax = {}'.format(error_rate))
    # do the same thing again, but output probabilities
    param['objective'] = 'multi:softprob'
    bst = xgb.train(param, xg_train, num_round, watchlist)
    # Note: this convention has been changed since xgboost-unity
    # get prediction, this is in 1D array, need reshape to (ndata, nclass)
    pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
    pred_label = np.argmax(pred_prob, axis=1)
    actual_label = np.argmax(test_Y, axis=1)
    error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
    print('Test error using softprob = {}'.format(error_rate))



    [0] train-merror:0.11081    test-merror:0.111076
    [1] train-merror:0.11081    test-merror:0.111076
    [2] train-merror:0.11081    test-merror:0.111076
    [3] train-merror:0.111216   test-merror:0.111076
    [4] train-merror:0.11081    test-merror:0.111076
    Test error using softmax = 0.64846954875355
    [0] train-merror:0.11081    test-merror:0.111076
    [1] train-merror:0.11081    test-merror:0.111076
    [2] train-merror:0.11081    test-merror:0.111076
    [3] train-merror:0.111216   test-merror:0.111076
    [4] train-merror:0.11081    test-merror:0.111076
    Test error using softprob = 0.64846954875355

    array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
            0.07965304, 0.07965304, 0.07965304, 0.07965304],
           [0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
            0.07961877, 0.07961877, 0.07961877, 0.07961877],
           [0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
            0.08058234, 0.08058234, 0.08058234, 0.08058234],
           [0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
            0.07947975, 0.07947975, 0.07947975, 0.07947975],
           [0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
            0.08021881, 0.08021881, 0.08021881, 0.08021881],
           [0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
            0.07970817, 0.07970817, 0.07970817, 0.07970817],
           [0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
            0.07897293, 0.07897293, 0.07897293, 0.07897293],
           [0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
            0.07948799, 0.07948799, 0.07948799, 0.07948799],
           [0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
            0.07956778, 0.07956778, 0.07956778, 0.07956778],
           [0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
            0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)



    Full code: Here


  •   Misan    5 years ago

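    For anyone else who runs into this problem as I did: check the "num_boost_round" parameter of xgb.train. Make sure it is equal to, or roughly the same as, the value used with xgb.cv. I think the problem is that the model has not been trained enough yet, that is, training stopped too early.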