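I am working on a problem where I need to predict an output label from a given set of input values.
Since I don't have the real data yet, I am creating some dummy data so that my code is ready when the data arrives.
Below is what the sample data looks like. There is a set of input values, and the last column, "output", is the label to predict.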
input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27
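Because this is dummy data, I set the output label to the input that has the maximum value (for example, loc12 when input_12 is the largest).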
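For readers who want to reproduce this setup, the labeling rule can be sketched roughly as follows. This is only an illustration under assumptions: the row count, the value range, and the random seed are made up, since the post does not say how the dummy data was actually generated.

```python
import numpy as np

# Hypothetical sizes and value range; not taken from the original data.
n_rows, n_inputs = 1000, 32

rng = np.random.default_rng(0)
# Random integer inputs, one row per sample
X = rng.integers(90, 200, size=(n_rows, n_inputs))

# Label each row with the 1-based position of its largest input, e.g. "loc12".
# np.argmax returns the first position when several inputs tie for the maximum.
labels = np.array(["loc{}".format(i + 1) for i in X.argmax(axis=1)])
```

With 32 inputs this rule can produce up to 32 distinct labels, so the number of classes in a real run depends on how the inputs were drawn.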
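My expectation is that XGBoost should learn this rule on its own and predict the output label correctly.
I wrote the code below to train and test XGBoost.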
from __future__ import division
import numpy as np
import pandas as pd
import scipy.sparse
import pickle
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
df=pd.read_csv("data.txt", sep=',')
# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]
# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values
# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values
# Get the count of unique output labels
num_classes = df.output.nunique()
lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
# Reuse the binarizer fitted on the training labels so the class order matches
test_Y = lb.transform(test_Y.tolist())
# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255
# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255
xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)
# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (eta) and maximum tree depth
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))
# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))
However, I observe that it always predicts label 0, i.e. the first index in the one-hot encoded output.
Output:
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softprob = 0.64846954875355
pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
0.07965304, 0.07965304, 0.07965304, 0.07965304],
[0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
0.07961877, 0.07961877, 0.07961877, 0.07961877],
[0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
0.08058234, 0.08058234, 0.08058234, 0.08058234],
[0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
0.07947975, 0.07947975, 0.07947975, 0.07947975],
[0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
0.08021881, 0.08021881, 0.08021881, 0.08021881],
[0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
0.07970817, 0.07970817, 0.07970817, 0.07970817],
[0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
0.07897293, 0.07897293, 0.07897293, 0.07897293],
[0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
0.07948799, 0.07948799, 0.07948799, 0.07948799],
[0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
0.07956778, 0.07956778, 0.07956778, 0.07956778],
[0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)
Whatever accuracy I get comes entirely from predicting label 0, which makes up about 35% of the data.
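One way to confirm this kind of observation is to compare the model's error with the error of always predicting the most frequent class, and to count how often each class is predicted. The sketch below uses made-up toy labels, and the helper names `majority_baseline_error` and `prediction_counts` are hypothetical, not part of XGBoost or scikit-learn.

```python
import numpy as np

def majority_baseline_error(y_true):
    """Error rate obtained by always predicting the most frequent class."""
    y_true = np.asarray(y_true)
    counts = np.bincount(y_true)
    return 1.0 - counts.max() / y_true.size

def prediction_counts(pred, num_classes):
    """Number of times each class index appears in the predictions."""
    return np.bincount(np.asarray(pred).astype(int), minlength=num_classes)

# Toy integer labels (illustrative only): class 0 covers 4 of 10 samples
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 3, 3])
# Toy predictions from a model that always outputs class 0
toy_pred = np.zeros(10)

baseline = majority_baseline_error(toy_labels)
counts = prediction_counts(toy_pred, num_classes=4)
model_error = np.mean(toy_pred != toy_labels)
```

If `model_error` matches `baseline` and `counts` is nonzero for a single class only, the model has learned nothing beyond the class frequencies.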
Is my expectation right? Or are there too many input features and too little data for the model to learn properly?
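Before looking at feature count or data size, it may be worth checking the label encoding. XGBoost's multi-class objectives expect one integer class index per row, in the range 0 to num_class - 1, whereas the code above passes a one-hot matrix from LabelBinarizer. The sketch below shows integer encoding with plain NumPy on made-up labels; scikit-learn's LabelEncoder does the same job. Whether this is the actual cause here is an assumption, not something the post confirms.

```python
import numpy as np

# Made-up labels in the same "locN" style as the sample data
train_labels = np.array(["loc12", "loc27", "loc3", "loc12", "loc27"])
test_labels = np.array(["loc27", "loc3"])

# Learn the (sorted) class order from the training labels only;
# train_y holds one integer class index per row
classes, train_y = np.unique(train_labels, return_inverse=True)

# Map test labels using the same class order
# (assumes every test label also appears in the training labels)
test_y = np.searchsorted(classes, test_labels)

num_classes = len(classes)

# The 1-D integer arrays would then replace the one-hot matrices, e.g.
#   xg_train = xgb.DMatrix(train_X, label=train_y)
#   xg_test = xgb.DMatrix(test_X, label=test_y)
# and the error rate becomes np.mean(pred != test_y), with no argmax needed.
```

Keeping the labels one-dimensional also means `test_y` can be compared directly with the predictions from `multi:softmax`.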
Full code:
Here
Here