
Is standardization (or scaling) useful for regression with gradient tree boosting?

kadee · asked 6 years ago

I have read that normalization is not needed when using gradient tree boosting (see Should I need to normalize (or scale) the data for Random forest (drf) or Gradient Boosting Machine (GBM) in H2O or in general?, https://github.com/dmlc/xgboost/issues/357).

However, when using xgboost for regression, I found that scaling the target has a noticeable effect on the in-sample error of the predictions. What is the reason for this?

An example with the Boston housing dataset:

    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import load_boston

    boston = load_boston()
    y = boston['target']
    X = boston['data']

    # Fit on the rescaled target, invert the scaling on the predictions,
    # and compare the in-sample MSE across several orders of magnitude.
    for scale in np.logspace(-6, 6, 7):
        xgb_model = xgb.XGBRegressor().fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))
    
    2.3432734454908335 (scale=1e-06)
    2.343273977065266 (scale=0.0001)
    2.3432793874455315 (scale=0.01)
    2.290595204136888 (scale=1.0)
    2.528513393507719 (scale=100.0)
    7.228978353091473 (scale=10000.0)
    272.29640759874474 (scale=1000000.0)
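
One hypothesis I considered (just a guess, not something from the xgboost docs): several defaults are expressed in absolute units of the target, e.g. the initial prediction base_score=0.5 and the L2 leaf penalty reg_lambda=1, so rescaling y changes how far boosting has to travel from its starting point and how strongly leaf weights are shrunk. A quick check along these lines, using only the standard base_score parameter of XGBRegressor:

    # Sketch of the hypothesis above: start boosting from the mean of the
    # rescaled target instead of the absolute default base_score=0.5, and see
    # whether the scale dependence shrinks.
    for scale in np.logspace(-6, 6, 7):
        xgb_model = xgb.XGBRegressor(base_score=float(np.mean(y / scale))).fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))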
    

When 'reg:gamma' is used as the objective function (instead of the default 'reg:linear'), the effect of scaling y becomes much larger:

    # Same experiment with the gamma objective, which assumes positive targets.
    for scale in np.logspace(-6, 6, 7):
        xgb_model = xgb.XGBRegressor(objective='reg:gamma').fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))
    
    591.6509503519147 (scale=1e-06)
    545.8298971540023 (scale=0.0001)
    37.68688286293508 (scale=0.01)
    4.039819858716935 (scale=1.0)
    2.505477263590776 (scale=100.0)
    198.94093800190453 (scale=10000.0)
    592.1469169959003 (scale=1000000.0)
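
Whatever the root cause, the rescaling can at least be made explicit by folding it into the estimator. A minimal sketch, assuming scikit-learn's TransformedTargetRegressor is available (the fixed scale=100.0 is just the best-performing value from the run above, not a principled choice):

    from sklearn.compose import TransformedTargetRegressor

    # Wrap the y-scaling into the estimator so fit/predict handle it transparently.
    scale = 100.0
    model = TransformedTargetRegressor(
        regressor=xgb.XGBRegressor(objective='reg:gamma'),
        func=lambda y: y / scale,          # applied to y before fitting
        inverse_func=lambda y: y * scale,  # applied to the raw predictions
    )
    model.fit(X, y)
    print(mean_squared_error(y, model.predict(X)))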
    
0 answers  |  as of 6 years ago