XGBoost bias/variance trade-off and hyper-parameter tuning

Bias/variance trade-off

The following notebook presents a visual explanation of how to deal with the bias/variance trade-off, a common machine learning problem.

What you will learn:

  • what bias and variance errors are,
  • how underfitting and overfitting relate to them,
  • how to detect the problem using validation curves,
  • how to deal with high variance and high bias in XGBoost

Bias and variance

There are two general types of errors made by classifiers - bias and variance errors.

Bias error is the overall difference between the model's expected predictions and the true values.

Variance error describes how much the predictions for a given point vary between different realizations of the model.
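
For squared-error loss this intuition has a standard formalization: with data generated as $y = f(x) + \varepsilon$ (noise variance $\sigma^2$) and a model $\hat{f}(x)$ retrained on different samples, the expected error at a point $x$ decomposes as

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma^2$$

so the two errors add up, and reducing one often comes at the cost of the other.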

The desired state is when both errors are as low as possible. The graphic taken from Scott Fortmann-Roe’s blog visualizes the issue really well. Imagine that the center of the target is the perfect model. We iteratively repeat our experiment, recreating the model and using it on the same data points.

Underfitting and overfitting

Knowing the errors introduced by bias and variance, we can proceed to how these relate to training the model. We will use the plot taken from the scikit-learn docs to help us visualize the underfitting and overfitting issues.

This simple example tries to fit a polynomial regression to predict future price. It’s obvious that for $d=1$ the model is too simple (it underfits the data), and for $d=6$ it is just the opposite (overfitting).

For underfitting we say that the model suffers from high bias (it is too simple) and low variance.

For overfitting we say that the model suffers from high variance (it is over-complicated and unstable) and low bias.

How to detect it

To quantify the effects described above, we are going to train the model several times, each time with a different parameter value. Let’s say we would like to find the optimal number of trees - we don’t want the model to be too simple, but we also don’t want to over-complicate it.

The plan is as follows, we will:

  • generate a complicated binary classification dataset,
  • use the scikit-learn wrapper for XGBoost,
  • train the model for different numbers of trees (n_estimators) using stratified 10-fold CV,
  • plot the train/test errors

Begin with loading the required libraries and setting the random seed:

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import validation_curve
from sklearn.datasets import load_svmlight_files
from sklearn.cross_validation import StratifiedKFold
from sklearn.datasets import make_classification
from xgboost.sklearn import XGBClassifier
from scipy.sparse import vstack

# reproducibility
seed = 123
np.random.seed(seed)

Now generate an artificial dataset:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

We will divide the data into 10 stratified folds (the same distribution of labels in each fold) for testing:

cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=seed)

Let’s check how the number of trees influences the prediction accuracy.

default_params = {
    'objective': 'binary:logistic',
    'max_depth': 1,
    'learning_rate': 0.3,
    'silent': 1.0
}

n_estimators_range = np.linspace(1, 200, 10).astype('int')

train_scores, test_scores = validation_curve(
    XGBClassifier(**default_params),
    X, y,
    param_name='n_estimators',
    param_range=n_estimators_range,
    cv=cv,
    scoring='accuracy'
)

Show the validation curve plot

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig = plt.figure(figsize=(10, 6), dpi=100)

plt.title("Validation Curve with XGBoost (eta = 0.3)")
plt.xlabel("number of trees")
plt.ylabel("Accuracy")
plt.ylim(0.7, 1.1)

plt.plot(n_estimators_range,
         train_scores_mean,
         label="Training score",
         color="r")

plt.plot(n_estimators_range,
         test_scores_mean,
         label="Cross-validation score",
         color="g")

plt.fill_between(n_estimators_range,
                 train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std,
                 alpha=0.2, color="r")

plt.fill_between(n_estimators_range,
                 test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std,
                 alpha=0.2, color="g")

plt.axhline(y=1, color='k', ls='dashed')

plt.legend(loc="best")
plt.show()

i = np.argmax(test_scores_mean)
print("Best cross-validation result ({0:.2f}) obtained for {1} trees".format(test_scores_mean[i], n_estimators_range[i]))

[Figure: validation curve with XGBoost (eta = 0.3) - training vs. cross-validation accuracy by number of trees]

Best cross-validation result (0.90) obtained for 89 trees

Looking at the plot we can draw the following conclusions:

  • the training score keeps growing as new trees are added, but from a certain point the CV score stays flat,
  • for fewer than 25 trees the bias is high and the variance is at its lowest,
  • from about 25 trees onwards the variance increases slightly while the CV score holds steady (there is no point in adding extra trees / complexity),
  • the model is quite stable overall, keeping its variance roughly fixed as its complexity increases

We can assume that the trade-off for our model will be met at around n_estimators = 50. The variance is still too high.

What can we do?

Dealing with high variance

If the model is too complex, try:

  • using fewer features (e.g. via feature selection),
  • using more training samples (e.g. artificially generated ones),
  • increasing regularization (adding penalties for extra complexity)

In XGBoost you can try to (see the sketch after this list):

  • reduce the depth of each tree (max_depth),
  • increase the min_child_weight parameter,
  • increase the gamma parameter,
  • add more randomness using the subsample and colsample_bytree parameters,
  • increase the lambda and alpha regularization parameters
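
A minimal sketch of what those knobs look like in practice, assuming the same XGBClassifier wrapper used above (where lambda and alpha are exposed as reg_lambda and reg_alpha); the values below are purely illustrative and would still need tuning:

# Illustrative only: a configuration that pushes toward lower variance;
# the exact values are assumptions, not tuned settings.
lower_variance_params = {
    'objective': 'binary:logistic',
    'max_depth': 1,           # shallower trees
    'min_child_weight': 5,    # require more evidence per leaf before splitting
    'gamma': 1,               # minimum loss reduction required to make a split
    'subsample': 0.7,         # use 70% of rows for each tree
    'colsample_bytree': 0.6,  # use 60% of features for each tree
    'reg_lambda': 5,          # L2 regularization (lambda)
    'reg_alpha': 1            # L1 regularization (alpha)
}

model = XGBClassifier(**lower_variance_params, seed=seed)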

Dealing with high bias

If the model is too simple, try:

  • adding more features (e.g. better feature engineering),
  • using a more sophisticated model,
  • decreasing regularization

In XGBoost you can do it by:

  • increasing the depth of each tree (max_depth),
  • decreasing the min_child_weight parameter,
  • decreasing the gamma parameter,
  • decreasing the lambda and alpha regularization parameters

Let’s try to tweak the parameters a little. We are going to add some randomness - for each tree we will use 70% of randomly chosen samples and 60% of randomly chosen features. This should help reduce the variance. To decrease the bias (and increase accuracy) we will also add an extra level to each tree.

default_params = {
    'objective': 'binary:logistic',
    'max_depth': 2, # changed
    'learning_rate': 0.3,
    'silent': 1.0,
    'colsample_bytree': 0.6, # added
    'subsample': 0.7 # added
}

n_estimators_range = np.linspace(1, 200, 10).astype('int')

train_scores, test_scores = validation_curve(
    XGBClassifier(**default_params),
    X, y,
    param_name='n_estimators',
    param_range=n_estimators_range,
    cv=cv,
    scoring='accuracy'
)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig = plt.figure(figsize=(10, 6), dpi=100)

plt.title("Validation Curve with XGBoost (eta = 0.3)")
plt.xlabel("number of trees")
plt.ylabel("Accuracy")
plt.ylim(0.7, 1.1)

plt.plot(n_estimators_range,
         train_scores_mean,
         label="Training score",
         color="r")

plt.plot(n_estimators_range,
         test_scores_mean,
         label="Cross-validation score",
         color="g")

plt.fill_between(n_estimators_range,
                 train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std,
                 alpha=0.2, color="r")

plt.fill_between(n_estimators_range,
                 test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std,
                 alpha=0.2, color="g")

plt.axhline(y=1, color='k', ls='dashed')

plt.legend(loc="best")
plt.show()

i = np.argmax(test_scores_mean)
print("Best cross-validation result ({0:.2f}) obtained for {1} trees".format(test_scores_mean[i], n_estimators_range[i]))

[Figure: validation curve with XGBoost (eta = 0.3) after the parameter tweaks]

Best cross-validation result (0.92) obtained for 133 trees

We have obtained slightly lower variance and a smaller bias.
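
To lock in this configuration, one option (a small sketch reusing default_params, cv and the best number of trees found above) is to score it directly with cross_val_score:

from sklearn.cross_validation import cross_val_score

# Evaluate the tweaked parameters with the number of trees fixed at the
# best value found above (133); the mean accuracy should be close to the
# value read off the validation curve.
final_model = XGBClassifier(n_estimators=133, **default_params)
scores = cross_val_score(final_model, X, y, cv=cv, scoring='accuracy')
print("Accuracy: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std()))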

Hyper-parameter tuning

As you know, there are plenty of tunable parameters, and each one affects the output. The question is which combination gives the best results.

The following notebook will show you how to use Scikit-learn modules for figuring out the best parameters for your models.

What’s included:

  • preparing the data and the cross-validation strategy,
  • exhaustive search with GridSearchCV,
  • randomized search with RandomizedSearchCV

Prepare data

Let’s begin with loading all required libraries in one place and setting the seed for reproducibility.

import numpy as np

from xgboost.sklearn import XGBClassifier

from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold

from scipy.stats import randint, uniform

# reproducibility
seed = 342
np.random.seed(seed)

Generate an artificial dataset:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define the cross-validation strategy for testing. Let’s use StratifiedKFold, which guarantees that the target label is equally distributed across each fold:

cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=seed)

In grid-search we start by defining a dictionary holding the possible parameter values we want to test. All combinations will be evaluated.

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

Add a dictionary for fixed parameters.

params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

Create a GridSearchCV estimator. We will be looking for the combination giving the best accuracy.

bst_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy'
)

Before running the calculations, notice that $3 \times 4 \times 3 \times 10 = 360$ models will be created to test all combinations (3 values of max_depth, 4 of n_estimators, 3 of learning_rate, each evaluated on 10 folds). You should always have a rough estimation of what is going to happen.
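
For example, a quick way to compute that number up front (a small illustrative snippet reusing the params_grid defined above):

# Number of parameter combinations times 10 CV folds.
n_combinations = np.prod([len(values) for values in params_grid.values()])
print("{0} combinations x 10 folds = {1} models".format(n_combinations, n_combinations * 10))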

bst_grid.fit(X, y)
GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[0 1 ..., 1 1], n_folds=10, shuffle=True, random_state=342),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=342, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

Now we can look at all the obtained scores and try to see manually what matters and what does not. At a quick glance it looks like the larger n_estimators is, the higher the accuracy.

bst_grid.grid_scores_
[mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 5, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 10, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 25, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 50, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 5, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 10, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 25, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 50, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 5, 'max_depth': 3},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 10, 'max_depth': 3},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 25, 'max_depth': 3},
 mean: 0.49800, std: 0.00245, params: {'learning_rate': 9.9999999999999998e-17, 'n_estimators': 50, 'max_depth': 3},
 mean: 0.84100, std: 0.03515, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 1},
 mean: 0.87300, std: 0.03374, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 1},
 mean: 0.89200, std: 0.03375, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 1},
 mean: 0.90200, std: 0.03262, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 1},
 mean: 0.86400, std: 0.04665, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 2},
 mean: 0.89400, std: 0.04189, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 2},
 mean: 0.92200, std: 0.02584, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 2},
 mean: 0.92000, std: 0.02233, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 2},
 mean: 0.89700, std: 0.03904, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 3},
 mean: 0.92000, std: 0.02864, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 3},
 mean: 0.92300, std: 0.02193, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 3},
 mean: 0.92400, std: 0.02255, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 3},
 mean: 0.83500, std: 0.04939, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 1},
 mean: 0.86800, std: 0.03386, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 1},
 mean: 0.89500, std: 0.02720, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 1},
 mean: 0.90500, std: 0.02783, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 1},
 mean: 0.87800, std: 0.03342, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 2},
 mean: 0.90800, std: 0.04261, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 2},
 mean: 0.91000, std: 0.03632, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 2},
 mean: 0.91300, std: 0.02449, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 2},
 mean: 0.90500, std: 0.03112, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 3},
 mean: 0.91700, std: 0.02729, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 3},
 mean: 0.92700, std: 0.03342, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 3},
 mean: 0.93300, std: 0.02581, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 3}]

If there are many results, instead of filtering them manually we can read the best combination off directly:

print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters:")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))
Best accuracy obtained: 0.933
Parameters:
    learning_rate: 1.0
    n_estimators: 50
    max_depth: 3

Looking for the best parameters is an iterative process. You should start with coarse granularity and then move to more detailed values.
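
For example (an illustrative second pass, not part of the original search), since the coarse grid above pointed at learning_rate = 1.0, max_depth = 3 and n_estimators = 50, a follow-up grid could zoom in around those values:

# A finer, hypothetical grid around the best coarse result; values are illustrative.
params_grid_fine = {
    'max_depth': [3, 4],
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.3, 0.5, 1.0]
}

bst_grid_fine = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid_fine,
    cv=cv,
    scoring='accuracy'
)
# bst_grid_fine.fit(X, y) would evaluate 2 * 3 * 3 * 10 = 180 models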

When the number of parameters and their possible values grows, the traditional grid-search approach quickly becomes ineffective. A possible solution is to randomly pick parameter values from specified distributions. While it’s not an exhaustive search, it’s worth giving it a shot.

Create a parameters distribution dictionary:

params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001),  # discrete uniform distribution over [1, 1000]
    'learning_rate': uniform(),        # continuous uniform distribution over [0, 1)
    'subsample': uniform(),            # continuous uniform distribution over [0, 1)
    'colsample_bytree': uniform()      # continuous uniform distribution over [0, 1)
}

Initialize RandomizedSearchCV to randomly pick 10 parameter combinations. With this approach you can easily control the number of tested models.

rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    scoring='accuracy',
    random_state=seed
)

Fit the classifier:

rs_grid.fit(X, y)
RandomizedSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[0 1 ..., 1 1], n_folds=10, shuffle=True, random_state=342),
          error_score='raise',
          estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=342, silent=1, subsample=1),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff81c63b400>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff827da40f0>, 'gamma': [0, 0.5, 1], 'colsample_bytree': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff81c63b748>, 'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff84c690160>, 'max_depth': [1, 2, 3, 4]},
          pre_dispatch='2*n_jobs', random_state=342, refit=True,
          scoring='accuracy', verbose=0)

One more time, take a look at the chosen parameters and their accuracy scores:

rs_grid.grid_scores_
[mean: 0.80200, std: 0.02403, params: {'subsample': 0.11676744056370758, 'n_estimators': 492, 'gamma': 0, 'colsample_bytree': 0.065034396841929132, 'learning_rate': 0.82231421953113004, 'max_depth': 3},
 mean: 0.90800, std: 0.02534, params: {'subsample': 0.4325346125891868, 'n_estimators': 689, 'gamma': 1, 'colsample_bytree': 0.11848249237448605, 'learning_rate': 0.13214054942810016, 'max_depth': 1},
 mean: 0.86400, std: 0.03584, params: {'subsample': 0.15239319471904489, 'n_estimators': 392, 'gamma': 0, 'colsample_bytree': 0.37621772642449514, 'learning_rate': 0.61087022642994204, 'max_depth': 4},
 mean: 0.90100, std: 0.02794, params: {'subsample': 0.70993001900730734, 'n_estimators': 574, 'gamma': 1, 'colsample_bytree': 0.20992824607318106, 'learning_rate': 0.40898494335099522, 'max_depth': 1},
 mean: 0.91200, std: 0.02440, params: {'subsample': 0.93610608633544701, 'n_estimators': 116, 'gamma': 1, 'colsample_bytree': 0.22187963515640408, 'learning_rate': 0.82924717948414195, 'max_depth': 2},
 mean: 0.92900, std: 0.01577, params: {'subsample': 0.76526283302535481, 'n_estimators': 281, 'gamma': 0, 'colsample_bytree': 0.80580143163765727, 'learning_rate': 0.46363095388213049, 'max_depth': 4},
 mean: 0.89900, std: 0.03200, params: {'subsample': 0.1047221390561941, 'n_estimators': 563, 'gamma': 1, 'colsample_bytree': 0.4649668429588838, 'learning_rate': 0.0056355243866283988, 'max_depth': 4},
 mean: 0.89300, std: 0.02510, params: {'subsample': 0.70326840897694187, 'n_estimators': 918, 'gamma': 0.5, 'colsample_bytree': 0.50136727776346701, 'learning_rate': 0.32309692902992948, 'max_depth': 1},
 mean: 0.90300, std: 0.03573, params: {'subsample': 0.40219949856580106, 'n_estimators': 665, 'gamma': 1, 'colsample_bytree': 0.32232842572609355, 'learning_rate': 0.87857246352479834, 'max_depth': 4},
 mean: 0.88900, std: 0.02604, params: {'subsample': 0.18284845802969663, 'n_estimators': 771, 'gamma': 1, 'colsample_bytree': 0.65705813574097693, 'learning_rate': 0.44206350002617856, 'max_depth': 3}]

There are also some handy attributes that allow you to quickly analyze the best estimator, its parameters and the obtained score:

rs_grid.best_estimator_
XGBClassifier(base_score=0.5, colsample_bylevel=1,
       colsample_bytree=0.80580143163765727, gamma=0,
       learning_rate=0.46363095388213049, max_delta_step=0, max_depth=4,
       min_child_weight=1, missing=None, n_estimators=281, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=342, silent=1,
       subsample=0.76526283302535481)
rs_grid.best_params_
{'colsample_bytree': 0.80580143163765727,
 'gamma': 0,
 'learning_rate': 0.46363095388213049,
 'max_depth': 4,
 'n_estimators': 281,
 'subsample': 0.76526283302535481}
rs_grid.best_score_
0.92900000000000005
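
Because refit=True (the default, visible in the output above), best_estimator_ has already been refitted on the whole dataset, so it can be used for predictions right away; for example:

# Illustrative usage: predictions from the refitted best model.
best_model = rs_grid.best_estimator_
print(best_model.predict(X[:5]))        # predicted labels for the first 5 samples
print(best_model.predict_proba(X[:5]))  # predicted class probabilities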