XGBoost
Using standard interface
The following notebook presents the basic usage of the native XGBoost Python interface.
Flight-plan:
- loading libraries,
- loading data,
- specifying training parameters,
- training the classifier,
- making predictions.
Loading libraries
Begin with loading all required libraries in one place:
import numpy as np
import xgboost as xgb
Loading data
We are going to use the bundled Agaricus dataset, which can be downloaded here.
This dataset records biological attributes of different mushroom species, and the target is to predict whether a given mushroom is poisonous.
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.
It consists of 8124 instances, characterized by 22 attributes (both numeric and categorical). The target class is either 0 or 1, which makes this a binary classification problem.
Important: XGBoost handles only numeric variables.
Luckily, all the data has already been pre-processed for us. Categorical variables have been encoded, and all instances have been divided into train and test datasets. You will learn how to do this on your own in later lectures.
Data needs to be stored in a DMatrix object, which is designed to handle sparse datasets. It can be populated in a couple of ways:
- using a libsvm-format text file,
- using a NumPy 2D array (most popular),
- using an XGBoost binary buffer file.
In this case we'll use the first option.
Libsvm files store only non-zero elements, in the format
<label> <feature_a>:<value_a> <feature_c>:<value_c> ... <feature_z>:<value_z>
Any missing features indicate that their corresponding values are 0.
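For example, a made-up row with label 1 whose only non-zero features are feature 3 (value 1) and feature 10 (value 0.5) would be stored as the single line 1 3:1 10:0.5; every feature not listed is treated as 0.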
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
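For comparison, the second option from the list above, populating a DMatrix from a NumPy 2D array, would look roughly like this (a minimal sketch using made-up toy data rather than the Agaricus files):

```python
# hypothetical toy data: 4 samples, 3 numeric features, binary labels
X_toy = np.array([[1.0, 0.0, 3.5],
                  [0.0, 2.0, 0.0],
                  [4.2, 0.0, 1.1],
                  [0.0, 0.0, 2.2]])
y_toy = np.array([0, 1, 0, 1])

# same kind of object as dtrain/dtest above
dtoy = xgb.DMatrix(X_toy, label=y_toy)
```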
Let’s examine what was loaded:
print("Train dataset contains {0} rows and {1} columns".format(dtrain.num_row(), dtrain.num_col()))
print("Test dataset contains {0} rows and {1} columns".format(dtest.num_row(), dtest.num_col()))
Train dataset contains 6513 rows and 127 columns
Test dataset contains 1611 rows and 127 columns
print("Train possible labels: ")
print(np.unique(dtrain.get_label()))
print("Test possible labels: ")
print(np.unique(dtest.get_label()))
Train possible labels:
[ 0. 1.]
Test possible labels:
[ 0. 1.]
Specify training parameters
Let's make the following assumptions and adjust the algorithm parameters to them:
- we are dealing with a binary classification problem ('objective':'binary:logistic'),
- we want shallow single trees with no more than 2 levels ('max_depth':2),
- we don't want any output ('silent':1),
- we want the algorithm to learn fast and aggressively ('eta':1),
- we want to iterate only 5 rounds.
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'silent': 1,
    'eta': 1
}

num_rounds = 5
Training classifier
To train the classifier we simply pass it the training dataset, the parameters list, and the number of iterations.
bst = xgb.train(params, dtrain, num_rounds)
We can also observe performance on the test dataset using a watchlist:
watchlist = [(dtest, 'test'), (dtrain, 'train')]  # native interface only
bst = xgb.train(params, dtrain, num_rounds, watchlist)
[0] test-error:0.042831 train-error:0.046522
[1] test-error:0.021726 train-error:0.022263
[2] test-error:0.006207 train-error:0.007063
[3] test-error:0.018001 train-error:0.0152
[4] test-error:0.006207 train-error:0.007063
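If you want to keep these per-round scores (e.g. for plotting) instead of just printing them, the native interface can store them in a dictionary passed through the evals_result argument of xgb.train. A minimal sketch, reusing the params, dtrain, num_rounds and watchlist defined above:

```python
evals_result = {}  # filled in-place: one list of scores per (dataset name, metric) pair
bst = xgb.train(params, dtrain, num_rounds, watchlist, evals_result=evals_result)

# e.g. {'error': [0.042831, 0.021726, ...]} -- keys depend on the evaluation metric used
print(evals_result['test'])
```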
Make predictions
preds_prob = bst.predict(dtest)
preds_prob
array([ 0.08073306, 0.92217326, 0.08073306, ..., 0.98059034,
0.01182149, 0.98059034], dtype=float32)
Calculate a simple accuracy metric to verify the results. Of course, proper validation should be tailored to the dataset, but in this case accuracy is sufficient.
labels = dtest.get_label()
preds = preds_prob > 0.5  # threshold predicted probabilities at 0.5
correct = int((preds == labels).sum())
print('Predicted correctly: {0}/{1}'.format(correct, len(preds)))
print('Error: {0:.4f}'.format(1 - correct / len(preds)))
Predicted correctly: 1601/1611
Error: 0.0062
Using Scikit-learn Interface
The following notebook presents an alternative approach to using the XGBoost algorithm.
What's included:
- loading libraries,
- loading data,
- specifying training parameters,
- training the classifier,
- making predictions,
- calculating the obtained error.
Loading libraries
Begin with loading all required libraries.
import numpy as np
from sklearn.datasets import load_svmlight_files
from xgboost import XGBClassifier
Loading data
We are going to use the same dataset as in the previous lecture. The scikit-learn package provides a convenient function, load_svmlight_files, capable of reading many libsvm files at once and storing them as SciPy sparse matrices.
X_train, y_train, X_test, y_test = load_svmlight_files(('../data/agaricus.txt.train', '../data/agaricus.txt.test'))
Examine what was loaded
print("Train dataset contains {0} rows and {1} columns".format(X_train.shape[0], X_train.shape[1]))
print("Test dataset contains {0} rows and {1} columns".format(X_test.shape[0], X_test.shape[1]))
Train dataset contains 6513 rows and 126 columns
Test dataset contains 1611 rows and 126 columns
print("Train possible labels: ")
print(np.unique(y_train))
print("Test possible labels: ")
print(np.unique(y_test))
Train possible labels:
[ 0. 1.]
Test possible labels:
[ 0. 1.]
Specify training parameters
All the parameters are set as in the previous example:
- we are dealing with a binary classification problem ('objective':'binary:logistic'),
- we want shallow single trees with no more than 2 levels ('max_depth':2),
- we don't want any output ('silent':1),
- we want the algorithm to learn fast and aggressively ('learning_rate':1, named eta in the native interface),
- we want to iterate only 5 rounds (n_estimators).
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'silent': 1,
    'learning_rate': 1,
    'n_estimators': 5
}
Training classifier
bst = XGBClassifier(**params).fit(X_train, y_train)
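The watchlist from the native interface has a rough counterpart here: fit accepts an eval_set of (X, y) pairs and reports an evaluation metric after every boosting round. A sketch (exactly where the metric is configured differs between xgboost versions):

```python
bst = XGBClassifier(**params).fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],  # evaluated after each round
    verbose=True,
)
```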
Make predictions
preds = bst.predict(X_test)
preds
array([ 0., 1., 0., ..., 1., 0., 1.])
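Note that predict returns class labels here. If you want probabilities like the ones produced by the native interface, the sklearn wrapper exposes predict_proba (a short sketch):

```python
preds_prob = bst.predict_proba(X_test)  # shape (n_samples, 2): P(class 0) and P(class 1) per row
print(preds_prob[:3, 1])                # probability of the positive class for the first 3 rows
```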
Calculate the obtained error
correct = 0
for i in range(len(preds)):
    if y_test[i] == preds[i]:
        correct += 1

print('Predicted correctly: {0}/{1}'.format(correct, len(preds)))
print('Error: {0:.4f}'.format(1 - correct / len(preds)))
Predicted correctly: 1601/1611
Error: 0.0062
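The same numbers can also be obtained with scikit-learn's own metric helpers, which is the more idiomatic route when using this interface. A minimal sketch:

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, preds)
print('Accuracy: {0:.4f}'.format(accuracy))
print('Error: {0:.4f}'.format(1 - accuracy))
```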