Gradient Boosted Trees
Details
Given \(n\) \(p\)-dimensional feature vectors \(X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np}) \}\) and \(n\) responses \(Y = \{y_1, \ldots, y_n \}\), the problem is to build a gradient boosted trees classification or regression model.
The tree ensemble model uses \(M\) additive functions to predict the output \(\hat{y_i}=f(x_i)=\sum_{k=1}^{M}f_k(x_i)\), \(f_k\in F\), where \(F=\left\{f(x)=w_{q(x)},\ q: R^{p}\to T,\ w\in R^{T}\right\}\) is the space of regression trees, \(T\) is the number of leaves in the tree, \(w\) is the vector of leaf weights, and \(w_j\) is the score of the \(j\)-th leaf. \(q(x)\) represents the structure of each tree: it maps an observation to the corresponding leaf index.
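The notation above can be made concrete with a small sketch. The nested-dictionary tree layout and the function names below are hypothetical, chosen only for illustration; they are not the library's data structures. `leaf_index` plays the role of \(q(x)\), and the list `w` holds the leaf weights.

```python
# Hypothetical layout: an internal node stores a feature index, a threshold
# and two children; a leaf stores only its index into the weight vector w.
def leaf_index(node, x):
    """q(x): follow the splits until a leaf is reached, return its index."""
    while "leaf" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


def tree_output(tree, x):
    """f_k(x) = w[q(x)] for a single regression tree."""
    return tree["w"][leaf_index(tree["root"], x)]


def ensemble_output(trees, x):
    """Sum of the outputs of all trees in the ensemble."""
    return sum(tree_output(tree, x) for tree in trees)


# Two depth-1 trees (T = 2 leaves each) over 2-dimensional inputs.
stump_a = {
    "root": {"feature": 0, "threshold": 1.0,
             "left": {"leaf": 0}, "right": {"leaf": 1}},
    "w": [-0.5, 0.5],
}
stump_b = {
    "root": {"feature": 1, "threshold": 0.0,
             "left": {"leaf": 0}, "right": {"leaf": 1}},
    "w": [0.25, -0.25],
}

prediction = ensemble_output([stump_a, stump_b], [2.0, -1.0])
```

For the query `[2.0, -1.0]`, the first tree routes to its right leaf (weight 0.5) and the second to its left leaf (weight 0.25), so the ensemble output is their sum.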
The training procedure is an iterative functional gradient descent algorithm that minimizes the objective function over the function space by iteratively choosing a function (regression tree) that points in the negative gradient direction. The objective function is
\[L(f)=\sum_{i=1}^{n} l(y_i, f(x_i)) + \sum_{k=1}^{M} \Omega(f_k)\]
where \(l(f)\) is a twice differentiable convex loss function and \(\Omega(f) = \gamma T + \frac{1}{2}\lambda ||w||^2\) is a regularization term that penalizes the complexity of the model, defined by the number of leaves \(T\) and the L2 norm of the weights \(||w||\) for each tree. \(\gamma\) and \(\lambda\) are regularization parameters.
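As a rough illustration of how the loss and the regularization term combine, the sketch below evaluates a regularized objective for given predictions. It assumes squared loss and the squared L2 norm of the leaf weights, as in [Chen2016]; the function names are hypothetical.

```python
def omega(leaf_weights, gamma, lam):
    """Penalty for one tree: gamma * T + 0.5 * lambda * ||w||^2."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def squared_loss(y, y_hat):
    """Assumed loss for this sketch: l = 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2


def objective(y, y_hat, leaf_weights_per_tree, gamma, lam):
    """Summed loss over observations plus the penalty of every tree."""
    data_term = sum(squared_loss(a, b) for a, b in zip(y, y_hat))
    penalty = sum(omega(w, gamma, lam) for w in leaf_weights_per_tree)
    return data_term + penalty


# One tree with two leaves, two observations.
value = objective([1.0, 0.0], [0.5, 0.5], [[0.5, -0.5]], gamma=1.0, lam=2.0)
```

Here the data term is 0.25 and the single tree contributes a penalty of 2.5, so a larger \(\gamma\) or \(\lambda\) makes additional leaves or larger leaf weights more expensive.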
Training Stage
The library uses the second-order approximation of the objective function (up to a term that does not depend on \(f_k\))
\[L^{(k)}(f)\approx \sum_{i=1}^{n}\left(g_i f_k(x_i)+\frac{1}{2}h_i f_k^{2}(x_i)\right)+\Omega(f_k)\]
where \(g_i= \frac{\partial l(y_i,\hat{y_i}^{(k-1)})}{\partial \hat{y_i}^{(k-1)}}\), \(h_i= \frac{\partial^{2}l(y_i, \hat{y_i}^{(k-1)})}{\partial \left(\hat{y_i}^{(k-1)}\right)^{2}}\), and the following algorithmic framework for the training stage.
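To make \(g_i\) and \(h_i\) concrete, here is a minimal sketch of the first and second derivatives for two common twice differentiable convex losses. The choice of losses and the function names are assumptions made for illustration; the text above does not say which losses the library provides.

```python
import math


def squared_loss_grad_hess(y, y_hat):
    """l = 0.5 * (y - y_hat)^2, so g = y_hat - y and h = 1."""
    return y_hat - y, 1.0


def logistic_loss_grad_hess(y, y_hat):
    """Log loss on a raw score y_hat with label y in {0, 1}.

    With p = sigmoid(y_hat): g = p - y and h = p * (1 - p).
    """
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)


g_sq, h_sq = squared_loss_grad_hess(3.0, 2.5)
g_log, h_log = logistic_loss_grad_hess(1.0, 0.0)
```

Both derivatives are taken with respect to the previous prediction \(\hat{y_i}^{(k-1)}\), so they are recomputed at every boosting iteration.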
Let \(S = (X, Y)\) be the set of observations. Given the training parameters, such as the number of iterations \(M\), loss function \(l(f)\), regression tree training parameters, regularization parameters \(\gamma\) and \(\lambda\), shrinkage (learning rate) parameter \(\theta\), the algorithm does the following:
1. Find an initial guess \(\hat{y_i}^{(0)}\), \(i = 1, \ldots, n\).
2. For \(k = 1, \ldots , M\):
   1. Update \(g_i\) and \(h_i\), \(i = 1, \ldots, n\).
   2. Grow a regression tree \(f_k\in F\) that minimizes the objective function \(-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda }+\gamma T\), where \(G_j=\sum_{i\in I_j}g_i\), \(H_j=\sum_{i\in I_j}h_i\), \(I_j= \{i \mid q(x_i)=j\}\), \(j=1, \ldots, T\).
   3. Assign an optimal weight \(w_j^{*}= -\frac{G_j}{H_j +\lambda }\) to the leaf \(j\), \(j = 1, \ldots, T\).
   4. Apply the shrinkage parameter \(\theta\) to the tree leaves and add the tree to the model.
   5. Update \(\hat{y_i}^{(k)}\).
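The steps above can be sketched as a small boosting loop. This is a minimal illustration, not the library's implementation. It assumes squared loss, depth-1 trees, the mean response as the initial guess, and no \(\gamma\) penalty or subsampling. Leaf weights use the negative ratio \(-G_j/(H_j+\lambda)\), which minimizes the quadratic approximation in [Chen2016]. All function names are hypothetical.

```python
def grad_hess(y, y_hat):
    # Squared loss l = 0.5 * (y - y_hat)^2 (assumed for this sketch).
    return y_hat - y, 1.0


def fit_stump(X, g, h, lam):
    """Grow a depth-1 tree: best split by gain, then optimal leaf weights."""
    G, H = sum(g), sum(h)
    best = None
    for feature in range(len(X[0])):
        # Every distinct value except the largest is a candidate threshold.
        for threshold in sorted({row[feature] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[feature] <= threshold]
            GL = sum(g[i] for i in left)
            HL = sum(h[i] for i in left)
            GR, HR = G - GL, H - HL
            gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
            if best is None or gain > best[0]:
                best = (gain, feature, threshold,
                        -GL / (HL + lam), -GR / (HR + lam))
    if best is None:  # no valid split: fall back to a single shared weight
        w = -G / (H + lam)
        return 0, float("inf"), w, w
    return best[1:]


def predict_tree(tree, x):
    feature, threshold, w_left, w_right = tree
    return w_left if x[feature] <= threshold else w_right


def train(X, y, n_iterations, lam, theta):
    base = sum(y) / len(y)                     # step 1: initial guess
    y_hat = [base] * len(y)
    trees = []
    for _ in range(n_iterations):              # step 2: boosting iterations
        derivs = [grad_hess(a, b) for a, b in zip(y, y_hat)]
        g = [d[0] for d in derivs]
        h = [d[1] for d in derivs]
        feature, threshold, w_left, w_right = fit_stump(X, g, h, lam)
        tree = (feature, threshold, theta * w_left, theta * w_right)  # shrinkage
        trees.append(tree)
        y_hat = [p + predict_tree(tree, row) for p, row in zip(y_hat, X)]
    return base, trees


def predict(model, x):
    base, trees = model
    return base + sum(predict_tree(tree, x) for tree in trees)


X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 1.0, 3.0, 3.0]
model = train(X, y, n_iterations=20, lam=1.0, theta=0.3)
```

On this toy data each iteration shrinks the residual by a constant factor, so the predictions approach the responses gradually. A smaller \(\theta\) slows that convergence and typically calls for more iterations.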
The algorithm for growing the tree:
1. Generate a bootstrap training set if required (stochastic gradient boosting) as follows: randomly select \(N = f * n\) observations without replacement, where \(f\) is the fraction of observations used for training one tree.
2. Start from the tree with depth \(0\).
3. For each leaf node in the tree:
   1. Choose a subset of features for split finding if required (stochastic gradient boosting).
   2. Find the best split that maximizes the gain:

      \[\frac{G_L^{2}}{H_L+\lambda }+ \frac{G_R^{2}}{H_R+\lambda }- \frac{(G_L+G_R)^{2}}{H_L+ H_R+\lambda }- \gamma\]
4. Stop when a termination criterion is met.
For more details, see [Chen2016].
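The split search for a single feature can be sketched as one pass over the observations in sorted order, keeping running sums of the gradients and Hessians on the left side. The gain is computed exactly as written above, without the factor \(\frac{1}{2}\) used in [Chen2016]. Placing the threshold midway between neighboring values is an assumption of this sketch, and the function names are hypothetical.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of a candidate split, following the formula in the text."""
    parent = (GL + GR) ** 2 / (HL + HR + lam)
    return GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent - gamma


def best_split(values, g, h, lam, gamma):
    """Scan one feature and return (threshold, gain) of the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]
        HL += h[i]
        if values[i] == values[nxt]:
            continue  # cannot split between two equal feature values
        gain = split_gain(GL, HL, G - GL, H - HL, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_threshold, best_gain


threshold, gain = best_split(
    values=[4.0, 1.0, 3.0, 2.0],
    g=[-1.0, 1.0, -1.0, 1.0],
    h=[1.0, 1.0, 1.0, 1.0],
    lam=1.0,
    gamma=0.0,
)
```

In this example the gradients change sign between the values 2.0 and 3.0, so the best threshold falls between them.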
The library supports the following termination criteria when growing the tree:
Minimal number of observations in a leaf node. Node t is not processed if the subset of observations is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.
Maximal tree depth. Node t is not processed if its depth in the tree has reached the predefined value.
Minimal split loss. Node t is not processed if the loss reduction of the best possible split is smaller than the parameter \(\gamma\).
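The three criteria can be expressed as two small checks: one before searching for a split and one on the best split found. This is a sketch with hypothetical function and argument names. Treating a maximal depth of \(0\) as unlimited is an assumption of this sketch.

```python
def can_process_node(n_obs, depth, min_obs_in_leaf, max_depth):
    """Checks made before searching for a split in a node."""
    if n_obs < min_obs_in_leaf:
        return False                      # too few observations in the node
    if max_depth > 0 and depth >= max_depth:
        return False                      # depth limit reached (0 = unlimited)
    return True


def split_allowed(n_left, n_right, loss_reduction, min_obs_in_leaf, min_split_loss):
    """Checks made on the best split found for a node."""
    if n_left < min_obs_in_leaf or n_right < min_obs_in_leaf:
        return False                      # a child would be too small
    return loss_reduction >= min_split_loss
```

A node that fails either check stays a leaf.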
Prediction Stage
Given a gradient boosted trees model and vectors \((x_1, \ldots, x_r)\), the problem is to calculate the responses for those vectors. To solve the problem for each given query vector \(x_i\), the algorithm finds the leaf node in each tree of the ensemble; that leaf gives the response of the tree. The resulting response is based on an aggregation of the responses from all trees in the ensemble. For a detailed definition, see the description of a specific algorithm.
Split Calculation Mode
The library supports two split calculation modes:
exact - all possible split values are examined when searching for the best split for a feature.
inexact - continuous features are bucketed into discrete bins, and the possible splits are restricted to the bucket borders only.
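The difference between the two modes can be illustrated by the set of candidate thresholds each one examines. The sketch below assumes equal-frequency binning for the inexact mode; the text does not specify how the library chooses bin borders, so this is only one plausible scheme, and the function names are hypothetical.

```python
def exact_candidates(values):
    """Exact mode: every distinct value except the largest is a candidate."""
    return sorted(set(values))[:-1]


def bin_borders(values, max_bins):
    """Inexact mode (assumed equal-frequency bins): at most max_bins - 1 borders."""
    ordered = sorted(values)
    n = len(ordered)
    borders = []
    for b in range(1, max_bins):
        k = b * n // max_bins        # observations at or below this border
        if k == 0:
            continue
        border = ordered[k - 1]
        # Skip duplicates and a border that would leave the right side empty.
        if border < ordered[-1] and (not borders or border > borders[-1]):
            borders.append(border)
    return borders


values = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3, 0.8, 0.6]
exact = exact_candidates(values)
inexact = bin_borders(values, max_bins=4)
```

With eight distinct values, the exact mode examines seven thresholds while four bins leave only three. The number of candidates in the inexact mode is bounded by the number of bins rather than by the number of observations.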
Batch Processing
Gradient boosted trees classification and regression follow the general workflow described in Classification Usage Model and Regression Usage Model.
Training
For description of the input and output, refer to .
At the training stage, the gradient boosted trees batch algorithm has the following parameters:
| Parameter | Default Value | Description |
|---|---|---|
|  |  | Split computation mode. Possible values: exact, inexact. |
|  | \(50\) | Maximal number of iterations when training the model; defines the maximal number of trees in the model. |
|  | \(6\) | Maximal tree depth. If the parameter is set to \(0\), the depth is unlimited. |
|  | \(0.3\) | Learning rate of the boosting procedure. Scales the contribution of each tree by a factor in the range \((0, 1]\). |
|  | \(0\) | Loss regularization parameter. Minimal loss reduction required to make a further partition on a leaf node of the tree. Range: \([0, \infty)\) |
|  | \(1\) | L2 regularization parameter on weights. Range: \([0, \infty)\) |
| observationsPerTreeFraction | \(1\) | Fraction of the training set S used for training a single tree, \(0 < \mathrm{observationsPerTreeFraction} \leq 1\). The observations are sampled randomly without replacement. |
| featuresPerNode | \(0\) | The number of features tried as possible splits per node. If the parameter is set to \(0\), all features are used. |
|  | \(5\) | Minimal number of observations in the leaf node. |
|  |  | If true, use the memory saving (but slower) mode. |
|  | SharedPtr<engines::mt19937::Batch>() | Pointer to the random number generator. |
|  | \(256\) | Used with the inexact split method only. Maximal number of discrete bins to bucket continuous features. Increasing the number results in higher computation costs. |
|  | \(5\) | Used with the inexact split method only. Minimal number of observations in a bin. |
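The two sampling parameters described above, observationsPerTreeFraction and featuresPerNode, can be illustrated with a short sketch. It uses Python's standard `random` module rather than the library's engine, and the function names are hypothetical.

```python
import random


def sample_observations(n, fraction, rng):
    """Pick fraction * n row indices without replacement for one tree."""
    count = max(1, int(fraction * n))
    return sorted(rng.sample(range(n), count))


def sample_features(p, features_per_node, rng):
    """Pick the features tried at one node; 0 means all p features."""
    if features_per_node == 0 or features_per_node >= p:
        return list(range(p))
    return sorted(rng.sample(range(p), features_per_node))


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
rows = sample_observations(n=10, fraction=0.5, rng=rng)
features = sample_features(p=4, features_per_node=2, rng=rng)
```

Sampling rows once per tree and features once per node gives the stochastic gradient boosting variant. With the default values (a fraction of \(1\) and \(0\) features per node), no subsampling takes place.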