One of the techniques that has caused the most excitement in the machine learning community is *boosting*, which in essence is a process of iteratively refining estimated regression and classification functions, e.g. by reweighting, in order to improve predictive ability. (It has primarily been applied to classification.)

Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term *off-the-shelf* meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.

In this post I will

- Briefly introduce the topic.
- Give a view of boosting that may not be well known.
- Give a surprising example.

As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.

**Intuition:**

The key point, almost always missed in technical discussions, is that **boosting is really about bias reduction**. Take the linear model, our example in this posting.

A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) **The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.**
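To make the bias point concrete, here is a small simulation (my own illustrative example, not from the currency analysis below): the true regression function is quadratic, but we fit a straight line. No matter how large n is, the mean residual is systematically nonzero in different regions of x.

```r
## Illustrative simulation: true mean function is x^2, but we fit a line.
## The residuals then show a systematic sign pattern across regions of x:
## the line overestimates in the middle and underestimates at the extremes.
set.seed(1)
n <- 10000
x <- runif(n)
y <- x^2 + rnorm(n, sd = 0.05)   # true regression function is x^2
fit <- lm(y ~ x)
res <- residuals(fit)
## mean residual by region; nonzero means bias, however large n is
mean(res[x < 0.15])              # positive: line underestimates here
mean(res[x > 0.4 & x < 0.6])     # negative: line overestimates here
```

No amount of data fixes this; only changing the model, or reweighting the fit region by region, can.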

To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, *Statistical Learning from a Regression Perspective*, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.

Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):

1. Fit the OLS model, naming the resulting coefficient vector b_0. Calculate the residuals and their sum of squares, and set i = 0.
2. Update i to i+1. Fit a weighted least-squares model, using the absolute residuals as weights, naming the result b_i. Calculate the new residuals and their sum of squares.
3. If i is less than the desired number of iterations k, go to Step 2.

In the end, take your final coefficient vector to be a weighted average of all the b_i, with the weights being inversely proportional to the sums of squared residuals.
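The steps above might be sketched in base R as follows. This is my own illustrative implementation, not Berk's code; the function and variable names are made up.

```r
## Sketch of the iterative reweighting scheme described above (my own
## code, not Berk's). Each refit weights observations by the absolute
## residuals of the previous fit; the final coefficient vector averages
## the b_i, weighted inversely by each fit's sum of squared residuals.
berkboost <- function(formula, data, k = 5) {
  environment(formula) <- environment()  # so lm() can find the local weights
  fit <- lm(formula, data)               # Step 1: plain OLS, giving b_0
  blist <- list(coef(fit))
  ssr <- sum(residuals(fit)^2)
  for (i in 1:k) {                       # Steps 2-3: reweight and refit
    w <- abs(residuals(fit)) + 1e-8      # guard against zero weights
    fit <- lm(formula, data, weights = w)
    blist[[i + 1]] <- coef(fit)
    ssr <- c(ssr, sum(residuals(fit)^2))
  }
  wts <- (1 / ssr) / sum(1 / ssr)        # weights inversely prop. to SSR
  colSums(do.call(rbind, blist) * wts)   # weighted average of the b_i
}

b <- berkboost(mpg ~ wt + hp, data = mtcars)
```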

Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.

**Choice of weapons:**

R offers a nice variety of packages for boosting. We’ll use the **mboost** package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.

Of course, like any self-respecting R package, **mboost** offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.

**The data:**

Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.

This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.

After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.

R-squared, whether a population value or the sample estimate reported by **lm()**, is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.
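As a quick check of that fact in the linear case (a toy example of my own, on a built-in data set):

```r
## R-squared as reported by lm() equals the squared correlation between
## y and the fitted values, so the same quantity can be computed for any
## regression estimator, kNN included.
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2.lm  <- summary(fit)$r.squared
r2.cor <- cor(fitted(fit), mtcars$mpg)^2
all.equal(r2.lm, r2.cor)   # TRUE
```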

We’ll use the functions from my **regtools** package.

```
> curr <- read.table('EXC.ASC',header=TRUE)
> curr1 <- curr
> curr1[,-5] <- scale(curr1[,-5])
> fout1 <- lm(Yen ~ .,data=curr1)
> library(regtools)
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE)
> kout <- knnest(curr1[,5],xdata,25)
> ypredknn <- knnpred(kout,xdata$x)
> cor(ypredknn,curr1[,5])^2
[1] 0.9817131
```

This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.

So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.

```
> library(mboost)
> trnidxs <- sample(1:761,500)
> predidxs <- setdiff(1:761,trnidxs)
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
> mean(abs(predy-mbpred))
[1] 14.03786
> mean(abs(predy-lmpred))
[1] 13.20589
```

Well, lo and behold, boosting actually did **worse** than OLS! Clearly we can't generalize from this one example, and as mentioned, many analysts have reported big gains from boosting. But though Breiman was one of the giants of this field, the example here suggests that boosting is not ready for off-the-shelf usage. Indeed, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations seemed to be converging, only to suddenly diverge.

If one is willing to go slightly past “off the shelf,” one can set the number of iterations, using **boost_control()** or **mstop()**. And if one is happy to go completely past “off the shelf,” the **mboost** package has numerous other options to explore.
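A minimal sketch of both routes, on a built-in data set rather than the currency data (requires the **mboost** package):

```r
## Two ways to move past the default of 100 boosting iterations
## (illustrative data; requires mboost).
library(mboost)
## set the iteration count at fit time via boost_control()...
mb1 <- glmboost(mpg ~ wt + hp, data = mtcars,
                control = boost_control(mstop = 1000))
## ...or change it on an already-fitted model via mstop()
mb2 <- glmboost(mpg ~ wt + hp, data = mtcars)
mstop(mb2) <- 1000
```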

Once again: There are no “silver bullets” in this field.

I think you are comparing apples to oranges here, but you haven’t provided the code for the OLD results, so I can’t tell.

Are you comparing the R^2 of the OLS fit on all the data to the R^2 of boosting on the testing data?

I’m not sure what you are asking. I never computed R^2 on the boosted model.

The 0.89 value came from running lm() on the full data, curr1, as did the kNN value of 0.98.

Thanks for your nice promotion of boosting methods and for using our package mboost.

Unfortunately, you do not provide enough information to make your results reproducible. Yet, there are some serious flaws and misconceptions in your post.

1) Model-based boosting is not as much an off-the-shelf method as you think. Leo Breiman was talking about tree-based classification, not boosted linear models.

2) Model-based boosting mainly aims at model fitting with intrinsic variable selection. So if you have a data set with few variables, boosted linear models will most likely converge to the MLE or LS solution. Yet, if many candidate predictors are available, boosting can help you to fit a sparse model with a focus on prediction accuracy (as measured by your loss function).
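This variable-selection behavior can be sketched on simulated data (an illustrative example of my own, not from the discussion; requires **mboost**):

```r
## Sparse simulated setting: 50 candidate predictors, only 2 with real
## effects. glmboost() typically retains a small subset including the
## two true predictors.
library(mboost)
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
y <- 2 * x[, 1] - 3 * x[, 2] + rnorm(n)   # only x1 and x2 matter
dat <- data.frame(y, x)
mb <- glmboost(y ~ ., data = dat)
## coef() on a glmboost fit returns only the selected (nonzero) terms
sel <- names(coef(mb))
sel
```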

3) In any case, the number of iterations (called mstop in mboost) is a very important tuning parameter; in fact, it is the only real tuning parameter. The default value is just 100 iterations, which is usually far too few for a serious model. Increasing the number of iterations will most likely drastically improve your model. E.g., one could use

```
mstop(mbout) <- 1000
```

4) Still, this is not the optimal model. To tune your model, i.e., to optimize your prediction performance, you need to use approaches such as cross-validation. The easiest thing is to use 25-fold bootstrap (with little extra work one could switch to k-fold CV or subsampling, see ?cvrisk for details):

```
cvr <- cvrisk(mbout, grid = 1:1000)
## see if we had enough steps
plot(cvr)
## if so, set model to optimal number of iterations
mstop(mbout) <- mstop(cvr)
```

Now, you can compare your models!

For a detailed and easy introduction to boosting models using mboost we refer you to our comprehensive tutorial

B. Hofner, A. Mayr, N. Robinzonov, M. Schmid (2014). "Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost". Computational Statistics, 29:3-35. (which is also available as vignette: https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf)

Thanks very much for taking the time to write your very informative reply.

I have now edited my post to make the actions reproducible, modulo the randomness in selecting the training and test sets. (By the way, the reason I centered and scaled was not related to boosting.) If you have time to do a very non-“off the shelf” analysis, I will be very interested in seeing the results.

As I wrote in my post, the central theme was Breiman’s “best off-the-shelf” characterization. You mention that he was only talking about trees, which is plausible but certainly not how he is being quoted these days. I have lunch often with one of his old collaborators, and will see if he knows what Leo really meant.

In any case, “off the shelf” is a highly desirable trait for a procedure to have. However, you note that the only real tuning parameter in your procedure is the number of iterations, and I agree that setting that parameter is not too much to ask of users. I did try resetting the number to 1000, and got about a 2% improvement in prediction accuracy — but then things got worse with 10000 iterations.

Thanks for bringing up the point about the number of variables p. For large p (relative to the sample size n), we are in a completely different world, something I wanted to avoid in this particular example. However, I did try glmboost() on a logit model, in which the procedure wound up retaining only 2 of 8 predictors, but again found that boosting didn't seem to help.

Why should a boosted linear model outperform your least squares model? That is a serious misconception.

In general, boosting a linear model (in low dimensions) is just another fitting algorithm, an alternative to LS. This can also be seen in the gist I created, which shows the complete run of the example.

As one can see, the models are *exactly* identical, both regarding the prediction accuracy and the coefficients.

In high-dimensional settings, variable selection *might* boost the prediction accuracy of a model if the “true” model is actually sparse and the model is “correctly” specified.

To gain (even) more power, one might want to use additive models or tree-based models as extensively discussed in the literature (see e.g. the tutorial I linked in my previous comment). This might also help to boost prediction accuracy in low dimensional settings (as opposed to linear models).

By the way, 8 predictors is far from high-dimensional.

Thanks for doing the run.

Actually, I didn’t say that 8 predictors is high-dimensional. What I said was that even with this small number of predictors, glmboost() deleted some of them (not on the currency data set).

As I said, setting the number of iterations to 1000 actually did bring some improvement over OLS. This happened repeatedly. In fact, it happens with your seed of 1234. The improvements are small but consistent.

This jibes with mboost's aim (expressed in the docs, and by you here in this discussion) of countering overfitting. Even with this small (relative to n) number of predictors, overfitting can occur if the effect sizes of some predictors are small.

I’ll have a further posting on this in a few days.

Another boosting package, xgboost, which won the John Chambers Award this year, is now gaining increasing popularity on Kaggle.