The Method of Boosting

One of the techniques that has generated the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining, for example by reweighting, estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability.

Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term off-the-shelf meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.

In this post I will

  • Briefly introduce the topic.
  • Give a view of boosting that may not be well known.
  • Give a surprising example.

As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.

Intuition:

The key point, almost always missed in technical discussions, is that boosting is really about bias reduction. Take the linear model, our example in this posting.

A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.
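
As a quick toy illustration of this kind of region-dependent bias (a sketch of my own, not the currency data analyzed below), consider fitting a straight line to data whose true regression function is quadratic. The fitted line overshoots in some ranges of X and undershoots in others, and no amount of additional data will fix that:

set.seed(1)
n <- 10000
x <- runif(n)
y <- x^2 + rnorm(n,sd=0.1)   # true regression function is x^2
fit <- lm(y ~ x)             # deliberately misspecified linear model
# average residual in three regions of x; the nonzero means are bias,
# and they persist no matter how large n grows
tapply(resid(fit),cut(x,c(0,0.33,0.67,1)),mean)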

To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, Statistical Learning from a Regression Perspective, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.

Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):

  1.  Fit the OLS model, naming the resulting coefficient vector b0. Calculate the residuals and their sum of squares, and set i = 0.
  2.  Update i to i+1. Fit a weighted least-squares model, using the absolute residuals as weights, and name the resulting coefficient vector bi. Calculate the new residuals and their sum of squares.
  3.  If i is less than the desired number of iterations k, go to Step 2.

In the end, take your final coefficient vector to be a weighted average of all the bi, with the weights being inversely proportional to the sums of squared residuals.
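
Here is a rough R sketch of that procedure (my own rendering, not code from Berk's book; the function name iterWtLS, the formula, the data frame and the iteration count k are all placeholders):

iterWtLS <- function(formula,data,k=10) {
   mf <- model.frame(formula,data)
   X <- model.matrix(formula,mf)
   y <- model.response(mf)
   fit <- lm.fit(X,y)                # Step 1: ordinary least squares
   blist <- list(coef(fit))
   rss <- sum(resid(fit)^2)
   for (i in 1:k) {                  # Steps 2 and 3
      w <- abs(resid(fit))           # weights = absolute residuals
      fit <- lm.wfit(X,y,w)
      blist[[i+1]] <- coef(fit)
      rss <- c(rss,sum(resid(fit)^2))
   }
   wts <- (1/rss) / sum(1/rss)       # inversely proportional to the RSS values
   Reduce(`+`,Map(`*`,blist,wts))    # weighted average of all the bi
}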

Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.

Choice of weapons:

R offers a nice variety of packages for boosting. We’ll use the mboost package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.
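
For instance, after fitting the boosted linear model mbout that we'll compute below, the updated coefficients can be inspected directly:

# coefficients of the boosted fit; off2int = TRUE folds mboost's offset into
# the intercept, so the output reads like an lm() coefficient vector
coef(mbout, off2int = TRUE)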

Of course, like any self-respecting R package, mboost offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.

The data:

Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.

This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.

After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.

R-squared, whether a population value or the sample estimate reported by lm(), is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.

We’ll use the functions from my regtools package.


> curr <- read.table('EXC.ASC',header=TRUE)
> curr1 <- curr
> curr1[,-5] <- scale(curr1[,-5])  # center and scale the predictor columns
> fout1 <- lm(Yen ~ .,data=curr1)  # the OLS fit discussed above
> library(regtools)
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE)  # prep the predictors for kNN
> kout <- knnest(curr1[,5],xdata,25)  # kNN regression fit, 25 nearest neighbors
> ypredknn <- knnpred(kout,xdata$x)   # kNN fitted values at the data points
> cor(ypredknn,curr1[,5])^2           # R-squared for kNN
[1] 0.9817131

This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.

So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.


> library(mboost)
> trnidxs <- sample(1:761,500)        # 500 of the 761 rows for training
> predidxs <- setdiff(1:761,trnidxs)  # the remaining 261 rows for testing
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])  # boosted linear model, defaults
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])        # OLS on the same training set
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
> mean(abs(predy-mbpred))  # mean absolute prediction error, boosting
[1] 14.03786
> mean(abs(predy-lmpred))  # mean absolute prediction error, OLS
[1] 13.20589

Well, lo and behold, boosting actually did worse than OLS! Clearly we can't generalize from this one example, and as mentioned, many analysts have reported big gains from boosting. But though Breiman was one of the giants of this field, the example here suggests that boosting is not quite ready for off-the-shelf usage. Indeed, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations seemed to be converging, only to suddenly diverge.

If one is willing to go slightly past “off the shelf,” one can set the number of iterations, using boost_control() or mstop(). And if one is happy to go completely past “off the shelf,” the mboost package has numerous other options to explore.
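
For instance (reusing the object names from the code above), either of the following would do it:

# refit with a larger iteration count via boost_control() ...
mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,],
   control=boost_control(mstop=1000))
# ... or push an existing fit out to more iterations in place
mstop(mbout) <- 1000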

Once again: There are no “silver bullets” in this field.


8 thoughts on “The Method of Boosting”

  1. I think you are comparing apples to oranges here, but you haven’t provided the code for the OLS results, so I can’t tell.

    Are you comparing the R^2 of the OLS fit on all the data to the R^2 of boosting on the testing data?

    1. I’m not sure what you are asking. I never computed R^2 on the boosted model.

      The 0.89 value came from running lm() on the full data, curr1, as did the kNN value of 0.98.

  2. Thanks for your nice promotion of boosting methods and for using our package mboost.

    Unfortunately, you do not provide enough information to make your results reproducible. Yet, there are some serious flaws and misconceptions in your post.

    1) Model-based boosting is not as such an off-the-shelf method as you think. Leo Breiman was talking about tree-based classification and not boosted linear models.

    2) Model-based boosting mainly aims at model fitting with intrinsic variable selection. So if you have a data set with few variables, boosted linear models will most likely converge to the MLE or LS solution. Yet, if many candidate predictors are available, boosting can help you to fit a sparse model with a focus on prediction accuracy (as measured by your loss function).

    3) In any case, the number of iterations (called mstop in mboost) is a very important tuning parameter; in fact, it is the only real tuning parameter. The default value is just 100 iterations, which is usually far too few for a serious model. Increasing the number of iterations will most likely improve your model drastically. E.g., one could use

    mstop(mbout) <- 1000

    4) Still, this is not the optimal model. To tune your model, i.e., to optimize your prediction performance, you need to use approaches such as cross-validation. The easiest thing is to use 25-fold bootstrap (with little extra work one could switch to k-fold CV or subsampling, see ?cvrisk for details):

    cvr <- cvrisk(mbout, grid = 1:1000)
    ## see if we had enough steps
    plot(cvr)
    ## if so set model to optimal number of iterations
    mstop(mbout) <- mstop(cvr)

    Now, you can compare your models!

    For a detailed and easy introduction to boosting models using mboost we refer you to our comprehensive tutorial

    B. Hofner, A. Mayr, N. Robinzonov, M. Schmid (2014). "Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost." Computational Statistics, 29:3-35. (Also available as a vignette: https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf)

    1. Thanks very much for taking the time to write your very informative reply.

      I have now edited my post to make the actions reproducible, modulo the randomness in selecting the training and test sets. (By the way, the reason I centered and scaled was not related to boosting.) If you have time to do a very non-“off the shelf” analysis, I will be very interested in seeing the results.

      As I wrote in my post, the central theme was Breiman’s “best off-the-shelf” characterization. You mention that he was only talking about trees, which is plausible but certainly not how he is being quoted these days. I have lunch often with one of his old collaborators, and will see if he knows what Leo really meant.

      In any case, “off the shelf” is a highly desirable trait for a procedure to have. However, you note that the only real tuning parameter in your procedure is the number of iterations, and I agree that setting that parameter is not too much to ask of users. I did try resetting the number to 1000, and got about a 2% improvement in prediction accuracy — but then things got worse with 10000 iterations.

      Thanks for bringing up the point about the number of variables p. For large p (relative to the sample size n), we are in a completely different world, something I wanted to avoid in the particular example. However, I did try glmboost() on a logit model, in which the procedure wound up retaining only 2 of 8 predictors, but again found that boosting didn’t seem to help.
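
      A fit of that kind looks roughly like the following sketch (dat and its two-level factor response y are placeholder names, not the actual data set I used):

      # placeholder data frame 'dat' with a two-level factor response 'y';
      # Binomial() gives the logistic-loss version of glmboost()
      logitout <- glmboost(y ~ .,data=dat,family=Binomial())
      coef(logitout)   # predictors never selected simply do not appear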

      1. Why should a boosted linear model outperform your least squares model? That is a serious misconception.

        In general, boosting a linear model (in low dimensions) is just another fitting algorithm as an alternative to LS. This can also be seen in the gist I created which shows the complete run of the example:

        ## R version 3.2.2 (2015-08-14) -- "Fire Safety"
        ## Platform: x86_64-pc-linux-gnu (64-bit)
        
        curr <- read.table('EXC.ASC', header=FALSE, dec = ".")
        colnames(curr) <- c("Canada", "German Mark", "French Fr", "Pound", "Yen")
        summary(curr)
        ##      Canada        German Mark      French Fr          Pound       
        ##  Min.   :0.9682   Min.   :1.597   Min.   : 3.969   Min.   :0.4096  
        ##  1st Qu.:1.1213   1st Qu.:1.875   1st Qu.: 4.505   1st Qu.:0.4933  
        ##  Median :1.2001   Median :2.319   Median : 5.660   Median :0.5639  
        ##  Mean   :1.1951   Mean   :2.266   Mean   : 5.913   Mean   :0.5758  
        ##  3rd Qu.:1.2943   3rd Qu.:2.548   3rd Qu.: 6.905   3rd Qu.:0.6542  
        ##  Max.   :1.4274   Max.   :3.424   Max.   :10.480   Max.   :0.9445  
        ##       Yen       
        ##  Min.   :121.2  
        ##  1st Qu.:196.9  
        ##  Median :233.6  
        ##  Mean   :224.9  
        ##  3rd Qu.:257.4  
        ##  Max.   :306.4  
        
        set.seed(1234)
        
        curr1 <- curr
        curr1[,-5] <- scale(curr1[,-5])
        fout1 <- lm(Yen ~ .,data=curr1)
        library(regtools)
        ## Loading required package: FNN
        xdata <- preprocessx(curr1[,-5],25,xval=TRUE)
        kout <- knnest(curr1[,5],xdata,25)
        ypredknn <- knnpred(kout,xdata$x)
        cor(ypredknn,curr1[,5])^2
        ## [1] 0.9817131
         
        library(mboost)
        ## Loading required package: parallel
        ## Loading required package: stabs
        ## This is mboost 2.5-0. See ‘package?mboost’ and ‘news(package  = "mboost")’
        ## for a complete list of changes.
        
        trnidxs <- sample(1:761,500)
        predidxs <- setdiff(1:761,trnidxs)
        mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
        mstop(mbout) <- 10000
        ### cross-validation does not work for large n small p as no overfitting occurs
        # cvr <- cvrisk(mbout, grid = 1:10000)
        # plot(cvr)
        # mstop(mbout) <- mstop(cvr)
        
        lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
        mbpred <- predict(mbout,curr1[predidxs,])
        lmpred <- predict(lmout,curr1[predidxs,])
        predy <- curr1[predidxs,]$Yen
        mean(abs(predy-mbpred))
        ## [1] 13.4263
        mean(abs(predy-lmpred))
        ## [1] 13.4263
        
        coef(mbout, off2int = TRUE)
        ##   (Intercept)        Canada `German Mark`   `French Fr`         Pound 
        ##    224.517080     -4.819203     59.177071    -36.465684     -4.874037 
        coef(lmout)
        ##   (Intercept)        Canada `German Mark`   `French Fr`         Pound 
        ##    224.517080     -4.819203     59.177071    -36.465684     -4.874037 

        As one can see, the models are *exactly* identical, both regarding the prediction accuracy and the coefficients.

        In high-dimensional settings, variable selection *might* boost the prediction accuracy of a model if the “true” model is actually sparse and the model is “correctly” specified.

        To gain (even) more power, one might want to use additive models or tree-based models, as extensively discussed in the literature (see e.g. the tutorial I linked in my previous comment). These might also help to boost prediction accuracy in low-dimensional settings, relative to plain linear models.
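
        For example (a sketch only, using the variable names from the post above, not a run from this discussion):

        # additive model with smooth base-learners
        gamout <- gamboost(Yen ~ .,data=curr1[trnidxs,])
        # boosting with regression trees as base-learners
        treeout <- blackboost(Yen ~ .,data=curr1[trnidxs,])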

        By the way, 8 predictors is by no means high-dimensional.

        1. Thanks for doing the run.

          Actually, I didn’t say that 8 predictors is high-dimensional. What I said was that even with this small number of predictors, glmboost() deleted some of them (not on the currency data set).

          As I said, setting the number of iterations to 1000 actually did bring some improvement over OLS. This happened repeatedly. In fact, it happens with your seed of 1234. The improvements are small but consistent.

          This jibes with mboost’s aim (expressed in the docs and by you here in this discussion) to counter overfitting. Even with this small (relative to n) number of predictors, overfitting can occur, if the effect size of some predictors is small.

          I’ll have a further posting on this in a few days.
