The Method of Boosting

One of the techniques that has generated the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining, for example by reweighting, estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability.

Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term off-the-shelf meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.

In this post I will

  • Briefly introduce the topic.
  • Give a view of boosting that may not be well known.
  • Give a surprising example.

As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.

Intuition:

The key point, almost always missed in technical discussions, is that boosting is really about bias reduction. Take the linear model, our example in this posting.

A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.
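
As a quick toy illustration of this kind of region-dependent bias (a sketch of my own, not the currency data analyzed below), consider fitting a straight line to data whose true regression function is quadratic. The fitted line overshoots in some ranges of X and undershoots in others, and no amount of additional data will fix that:

set.seed(1)
n <- 10000
x <- runif(n)
y <- x^2 + rnorm(n,sd=0.1)   # true regression function is x^2
fit <- lm(y ~ x)             # deliberately misspecified linear model
# average residual in three regions of x; the nonzero means are bias,
# and they persist no matter how large n grows
tapply(resid(fit),cut(x,c(0,0.33,0.67,1)),mean)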

To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, Statistical Learning from a Regression Perspective, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.

Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):

  1.  Fit the OLS model, naming the resulting coefficient vector b0. Calculate the residuals and their sum of squares, and set i = 0.
  2.  Update i to i+1. Fit a weighted least-squares model, using the absolute residuals as weights, and name the resulting coefficient vector bi. Calculate the new residuals and their sum of squares.
  3.  If i is less than the desired number of iterations k, go to Step 2.

In the end, take your final coefficient vector to be a weighted average of all the bi, with the weights being inversely proportional to the sums of squared residuals.
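
Here is a rough R sketch of that procedure (my own rendering, not code from Berk's book; the function name iterWtLS, the formula, the data frame and the iteration count k are all placeholders):

iterWtLS <- function(formula,data,k=10) {
   mf <- model.frame(formula,data)
   X <- model.matrix(formula,mf)
   y <- model.response(mf)
   fit <- lm.fit(X,y)                # Step 1: ordinary least squares
   blist <- list(coef(fit))
   rss <- sum(resid(fit)^2)
   for (i in 1:k) {                  # Steps 2 and 3
      w <- abs(resid(fit))           # weights = absolute residuals
      fit <- lm.wfit(X,y,w)
      blist[[i+1]] <- coef(fit)
      rss <- c(rss,sum(resid(fit)^2))
   }
   wts <- (1/rss) / sum(1/rss)       # inversely proportional to the RSS values
   Reduce(`+`,Map(`*`,blist,wts))    # weighted average of all the bi
}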

Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.

Choice of weapons:

R offers a nice variety of packages for boosting. We’ll use the mboost package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.
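
For instance, after fitting the boosted linear model mbout that we'll compute below, the updated coefficients can be inspected directly:

# coefficients of the boosted fit; off2int = TRUE folds mboost's offset into
# the intercept, so the output reads like an lm() coefficient vector
coef(mbout, off2int = TRUE)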

Of course, like any self-respecting R package, mboost offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.

The data:

Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.

This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.

After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.

R-squared, whether a population value or the sample estimate reported by lm(), is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.

We’ll use the functions from my regtools package.


> curr <- read.table('EXC.ASC',header=TRUE)
> curr1 <- curr
> curr1[,-5] <- scale(curr1[,-5])  # center and scale the predictor columns
> fout1 <- lm(Yen ~ .,data=curr1)  # the OLS fit discussed above
> library(regtools)
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE)  # prep the predictors for kNN
> kout <- knnest(curr1[,5],xdata,25)  # kNN regression fit, 25 nearest neighbors
> ypredknn <- knnpred(kout,xdata$x)   # kNN fitted values at the data points
> cor(ypredknn,curr1[,5])^2           # R-squared for kNN
[1] 0.9817131

This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.

So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.


> library(mboost)
> trnidxs <- sample(1:761,500)        # 500 of the 761 rows for training
> predidxs <- setdiff(1:761,trnidxs)  # the remaining 261 rows for testing
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])  # boosted linear model, defaults
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])        # OLS on the same training set
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
> mean(abs(predy-mbpred))  # mean absolute prediction error, boosting
[1] 14.03786
> mean(abs(predy-lmpred))  # mean absolute prediction error, OLS
[1] 13.20589

Well, lo and behold, boosting actually did worse than OLS! Clearly we can't generalize from this one example, and as mentioned, many analysts have reported big gains from boosting. But though Breiman was one of the giants of this field, the example here suggests that boosting is not quite ready for off-the-shelf usage. Indeed, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations seemed to be converging, only to suddenly diverge.

If one is willing to go slightly past “off the shelf,” one can set the number of iterations, using boost_control() or mstop(). And if one is happy to go completely past “off the shelf,” the mboost package has numerous other options to explore.
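
For instance (reusing the object names from the code above), either of the following would do it:

# refit with a larger iteration count via boost_control() ...
mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,],
   control=boost_control(mstop=1000))
# ... or push an existing fit out to more iterations in place
mstop(mbout) <- 1000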

Once again: There are no “silver bullets” in this field.


8 thoughts on “The Method of Boosting”

  1. I think you are comparing apples to oranges here, but you haven’t provided the code for the OLS results, so I can’t tell.

    Are you comparing the R^2 of the OLS fit on all the data to the R^2 of boosting on the testing data?

    1. I’m not sure what you are asking. I never computed R^2 on the boosted model.

      The 0.89 value came from running lm() on the full data, curr1, as did the kNN value of 0.98.

  2. Thanks for your nice promotion of boosting methods and for using our package mboost.

    Unfortunately, you do not provide enough information to make your results reproducible. Yet, there are some serious flaws and misconceptions in your post.

    1) Model-based boosting is not as such an off-the-shelf method as you think. Leo Breiman was talking about tree-based classification and not boosted linear models.

    2) Model-based boosting mainly aims at model fitting with intrinsic variable selection. So if you have a data set with few variables, boosted linear models will most likely converge to the MLE or LS solution. Yet, if many candidate predictors are available, boosting can help you to fit a sparse model with a focus on prediction accuracy (as measured by your loss function).

    3) In any case, the number of iterations (called mstop in mboost) is a very important tuning parameter; in fact, it is the only real tuning parameter. The default value is just 100 iterations, which is usually far too few for a serious model. Increasing the number of iterations will most likely improve your model drastically. E.g., one could use

    mstop(mbout) <- 1000

    4) Still, this is not the optimal model. To tune your model, i.e., to optimize your prediction performance, you need to use approaches such as cross-validation. The easiest thing is to use 25-fold bootstrap (with little extra work one could switch to k-fold CV or subsampling, see ?cvrisk for details):

    cvr <- cvrisk(mbout, grid = 1:1000)
    ## see if we had enough steps
    plot(cvr)
    ## if so set model to optimal number of iterations
    mstop(mbout) <- mstop(cvr)

    Now, you can compare your models!

    For a detailed and easy introduction to boosting models using mboost we refer you to our comprehensive tutorial

    B. Hofner, A. Mayr, N. Robinzonov, M. Schmid (2014). "Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost." Computational Statistics, 29:3-35. (Also available as a vignette: https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf)

    1. Thanks very much for taking the time to write your very informative reply.

      I have now edited my post to make the actions reproducible, modulo the randomness in selecting the training and test sets. (By the way, the reason I centered and scaled was not related to boosting.) If you have time to do a very non-“off the shelf” analysis, I will be very interested in seeing the results.

      As I wrote in my post, the central theme was Breiman’s “best off-the-shelf” characterization. You mention that he was only talking about trees, which is plausible but certainly not how he is being quoted these days. I have lunch often with one of his old collaborators, and will see if he knows what Leo really meant.

      In any case, “off the shelf” is a highly desirable trait for a procedure to have. However, you note that the only real tuning parameter in your procedure is the number of iterations, and I agree that setting that parameter is not too much to ask of users. I did try resetting the number to 1000, and got about a 2% improvement in prediction accuracy — but then things got worse with 10000 iterations.

      Thanks for bringing up the point about the number of variables p. For large p (relative to the sample size n), we are in a completely different world, something I wanted to avoid in the particular example. However, I did try glmboost() on a logit model, in which the procedure wound up retaining only 2 of 8 predictors, but again found that boosting didn’t seem to help.
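
      A fit of that kind looks roughly like the following sketch (dat and its two-level factor response y are placeholder names, not the actual data set I used):

      # placeholder data frame 'dat' with a two-level factor response 'y';
      # Binomial() gives the logistic-loss version of glmboost()
      logitout <- glmboost(y ~ .,data=dat,family=Binomial())
      coef(logitout)   # predictors never selected simply do not appear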

      1. Why should a boosted linear model outperform your least squares model? That is a serious misconception.

        In general, boosting a linear model (in low dimensions) is just another fitting algorithm as an alternative to LS. This can also be seen in the gist I created which shows the complete run of the example:

        ## R version 3.2.2 (2015-08-14) -- "Fire Safety"
        ## Platform: x86_64-pc-linux-gnu (64-bit)
        
        curr <- read.table('EXC.ASC', header=FALSE, dec = ".")
        colnames(curr) <- c("Canada", "German Mark", "French Fr", "Pound", "Yen")
        summary(curr)
        ##      Canada        German Mark      French Fr          Pound       
        ##  Min.   :0.9682   Min.   :1.597   Min.   : 3.969   Min.   :0.4096  
        ##  1st Qu.:1.1213   1st Qu.:1.875   1st Qu.: 4.505   1st Qu.:0.4933  
        ##  Median :1.2001   Median :2.319   Median : 5.660   Median :0.5639  
        ##  Mean   :1.1951   Mean   :2.266   Mean   : 5.913   Mean   :0.5758  
        ##  3rd Qu.:1.2943   3rd Qu.:2.548   3rd Qu.: 6.905   3rd Qu.:0.6542  
        ##  Max.   :1.4274   Max.   :3.424   Max.   :10.480   Max.   :0.9445  
        ##       Yen       
        ##  Min.   :121.2  
        ##  1st Qu.:196.9  
        ##  Median :233.6  
        ##  Mean   :224.9  
        ##  3rd Qu.:257.4  
        ##  Max.   :306.4  
        
        set.seed(1234)
        
        curr1 <- curr
        curr1[,-5] <- scale(curr1[,-5])
        fout1 <- lm(Yen ~ .,data=curr1)
        library(regtools)
        ## Loading required package: FNN
        xdata <- preprocessx(curr1[,-5],25,xval=TRUE)
        kout <- knnest(curr1[,5],xdata,25)
        ypredknn <- knnpred(kout,xdata$x)
        cor(ypredknn,curr1[,5])^2
        ## [1] 0.9817131
         
        library(mboost)
        ## Loading required package: parallel
        ## Loading required package: stabs
        ## This is mboost 2.5-0. See ‘package?mboost’ and ‘news(package  = "mboost")’
        ## for a complete list of changes.
        
        trnidxs <- sample(1:761,500)
        predidxs <- setdiff(1:761,trnidxs)
        mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
        mstop(mbout) <- 10000
        ### cross-validation does not work for large n small p as no overfitting occurs
        # cvr <- cvrisk(mbout, grid = 1:10000)
        # plot(cvr)
        # mstop(mbout) <- mstop(cvr)
        
        lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
        mbpred <- predict(mbout,curr1[predidxs,])
        lmpred <- predict(lmout,curr1[predidxs,])
        predy <- curr1[predidxs,]$Yen
        mean(abs(predy-mbpred))
        ## [1] 13.4263
        mean(abs(predy-lmpred))
        ## [1] 13.4263
        
        coef(mbout, off2int = TRUE)
        ##   (Intercept)        Canada `German Mark`   `French Fr`         Pound 
        ##    224.517080     -4.819203     59.177071    -36.465684     -4.874037 
        coef(lmout)
        ##   (Intercept)        Canada `German Mark`   `French Fr`         Pound 
        ##    224.517080     -4.819203     59.177071    -36.465684     -4.874037 

        As one can see, the models are *exactly* identical, both regarding the prediction accuracy and the coefficients.

        In high-dimensional settings, variable selection *might* boost the prediction accuracy of a model if the “true” model is actually sparse and the model is “correctly” specified.

        To gain (even) more power, one might want to use additive models or tree-based models, as extensively discussed in the literature (see e.g. the tutorial I linked in my previous comment). These might also help to boost prediction accuracy in low-dimensional settings, relative to plain linear models.
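
        For example (a sketch only, using the variable names from the post above, not a run from this discussion):

        # additive model with smooth base-learners
        gamout <- gamboost(Yen ~ .,data=curr1[trnidxs,])
        # boosting with regression trees as base-learners
        treeout <- blackboost(Yen ~ .,data=curr1[trnidxs,])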

        By the way, 8 predictors is by no means high-dimensional.

        1. Thanks for doing the run.

          Actually, I didn’t say that 8 predictors is high-dimensional. What I said was that even with this small number of predictors, glmboost() deleted some of them (not on the currency data set).

          As I said, setting the number of iterations to 1000 actually did bring some improvement over OLS. This happened repeatedly. In fact, it happens with your seed of 1234. The improvements are small but consistent.

          This jibes with mboost’s aim (expressed in the docs and by you here in this discussion) to counter overfitting. Even with this small (relative to n) number of predictors, overfitting can occur, if the effect size of some predictors is small.

          I’ll have a further posting on this in a few days.
