The Generalized Method of Moments and the gmm package

An almost-as-famous alternative to the famous Maximum Likelihood Estimation is the Method of Moments. MM has always been a favorite of mine because it often requires fewer distributional assumptions than MLE, and also because MM is much easier to explain than MLE to students and consulting clients. CRAN has a package gmm that does MM, actually the Generalized Method of Moments, and in this post I’ll explain how to use it (on the elementary level, at least).

Our data set will be bodyfat, which is included in the mfp package, with measurements on 252 men. The first column of this data frame, brozek, is the percentage of body fat, which when converted to a proportion is in (0,1). That makes it a candidate for fitting a beta distribution.

Denoting the parameters of that family by α and β, the mean and variance are α / (α + β) and α β / ((α + β)2 (α + β + 1)), respectively. MM is a model of simplicity — we match these to the sample mean and variance of our data, and then solve for α and β. Of course, the solution part may not be so simple, due to nonlinearities, but gmm worries about all that for us. Yet another tribute to the vast variety of packages available on CRAN!

In our elementary usage here, the call form is

gmm(data,momentftn,start)

where data is our data in matrix or vector form, momentftn specifies the moments, and start is our initial guess for the iterative solution process. Let’s specify momentftn:

g <- function(th,x) {
  t1 <- th[1]
  t2 <- th[2]
  t12 <- t1 + t2
  meanb <- t1 / t12
  m1 <- meanb - x
  m2 <- t1*t2 / (t12^2 * (t12+1)) - (x - meanb)^2
  f <- cbind(m1,m2)
  return(f)
}

This function equates population moments to sample ones, by specifying expressions that gmm() is to set to 0. The argument th here (“theta”) will be the MM estimates (at any given iteration) of the population parameters, in this case of α and β.

The function is required to specify quantities whose averages are to be set to 0. So, in the line

m1 <- meanb – x

we are saying that we want the average of the right-hand side to be 0, where x is our data. That has the effect of setting our current iteration’s estimate of α / (α + β) to our sample mean — exactly what MM is supposed to do. The line assigning to m2 then does the same thing for variance.

So, let’s try all this out on the body fat data:

> gmm(g,x,c(alpha=0.1,beta=0.1))
Method
 twoStep 

Objective function value: 2.559645e-10 

 alpha beta 
 4.6714 19.9969 

Convergence code = 0 
> hist(bodyfat$brozek/100,xlim=c(0,1),
   probability=TRUE)
> curve(dbeta(x,4.67,20.00),add=TRUE)

The result looks like this (apologies for the lack of refinement in this quick graph, cutting off part of the top):

xBodyfat
At least visually, it seems to be a pretty good fit.

For standard errors etc., a method for the generic function vcov() is provided:

> gmmout <- gmm(g,x,c(alpha=0.1,beta=0.1))
> vcov(gmmout)
          alpha beta
alpha 0.2809361 0.9606354
beta 0.9606354 3.9266874

Happy GMM-ing!

Lots more posts coming, when I have time.

The Method of Boosting

One of the techniques that has caused the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining, e.g. by reweighting, of estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability.

Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term off-the-shelf meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.

In this post I will

  • Briefly introduce the topic.
  • Give a view of boosting that may not be well known.
  • Give a surprising example.

As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.

Intuition:

The key point, almost always missed in technical discussions, is that boosting is really about bias reduction. Take the linear model, our example in this posting.

A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.

To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, Statistical Learning from a Regression Perspective, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.

Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):

  1.  Fit the OLS model, naming the resulting coefficient vector b0. Calculate the residuals and their sum of squares, and set i =0.
  2.  Update i to i+1. Fit a weighted least-squares model, using as weights the absolute residuals, naming the result bi. Calculate the new residuals and sum of squares.
  3. If i is less than the desired number of iterations k, go to Step 2.

In the end, take your final coefficient vector to be a weighted average of all the bi, with the weights being inversely proportional to the sums of squared residuals.

Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.

Choice of weapons:

R offers a nice variety of packages for boosting. We’ll use the mboost package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.

Of course, like any self-respecting R package, mboost offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.

The data:

Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.

This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.

After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.

R-squared, whether a population value or the sample estimate reported by lm(), is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.

We’ll use the functions from my regtools package.


> curr <- read.table('EXC.ASC',header=TRUE)
> curr1 <- curr
> curr1[,-5] <- scale(curr1[,-5])
> fout1 <- lm(Yen ~ .,data=curr1)
> library(regtools)
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE)
> kout <- knnest(curr1[,5],xdata,25)
> ypredknn <- knnpred(kout,xdata$x)
> cor(ypredknn,curr1[,5])^2
[1] 0.9817131

This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.

So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.


> library(mboost)
> trnidxs <- sample(1:761,500)
> predidxs <- setdiff(1:761,trnidxs)
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
> mean(abs(predy-mbpred))
[1] 14.03786
> mean(abs(predy-lmpred))
[1] 13.20589

Well, lo and behold, boosting actually did worse than OLS! Clearly we can’t generalize from this, and as mentioned, many analysts have reported big gains from boosting. And though Breiman was one of the giants of this field, the example here shows that boosting is not ready for off-the-shelf usage. On the contrary, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations of boosting seemed to be converging, only to have them suddenly diverge.

If one is willing to go slightly past “off the shelf,” one can set the number of iterations, using boost_control() or mstop(). And if one is happy to go completely past “off the shelf,” the mboost package has numerous other options to explore.

Once again: There are no “silver bullets” in this field.

 

 

 

OVA vs. AVA in Classification Problems, via regtools

OVA and AVA? Huh?

These stand for One vs. All and All vs. All, in classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package.

As some of you know, I’m developing the latter in conjunction with a book I’m writing on regression and classification. The package, of course, is usable and helpful independently of the book, though the material in this post will be drawn largely from the book.

In the verterbral data there are m = 3 classes: Normal, Disk Hernia and Spondylolisthesis. The predictors are, as described on the UCI site, “six biomechanical attributes derived from the shape and orientation of the pelvis.”

Let m denote the number of classes. Consider two approaches we might take to predicting the status of the vertebral column, based on logistic regression:

  • One vs. All (OVA): Here we would fit 3 logit models to our training data, predicting each of the 3 classes, one at a time, from our 6 variables. The regtools function ovalogtrn() does this for us, yielding a 7×3 matrix, which is then used for future predictions. For a given new data point, we guess the unknown class to be whichever one has maximal probability, given the new data point; the regtools function ovalogpred() handles the details for us here.
  • All vs. All (AVA): Here we look at all possible pairs of classes. There will again be 3 of them in this case, though in general the number of pairs will be m (m-1) / 2, with that many columns in our output matrix, as opposed to just m for OVA. At any rate, for each pair we restrict our training data to just the points corresponding to one of the two classes in the pair, then run a logit analysis predicting, say, the first class of the pair.  The regtools functions avalogtrn() and avalogpred() do the work for us.

Clearly, AVA involves a lot of computation. For fixed number of predictor variables p, here is a rough time estimate. For a logit model, the computation will be proportional to the number of cases n (due to computing various sums over all cases). Say our training data is approximately balanced in terms of sizes of the classes, so that the data corresponding to class i has about n/m cases in it, Then the computation for one pair will be O(n/m), but there will be O(m2) pairs, so the total amount of computation will be O(mn) –potentially much larger than the O(n) used by OVA.

Well, then, do we benefit from that extra computation? At least at first glance, AVA would not seem to have much to offer. For instance, since each of its models uses much less than our full data, the resulting estimated coefficients will likely be less accurate than what we calculate under OVA. And if m is large, we will have so many pairs that at least some will likely be especially inaccurate. And yet some researchers claim they find AVA to work better, due to imperfections in the model used.

Let’s try it out on the vertebral column data (warning messages, signaling probabilities near 0 or 1, not shown):


> vert <- read.table('Vertebrae/column_3C.dat',
   header=FALSE)
> vert$V7 <- as.numeric(vert$V7) - 1
> trnidxs <- sample(1:310,225)
> predidxs <- setdiff(1:310,trnidxs)
> ovout <- ovalogtrn(3,vert[trnidxs,])
> predy <- ovalogpred(ovout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,])
[1] 0.8823529
> avout <- avalogtrn(3,vert[trnidxs,])
> predy <-  avalogpred(3,avout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,7])
[1] 0.8588235

The function ovalogtrn() requires the response (class) variable to be coded 0,1,…,m-1, hence the call to as.numeric(),

At any rate, not much difference, if any, between OVA and AVA in this example. However, the selling point of AVA is supposed to be that it may be effective when the model we are using is not approximately valid.

A good candidate for such a model is the logit appled to the letter recognition data. (I discovered this when the logit turned out to do much less well than k-Nearest Neighbors, and in retrospect it seems plausible, given the nature of the predictors.) The difference between OVA and AVA here was dramatic:



> library(regtools) 
Loading required package: FNN 
> library(mlbench) 
> data(LetterRecognition) 
> lr <- LetterRecognition 
> lr[,1] <- as.numeric(lr[,1]) - 1 
> # training and test sets 
> lrtrn <- lr[1:14000,] 
> lrtest <- lr[14001:20000,] 
> ologout <- ovalogtrn(26,lrtrn[,c(2:17,1)]) 
> ypred <- ovalogpred(ologout,lrtest[,-1]) 
> mean(ypred == lrtest[,1]) 
[1] 0.7193333 
> alogout <- avalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- avalogpred(26,alogout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.8355

So, apparently AVA fixed a poor model. Of course, it’s better to make a good model in the first place. 🙂

In fact, it turns out that adding quadratic terms to the predictors (not shown) helps a lot. Thus I don’t suggest using AVA as your go-to method. But it’s there in regtools if you want to try it.