OVA vs. AVA in Classification Problems, via regtools

OVA and AVA? Huh?

These stand for One vs. All and All vs. All, in classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package.

As some of you know, I’m developing the latter in conjunction with a book I’m writing on regression and classification. The package, of course, is usable and helpful independently of the book, though the material in this post will be drawn largely from the book.

In the verterbral data there are m = 3 classes: Normal, Disk Hernia and Spondylolisthesis. The predictors are, as described on the UCI site, “six biomechanical attributes derived from the shape and orientation of the pelvis.”

Let m denote the number of classes. Consider two approaches we might take to predicting the status of the vertebral column, based on logistic regression:

• One vs. All (OVA): Here we would fit 3 logit models to our training data, predicting each of the 3 classes, one at a time, from our 6 variables. The regtools function ovalogtrn() does this for us, yielding a 7×3 matrix, which is then used for future predictions. For a given new data point, we guess the unknown class to be whichever one has maximal probability, given the new data point; the regtools function ovalogpred() handles the details for us here.
• All vs. All (AVA): Here we look at all possible pairs of classes. There will again be 3 of them in this case, though in general the number of pairs will be m (m-1) / 2, with that many columns in our output matrix, as opposed to just m for OVA. At any rate, for each pair we restrict our training data to just the points corresponding to one of the two classes in the pair, then run a logit analysis predicting, say, the first class of the pair.  The regtools functions avalogtrn() and avalogpred() do the work for us.

Clearly, AVA involves a lot of computation. For fixed number of predictor variables p, here is a rough time estimate. For a logit model, the computation will be proportional to the number of cases n (due to computing various sums over all cases). Say our training data is approximately balanced in terms of sizes of the classes, so that the data corresponding to class i has about n/m cases in it, Then the computation for one pair will be O(n/m), but there will be O(m2) pairs, so the total amount of computation will be O(mn) –potentially much larger than the O(n) used by OVA.

Well, then, do we benefit from that extra computation? At least at first glance, AVA would not seem to have much to offer. For instance, since each of its models uses much less than our full data, the resulting estimated coefficients will likely be less accurate than what we calculate under OVA. And if m is large, we will have so many pairs that at least some will likely be especially inaccurate. And yet some researchers claim they find AVA to work better, due to imperfections in the model used.

Let’s try it out on the vertebral column data (warning messages, signaling probabilities near 0 or 1, not shown):

``````
> vert\$V7 <- as.numeric(vert\$V7) - 1
> trnidxs <- sample(1:310,225)
> predidxs <- setdiff(1:310,trnidxs)
> ovout <- ovalogtrn(3,vert[trnidxs,])
> predy <- ovalogpred(ovout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,])
[1] 0.8823529
> avout <- avalogtrn(3,vert[trnidxs,])
> predy <-  avalogpred(3,avout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,7])
[1] 0.8588235
``````

The function ovalogtrn() requires the response (class) variable to be coded 0,1,…,m-1, hence the call to as.numeric(),

At any rate, not much difference, if any, between OVA and AVA in this example. However, the selling point of AVA is supposed to be that it may be effective when the model we are using is not approximately valid.

A good candidate for such a model is the logit appled to the letter recognition data. (I discovered this when the logit turned out to do much less well than k-Nearest Neighbors, and in retrospect it seems plausible, given the nature of the predictors.) The difference between OVA and AVA here was dramatic:

``````

> library(regtools)
> library(mlbench)
> data(LetterRecognition)
> lr <- LetterRecognition
> lr[,1] <- as.numeric(lr[,1]) - 1
> # training and test sets
> lrtrn <- lr[1:14000,]
> lrtest <- lr[14001:20000,]
> ologout <- ovalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- ovalogpred(ologout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.7193333
> alogout <- avalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- avalogpred(26,alogout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.8355
``````

So, apparently AVA fixed a poor model. Of course, it’s better to make a good model in the first place. 🙂

In fact, it turns out that adding quadratic terms to the predictors (not shown) helps a lot. Thus I don’t suggest using AVA as your go-to method. But it’s there in regtools if you want to try it.