OVA and AVA? Huh?

These stand for One vs. All and All vs. All, two approaches to classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition data, and analyze them using my **regtools** package.

As some of you know, I’m developing the latter in conjunction with a book I’m writing on regression and classification. The package, of course, is usable and helpful independently of the book, though the material in this post will be drawn largely from the book.

In the vertebral data there are m = 3 classes: Normal, Disk Hernia and Spondylolisthesis. The predictors are, as described on the UCI site, “six biomechanical attributes derived from the shape and orientation of the pelvis.”

As above, m denotes the number of classes. Consider two approaches we might take to predicting the status of the vertebral column, based on logistic regression:

**One vs. All (OVA):** Here we would fit 3 logit models to our training data, predicting each of the 3 classes, one at a time, from our 6 variables. The **regtools** function **ovalogtrn()** does this for us, yielding a 7×3 matrix, which is then used for future predictions. For a given new data point, we guess the unknown class to be whichever one has maximal probability, given the new data point; the **regtools** function **ovalogpred()** handles the details for us here.

**All vs. All (AVA):** Here we look at all possible pairs of classes. There will again be 3 of them in this case, though in general the number of pairs will be m (m-1) / 2, with that many columns in our output matrix, as opposed to just m for OVA. At any rate, for each pair we restrict our training data to just the points corresponding to one of the two classes in the pair, then run a logit analysis predicting, say, the first class of the pair. The **regtools** functions **avalogtrn()** and **avalogpred()** do the work for us.
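To make the two schemes concrete, here is a minimal sketch that uses only base R's **glm()** on the built-in iris data (3 classes, 4 predictors). It is not the **regtools** code: the helper names `ova_train`, `ova_predict`, `ava_train` and `ava_predict` are hypothetical, and the majority-vote rule used for AVA is a common choice assumed here for illustration, not a description of what **avalogpred()** does internally.

```r
# Illustrative sketch only; helper names are hypothetical, not regtools functions.
# Both trainers assume the class label is the LAST column, coded 0,1,...,m-1.

ova_train <- function(m, trn) {
  p <- ncol(trn) - 1
  # one logit fit per class: class i vs. everything else
  sapply(0:(m - 1), function(i) {
    d <- data.frame(trn[, 1:p, drop = FALSE], cls = as.integer(trn[, p + 1] == i))
    suppressWarnings(coef(glm(cls ~ ., data = d, family = binomial)))
  })   # (p+1) x m matrix: intercept plus p slopes per class
}

ova_predict <- function(coefmat, newx) {
  # the class with the largest linear predictor also has the largest probability
  scores <- cbind(1, as.matrix(newx)) %*% coefmat
  max.col(scores, ties.method = "first") - 1
}

ava_train <- function(m, trn) {
  p <- ncol(trn) - 1
  pairs <- combn(0:(m - 1), 2)              # 2 x m(m-1)/2 matrix of class pairs
  coefs <- apply(pairs, 2, function(pr) {
    d <- trn[trn[, p + 1] %in% pr, ]        # keep only the two classes in the pair
    d <- data.frame(d[, 1:p, drop = FALSE], cls = as.integer(d[, p + 1] == pr[1]))
    suppressWarnings(coef(glm(cls ~ ., data = d, family = binomial)))
  })
  list(pairs = pairs, coefs = coefs)
}

ava_predict <- function(fit, newx, m) {
  scores <- cbind(1, as.matrix(newx)) %*% fit$coefs
  votes <- matrix(0, nrow(scores), m)
  for (k in seq_len(ncol(fit$pairs))) {
    # positive score means the first class of the pair wins this contest
    winner <- ifelse(scores[, k] > 0, fit$pairs[1, k], fit$pairs[2, k])
    idx <- cbind(seq_len(nrow(scores)), winner + 1)
    votes[idx] <- votes[idx] + 1
  }
  max.col(votes, ties.method = "first") - 1  # assumed rule: most votes wins
}

set.seed(1)
ir <- iris
ir$Species <- as.numeric(ir$Species) - 1     # recode classes as 0,1,2
trnidx <- sample(1:150, 100)
cf <- ova_train(3, ir[trnidx, ])
av <- ava_train(3, ir[trnidx, ])
pred_ova <- ova_predict(cf, ir[-trnidx, 1:4])
pred_ava <- ava_predict(av, ir[-trnidx, 1:4], 3)
dim(cf)                                      # 5 x 3: intercept + 4 slopes, per class
c(OVA = mean(pred_ova == ir[-trnidx, 5]), AVA = mean(pred_ava == ir[-trnidx, 5]))
```

The calls to `suppressWarnings()` hide the same kind of "probabilities near 0 or 1" messages mentioned for the vertebral data; they arise here because one iris class is linearly separable from the others.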

Clearly, AVA involves a lot of computation. For a fixed number of predictor variables p, here is a rough time estimate. For a logit model, the computation will be proportional to the number of cases n (due to computing various sums over all cases). Say our training data is approximately balanced in terms of sizes of the classes, so that the data corresponding to class i has about n/m cases in it. Then the computation for one pair will be O(n/m), but there will be O(m^2) pairs, so the total amount of computation will be O(mn), potentially much larger than the O(n) used by OVA.
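As a quick check on the AVA count, assuming balanced classes (so each pair contributes about 2n/m cases) and a cost per fit proportional to the number of cases used:

$$
\underbrace{\frac{m(m-1)}{2}}_{\text{pairs}} \times \underbrace{\frac{2n}{m}}_{\text{cases per pair}} = (m-1)\,n = O(mn)
$$

For the letter recognition data used below, with m = 26 and n = 14000 training cases, that is 325 pairwise fits on roughly 1077 cases each, or about 350,000 case-passes in all.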

Well, then, do we benefit from that extra computation? At least at first glance, AVA would not seem to have much to offer. For instance, since each of its models uses much less than our full data, the resulting estimated coefficients will likely be less accurate than what we calculate under OVA. And if m is large, we will have so many pairs that at least some will likely be especially inaccurate. And yet some researchers claim they find AVA to work better, due to imperfections in the model used.

Let’s try it out on the vertebral column data (warning messages, signaling probabilities near 0 or 1, not shown):

```
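> # stringsAsFactors=TRUE is needed in R >= 4.0, where it is no longer the default
> vert <- read.table('Vertebrae/column_3C.dat',
+    header=FALSE, stringsAsFactors=TRUE)
> # recode the class factor V7 as 0,1,2
> vert$V7 <- as.numeric(vert$V7) - 1
> # random split: 225 training cases, 85 test cases
> trnidxs <- sample(1:310,225)
> predidxs <- setdiff(1:310,trnidxs)
> # class variable must be in the last column
> ovout <- ovalogtrn(3,vert[trnidxs,])
> predy <- ovalogpred(ovout,vert[predidxs,1:6])
> # proportion of test cases classified correctly
> mean(predy == vert[predidxs,7])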
[1] 0.8823529
> avout <- avalogtrn(3,vert[trnidxs,])
> predy <- avalogpred(3,avout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,7])
[1] 0.8588235
```

The function **ovalogtrn()** requires the response (class) variable to be coded 0,1,…,m-1, hence the call to **as.numeric()**, which converts the class factor to the integer codes 1,…,m, followed by subtracting 1.
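Here is a tiny illustration of that recoding, using made-up labels (the vector `cls` is hypothetical, not read from the UCI file). Note that the labels must be a factor; calling **as.numeric()** directly on a character vector would yield NAs instead.

```r
# Hypothetical class labels, for illustration only
cls <- c("DH", "NO", "SL", "NO")
f <- factor(cls)              # levels are sorted alphabetically: DH, NO, SL
codes <- as.numeric(f) - 1    # factor codes 1,2,3 become 0,1,2
codes                         # 0 1 2 1
```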

At any rate, not much difference, if any, between OVA and AVA in this example. However, the selling point of AVA is supposed to be that it may be effective when the model we are using is not approximately valid.

A good candidate for such a model is the logit applied to the letter recognition data. (I discovered this when the logit turned out to do much less well than k-Nearest Neighbors, and in retrospect it seems plausible, given the nature of the predictors.) The difference between OVA and AVA here was dramatic:

```
> library(regtools)
Loading required package: FNN
> library(mlbench)
> data(LetterRecognition)
> lr <- LetterRecognition
> lr[,1] <- as.numeric(lr[,1]) - 1
> # training and test sets
> lrtrn <- lr[1:14000,]
> lrtest <- lr[14001:20000,]
> ologout <- ovalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- ovalogpred(ologout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.7193333
> alogout <- avalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- avalogpred(26,alogout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.8355
```

So, apparently AVA fixed a poor model. Of course, it’s better to make a good model in the first place. 🙂

In fact, it turns out that adding quadratic terms to the predictors (not shown) helps a lot. Thus I don’t suggest using AVA as your go-to method. But it’s there in **regtools** if you want to try it.
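For readers who want to experiment, one simple way to build such terms is sketched below. The helper `add_squares` is hypothetical, and it appends only squared terms; whether the analysis mentioned above also used cross-product terms is not stated, so treat this as one possible reading rather than the exact procedure.

```r
# Hypothetical helper: append the square of each numeric predictor column
add_squares <- function(x) {
  x <- as.matrix(x)
  sq <- x^2
  colnames(sq) <- paste0(colnames(x), ".sq")
  data.frame(x, sq)           # original columns first, then their squares
}

x2 <- add_squares(iris[1:5, 1:4])
dim(x2)                       # 5 rows, 8 columns
names(x2)
```

The expanded predictors would then be placed ahead of the class column, in the same predictors-first, class-last layout used in the calls above.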

Wouldn’t fitting multiple binomial logistic models lead to an increase in false positives compared to using a multinomial logit model that can predict the three discrete class memberships simultaneously? It also seems like out of sample predictions would improve with a multinomial logit model since the parameter estimates would more closely reflect the functional form of the data generation process.

The multinomial logistic model’s assumptions are very restrictive. If the assumptions hold, then the standard errors of the estimated logit coefficients will be smaller. But if the assumptions don’t hold, then you introduce bias, and are in trouble.

It is not really a multiple/simultaneous inference problem. But AVA does have that risk, as I stated.