Unbalanced Data Is a Problem? No, BALANCED Data Is Worse

Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in the case in which there is — artificially — little or no variation among those sample sizes.

To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.

First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.

But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t), as the sample size grows. That’s not true for what I will call the artificially balanced case. Here the Nare typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here’s why.

By an elementary derivation, we have that (at the population level)

P(Y | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)])    Eqn. (1)

where f and g are the densities of X within Classes 0 and 1.  Consider the logistic model. Equation (1) implies that

β0 + β1 t1 + … + βv tv = -ln[(1-p)/p] – ln[f(t)/g(t)]    Eqn. (2)

From this you can see that βinvolves the quantity

-ln[(1-p)/p],   Eqn. (3)

which in turn implies that if the sample sizes are chosen artificially, then our estimate of  βin the output of R’s glm() function (or any other code for logit) will be wrong. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.

So, arguably the problem of (artificially) balanced data is worse than the unbalanced case.

The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of βi WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right value for that expression, subtract the wrong one, and then happily do our future classifications.

As an example, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true frequencies.

Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment.  For each t at which we wish to predict class membership, we do the following:

  • Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my regtools package.
  • Solve for the estimated ratio f(t)/g(t).
  • Plug back into (1), this time with the correct value of (1-p)/p from the external source, now providing the correct value of the LHS.

More on the Heteroscedasticity Issue

In my last post, I discussed R software, including mine, that handles heteroscedastic settings for linear and nonlinear regression models. Several readers had interesting comments and questions, which I will address here.

To review: Though most books and software assume homoscedasticity, i.e. constancy of the variance of the response variable at all levels of the predictor variables, this condition seldom if ever holds in real life. This can make a mockery of the reported standard errors, and thus render confidence intervals and tests rather meaningless. But fortunately, there is asymptotic theory that gives use valid standard errors in the heteroscedastic case, due to Eickert, White and others, for the linear case. In R, this is available via the functions vcovHC() and hccm() in the CRAN packages sandwich and carrespectively.

I mentioned that I had adapted the E-W idea to the nonlinear realm, leveraging the linear-approximation core of the Gauss-Newton method for nonlinear least-squares computation, by calling vcovHC() on the linear approximation. The result is the function nlshc() in my package regtools. (Unfortunately, I erroneously used its old name, nlsvcovhc() in my post.) One simply calls nlshc() on an object of class ‘nls’ (output of R’s nls() function) to obtain standard errors that are valid under heteroscedasticity.

Achim Zeileis, one of the authors of sandwich, then made some comments on my blog post. He pointed out that one can actually call sandwich() directly on an ‘nls’ object (again, due to the linear approximation). After reading my post, he tried that, and found that the resulting standard errors were more realistic than what one gets directly from nls(), but they still were not as good as mine, where “good” means correct confidence levels for confidence intervals and so on. He speculated that this was due to the particular variant of E-W that I use. Well, here is an update.

First, Achim’s speculation was right. I tried some of the other variants of E-W, and they didn’t do as well. Of course, there may be other factors at work as well.

Upon further investigation, I found another intriguing phenomenon. I tried nlshc() on one of the NIST datasets, as follows:

regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
bstart <- list(b1=1,b2=1, b3=1)
z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,
   start=list(b1=1,b2=1, b3=1))

To my surprise, the output was an all-NAs matrix. This did not happen with hccm(), This is strange, as vcovHC() works fine otherwise, e.g. in the simulation I posted last time.

So, for the time being, in the new nlshc() I have replaced the call to vcovHC() by a call to hccm(). I’ve also added an option for the user to choose a different E-W variant, though given what we know so far, I don’t recommend it.

A couple of commenters asked about using alternatives to nls(), which in fact I had done with nlsLM(). I had done that because I had had some convergence problems with nls(), but actually have reverted to nls() in the new version of the package. (I will also make my function an S3 method later on, which I had originally decided not to do, but have changed my mind with Achim’s encouragement 🙂 )

Can You Say “Heteroscedasticity” 3 Times Fast?

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can’t be true, even approximately.

Typical books discuss assessment of that assumption using residual plots and the like, but then leave it at that. Rather few books mention Eicker-White theory, which develops valid asymptotic inference for heteroscedastic data. E-W is really nice, and guess what! — it’s long been available in R, in the sandwich and car packages on CRAN.  (Note that the latter package is intended for use with a book that does cover this topic, J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011.) Then instead of using R’s standard vcov() function to obtain estimated variances and covariances of the estimates of the βi, we use vcovHC() and hccm(), respectively.

One can make a similar derivation for nonlinear regression, which is available as the function nlshc() in my regtools package. The package will be associated with my own book on regression, currently in progress. (The package is currently in progress as well, but already contains several useful functions.) The rest of this post is adapted from an example in the book.

The model I chose for this simple example was E(Y | X = t) = 1 / t’β, where the distributions of the quantities can be seen in the following simulation code:

sim <- function(n,nreps) {
 b <- 1:2
 res <- replicate(nreps,{
 x <- matrix(rexp(2*n),ncol=2)
 meany <- 1 / (x %*% b)
 y <- meany + (runif(n) - 0.5) * meany
 xy <- cbind(x,y)
 xy <- data.frame(xy)
 nlout <- nls(X3 ~ 1 / (b1*X1+b2*X2),
 data=xy,start=list(b1 = 1,b2=1)) 
 b <- coef(nlout)
 vc <- vcov(nlout)
 vchc <- nlsvcovhc(nlout)
 z1 <- (b[1] - 1) / sqrt(vc[1,1])
 z2 <- (b[1] - 1) / sqrt(vchc[1,1])
 print(mean(res[1,] < 1.28))
 print(mean(res[2,] < 1.28))

The results were striking:

> sim(250,2500)
[1] 0.6188
[1] 0.9096

Since the true value should be 0.90 (1.28 is the 0.90 quantile of N(0,1)), the Eicker-White method is doing an outstanding job here — and R’s built-in nonlinear regression function, nls() is not. The latter function’s reported standard errors are way, way off the mark.

New R Software/Methodology for Handling Missing Dat

I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu.

There is a long history of development of techniques for handling missing data. See the famous book by Little and Rubin (currently second edition, third due out in December). The main methods in use today fall into two classes:

  • Complete-cases (CC): (Also known at listwise deletion.) This approach is simple — just delete any record for which at least one of the variables has a missing (NA, in R) value.
  • Multiple imputation (MI): These methods involve estimating the conditional distribution of a missing variable from the others, and then sampling from that distribution via simulation. Multiple alternate versions of the data matrix are generated, with the NA values replaced by values that might have been the missing one.

In our work, we revisited, and broadened the scope of, another class of methods, which had been considered in the early years of missing-data research but pretty much abandoned for reasons to be explained shortly:

  • Available cases (AC): Also known as pairwise deletion.) If the statistical method involves computation involving, say, various pairs of variables, include in such a calculation any observation for which this pair is intact, regardless of whether the other variables are intact. The same holds for triples of variables and so on.

The early work on AC involved linear regression analysis. Say we are predicting a scalar Y from a vector X. The classic OLS estimator is (U’U)-1U’V, where U is the matrix of X values and V is the vector of Y values in our data. But if we center our data, that expression consists of the inverse of the sample covariance matrix of X, times the sample covariance of X and Y.

The key point is then that covariances only involve products of pairs of variables. As a simple example, say we are predicting human weight from height and age. Under AC, estimation of the covariance between weight and height can be done by using all records in the data for which the weight and height values are intact, even if age is missing. AC thus makes more thorough use of the data than does CC, and thus AC should be statistically more accurate than CC.

However, CC and AC make more stringent assumptions (concerning the mechanism underlying missingness) than does MI. Hence the popularity of MI. For R, for instance, the packages mi, mice and Amelia and others handle missing data in general,

We used Amelia as our representative MI method. Unfortunately, it is very long-running. In a PCA computation that we ran, for example, CC and AC took 0.0111 and 1.967 seconds, respectively, while MI had a run time of  92.928 seconds. And statistically, it was not performing any better than CC and AC, so we did not include it in our empirical investigations, though we did analyze it otherwise.

Our experiments involved taking real data sets, then randomly inserting NA values, thus generating many versions of the original data. One of the data sets, for instance, was from the 2008 census, consisting of all programmers and engineers in Silicon Valley. (There were about 20,000 in the PUMS sample. This data set is included in the regtools package.) The following table shows the variances of CC and AC estimates of the first regression coefficient (the means were almost identical, essentially no bias):

NA rate CC var. AC var.
0.01 0.4694873 0.1387395
0.05 2.998764 0.7655222
0.10 8.821311 1.530692

As you can see, AC had much better accuracy than CC on this real data set, and in fact was better than CC on the other 3 real data sets we tried as well.

But what about the famous MCAR assumption underlying CC and AC, which is stricter than the MAR assumption of MI methods? We argue in the paper (too involved to even summarize here) that this may be much less of an issue than has been supposed.

One contribution of our work is to extend AC to non-covariance settings, namely log-linear models.

Please try the software (in the functions lmac(), pcac() and loglinac() in the package), and let me know your thoughts.