Unbalanced Data Is a Problem? No, BALANCED Data Is Worse

Say we are doing classification analysis with classes labeled 0 through m-1. Let N_i be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is wide variation among the N_i. I will argue here, though, that the problem is much worse in the case in which there is, artificially, little or no variation among those sample sizes.

To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.

First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.

But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t), as the sample size grows. That’s not true for what I will call the artificially balanced case. Here the N_i are typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here’s why.

By an elementary derivation, we have that (at the population level)

P(Y = 1 | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)])    Eqn. (1)

where f and g are the densities of X within Classes 0 and 1, respectively. Now suppose the logistic model holds. Taking the log-odds of both sides of Equation (1) implies that

β0 + β1 t1 + … + βv tv = -ln[(1-p)/p] - ln[f(t)/g(t)]    Eqn. (2)

From this you can see that β0 involves the quantity

-ln[(1-p)/p],   Eqn. (3)

which in turn implies that if the sample sizes are chosen artificially, then our estimate of β0 in the output of R’s glm() function (or any other code for logit) will be wrong, because the class proportions in the sample stand in for p. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.
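For readers who want the "elementary derivation" and the size of the bias spelled out, here is a sketch in LaTeX. The symbol q, the proportion of Class 1 in the artificially sampled data, is notation introduced here for illustration, and the last two lines assume the logistic model holds exactly.

```latex
\begin{align*}
P(Y=1 \mid X=t)
  &= \frac{p\,g(t)}{p\,g(t) + (1-p)\,f(t)}
   = \frac{1}{1 + \frac{1-p}{p}\cdot\frac{f(t)}{g(t)}}
   && \text{(Bayes' rule, Eqn. 1)} \\[4pt]
\ln\frac{P(Y=1 \mid X=t)}{1 - P(Y=1 \mid X=t)}
  &= -\ln\frac{1-p}{p} - \ln\frac{f(t)}{g(t)}
   && \text{(log-odds, Eqn. 2)} \\[4pt]
\text{log-odds in the sampled data}
  &= -\ln\frac{1-q}{q} - \ln\frac{f(t)}{g(t)}
   && \text{(same } f, g \text{; } q \text{ replaces } p\text{)} \\[4pt]
\text{offset in the fitted intercept}
  &= \ln\frac{q}{1-q} - \ln\frac{p}{1-p}
\end{align*}
```

With perfectly balanced sampling, q = 1/2, the first term of the offset vanishes and the offset is ln[(1-p)/p]. For p = 0.1, say, that is ln 9, about 2.2 on the log-odds scale, no matter how large the sample.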

So, arguably the problem of (artificially) balanced data is worse than the unbalanced case.

The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of βi WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right value for that expression, add it to the estimated intercept, subtract the wrong value (Equation (3) evaluated at the proportion of Class 1 in our sample), and then happily do our future classifications.
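The intercept fix just described can be sketched in a few lines of plain Python; the same arithmetic carries over directly to the coefficients returned by glm(). The function and argument names (corrected_intercept, q_sample, p_true, predict_prob) are hypothetical, chosen for illustration, and the numbers are made up: a balanced sample, a true p of 0.1, and a fitted intercept of -0.5 with a single slope of 1.

```python
import math

def logit(u):
    """Log-odds of a probability u in (0, 1)."""
    return math.log(u / (1.0 - u))

def corrected_intercept(b0_hat, q_sample, p_true):
    """Swap the sample-based value of Equation (3) for the correct one.

    b0_hat   : intercept fitted on the artificially sampled data
    q_sample : proportion of Class 1 in that sample (hypothetical name)
    p_true   : externally obtained P(Y = 1) in the target population
    """
    # Equation (3) is -ln[(1-p)/p], which equals logit(p).
    return b0_hat - logit(q_sample) + logit(p_true)

def predict_prob(b0, slopes, t):
    """Logistic P(Y = 1 | X = t) for intercept b0 and a list of slopes."""
    lin = b0 + sum(b * x for b, x in zip(slopes, t))
    return 1.0 / (1.0 + math.exp(-lin))

# Illustration with made-up numbers: balanced sample (q = 0.5), true p = 0.1.
b0_fixed = corrected_intercept(b0_hat=-0.5, q_sample=0.5, p_true=0.1)
prob_before = predict_prob(-0.5, [1.0], [0.5])      # uses the biased intercept
prob_after = predict_prob(b0_fixed, [1.0], [0.5])   # uses the corrected intercept
```

The slopes are passed through untouched, in line with the point that only the intercept is affected by the sampling scheme.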

As an example, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true frequencies.

Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment.  For each t at which we wish to predict class membership, we do the following:

  • Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my regtools package.
  • Solve (1) for the estimated ratio f(t)/g(t), using the class proportions in our sample in place of p.
  • Plug back into (1), this time with the correct value of (1-p)/p from the external source, now providing the correct value of the LHS.
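The three steps above can be sketched as follows, again in plain Python. This is a minimal illustration, not the regtools interface: adjust_prob, q_sample and p_true are hypothetical names, and the estimate r_hat is assumed to come from whatever nonparametric method was applied to the artificially sampled data. The check at the end uses two unit-variance normal densities, chosen only because the exact answer is then known.

```python
import math

def adjust_prob(r_hat, q_sample, p_true):
    """Move an estimated P(Y = 1 | X = t) from the sample to the population.

    r_hat    : nonparametric estimate of the LHS of (1) from the sampled data
    q_sample : proportion of Class 1 in that sample (hypothetical name)
    p_true   : externally obtained P(Y = 1) in the target population
    """
    if not 0.0 < r_hat < 1.0:
        # The ratio is degenerate at 0 or 1; the estimate stays as it is.
        return r_hat
    # Step 2: solve (1) for f(t)/g(t), with q_sample in the role of p.
    ratio = (1.0 / r_hat - 1.0) * q_sample / (1.0 - q_sample)
    # Step 3: plug the ratio back into (1) with the correct (1-p)/p.
    return 1.0 / (1.0 + (1.0 - p_true) / p_true * ratio)

def normal_pdf(x, mean):
    """Density of a unit-variance normal, used only for the check below."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

# Check on a toy case: Class 0 is N(0,1), Class 1 is N(1,1), true p = 0.1.
t = 1.2
f_t, g_t = normal_pdf(t, 0.0), normal_pdf(t, 1.0)
r_balanced = 0.5 * g_t / (0.5 * g_t + 0.5 * f_t)   # target of a balanced sample
r_adjusted = adjust_prob(r_balanced, q_sample=0.5, p_true=0.1)
r_target = 0.1 * g_t / (0.1 * g_t + 0.9 * f_t)     # Equation (1) with the true p
```

In this toy case r_adjusted agrees with r_target up to rounding, which is just Equation (1) applied twice; with a real estimate r_hat the adjusted value inherits whatever estimation error r_hat carries.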

5 thoughts on “Unbalanced Data Is a Problem? No, BALANCED Data Is Worse”

  1. If the estimated probabilities themselves are not what is important, as when the goal is prediction, one can also choose the classification threshold adaptively. That could have the same effect as re-estimating the intercept term.

    1. Since the stated goal of my post is Prediction, I meant the estimated probabilities to be used in thresholds (either formal or informal).
      You may well be correct that experimenting with various threshold values might have a similar effect. My objection to that, though, is the word might. In fact, it is my chief objection to ML research in general, which to me consists too much of “Well, we thought of this ad hoc method, and tried it on some data sets, where it seemed to work well…” I think methodology that has a firm model-based foundation is much preferable.
