Say we are doing classification analysis with classes labeled 0 through m-1. Let N_i be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the N_i. I will argue here, though, that the problem is much worse in the case in which there is — artificially — little or no variation among those sample sizes.

To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.

First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.

But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t) as the sample size grows. That’s not true for what I will call the artificially balanced case. Here the N_i are typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here’s why.

By an elementary derivation, we have that (at the population level)

P(Y = 1 | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)])   Eqn. (1)

where f and g are the densities of X within Classes 0 and 1. Consider the logistic model. Equation (1) implies that

β_0 + β_1 t_1 + … + β_v t_v = -ln[(1-p)/p] - ln[f(t)/g(t)]   Eqn. (2)

From this you can see that β_0 involves the quantity

-ln[(1-p)/p], Eqn. (3)

which in turn implies that if the sample sizes are chosen artificially, then our estimate of β_0 in the output of R’s **glm()** function (or any other code for logit) will be wrong. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.
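Here is a quick population-level check of that bias (my illustration, not from the original post), assuming for concreteness a scalar X with within-class densities N(0,1) for Class 0 and N(1,1) for Class 1, so that the logit model of Equation (2) holds exactly. No fitting is involved; we evaluate Equation (1) directly under the true p and under the artificial p = 0.5:

```python
import math

def norm_pdf(x, mu):
    # normal density with mean mu and standard deviation 1
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def logit_pY1(t, p):
    # log-odds of Y = 1 given X = t, read off from Eqn. (1):
    # logit = -ln[(1-p)/p] - ln[f(t)/g(t)]
    f, g = norm_pdf(t, 0.0), norm_pdf(t, 1.0)  # Class-0, Class-1 densities
    return -math.log((1 - p) / p) - math.log(f / g)

p_true, p_balanced = 0.1, 0.5
t1, t2 = -1.0, 2.0

# The slope (coefficient on t) is the same under either sampling scheme...
slope_true = (logit_pY1(t2, p_true) - logit_pY1(t1, p_true)) / (t2 - t1)
slope_bal = (logit_pY1(t2, p_balanced) - logit_pY1(t1, p_balanced)) / (t2 - t1)

# ...but the intercept is off by exactly the Eqn. (3) quantity
shift = logit_pY1(t1, p_balanced) - logit_pY1(t1, p_true)
print(slope_true, slope_bal, shift, -math.log(p_true / (1 - p_true)))
```

The printed shift matches -ln[p/(1-p)] and does not depend on t, which is exactly the permanent intercept bias described above: no amount of additional balanced data makes it go away.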

So, arguably the problem of (artificially) *balanced* data is worse than the *unbalanced* case.

The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of β_i WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right value for that expression, subtract the wrong one, and then happily do our future classifications.
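As a minimal sketch of that arithmetic (function name hypothetical, not from the post), the fix is to subtract the wrong Equation (3) term — the one implied by the artificial sampling proportions — and add the right one based on the external estimate of p:

```python
import math

def correct_intercept(b0_balanced, p_true, p_sample=0.5):
    """Fix the intercept from a logit fit on artificially balanced data.

    Per Eqn. (3), the fitted intercept absorbs -ln[(1-p_sample)/p_sample];
    subtract that wrong term and add the one based on the external
    estimate p_true. The slope estimates need no adjustment.
    """
    wrong = -math.log((1 - p_sample) / p_sample)
    right = -math.log((1 - p_true) / p_true)
    return b0_balanced - wrong + right
```

With exact 50/50 sampling the wrong term is 0, so the correction reduces to adding ln[p/(1-p)] to the fitted intercept.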

As an example, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true frequencies.
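Equation (1) is stated for two classes; for a 26-class problem like Letters, a natural generalization (my extrapolation, not spelled out in the post) is that equal-size class sampling makes the estimated P(Y = i | X = t) proportional to the within-class density alone, so reweighting by external priors and renormalizing recovers the correct probabilities:

```python
def adjust_class_probs(balanced_probs, true_priors):
    """Convert class probabilities estimated from equal-size class samples
    into probabilities under the true class priors.

    With equal sampling, the balanced estimate of P(Y=i | X=t) is
    proportional to the within-class density g_i(t); reweighting by the
    true prior p_i and renormalizing recovers P(Y=i | X=t).
    """
    w = [q * p for q, p in zip(balanced_probs, true_priors)]
    s = sum(w)
    return [x / s for x in w]
```

For m = 2 this reduces to the Equation (1) adjustment with the correct (1-p)/p.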

Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment. For each t at which we wish to predict class membership, we do the following:

- Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my **regtools** package.
- Solve for the estimated ratio f(t)/g(t).
- Plug back into (1), this time with the correct value of (1-p)/p from the external source, now providing the correct value of the LHS.
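The steps above can be sketched for a single point t as follows (a minimal Python illustration, not from the post; q_balanced stands in for whatever nonparametric estimate — kNN, say — was computed from the balanced sample):

```python
def adjust_nonparametric(q_balanced, p_true, p_sample=0.5):
    """Carry out the three steps above for one point t.

    q_balanced: a nonparametric estimate of P(Y=1 | X=t) obtained from
    the artificially balanced sample, so Eqn. (1) holds with p = p_sample.
    """
    # Step 2: solve Eqn. (1) for the density ratio f(t)/g(t)
    r_sample = (1 - p_sample) / p_sample
    ratio = (1 / q_balanced - 1) / r_sample
    # Step 3: plug back in with the correct (1-p)/p from the external source
    r_true = (1 - p_true) / p_true
    return 1 / (1 + r_true * ratio)
```

For example, a balanced-sample estimate of 0.5 — meaning f(t) = g(t), so t carries no information about the class — is mapped back to the prior p itself, as Equation (1) says it should be.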

If the estimated probabilities are not themselves important, as in prediction settings, one can also choose the classification threshold adaptively. That could have the same effect as re-estimating the intercept term.

Since the stated goal of my post is Prediction, I meant the estimated probabilities to be used in thresholds (either formal or informal).
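In the two-class case the connection between the two approaches can in fact be made exact. The following sketch (mine, not from the post) checks that with 50/50 sampling, classifying via the Equation (1)-adjusted probability at the usual threshold of 0.5 yields the same decisions as thresholding the raw balanced estimate at 1-p:

```python
def predict_adjusted(q_balanced, p_true):
    # adjusted probability via Eqn. (1): recover f/g, re-plug with true p;
    # assumes 50/50 sampling, so (1-p_sample)/p_sample = 1
    ratio = 1 / q_balanced - 1  # estimated f(t)/g(t)
    r_true = (1 - p_true) / p_true
    return 1 / (1 + r_true * ratio) > 0.5

def predict_thresholded(q_balanced, p_true):
    # equivalent rule: threshold the unadjusted balanced estimate at 1-p
    return q_balanced > 1 - p_true

p = 0.1
qs = [0.05, 0.2, 0.5, 0.7, 0.89, 0.91, 0.99]  # values on both sides of 1-p
print(all(predict_adjusted(q, p) == predict_thresholded(q, p) for q in qs))
```

So a threshold of 1-p on the raw estimate is the model-based choice; the difference is that this value is derived from Equation (1), not found by trial and error.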

You may well be correct that experimenting with various threshold values might have a similar effect. My objection to that, though, is the word *might*. In fact, it is my chief objection to ML research in general, which to me consists too much of “Well, we thought of this *ad hoc* method, and tried it on some data sets, where it seemed to work well…” I think methodology that has a firm model-based foundation is much preferable.