Unbalanced Data Is a Problem? No, BALANCED Data Is Worse

Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in the case in which there is — artificially — little or no variation among those sample sizes.

To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.

First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.

But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t), as the sample size grows. That’s not true for what I will call the artificially balanced case. Here the Nare typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here’s why.

By an elementary derivation, we have that (at the population level)

P(Y | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)])    Eqn. (1)

where f and g are the densities of X within Classes 0 and 1.  Consider the logistic model. Equation (1) implies that

β0 + β1 t1 + … + βv tv = -ln[(1-p)/p] – ln[f(t)/g(t)]    Eqn. (2)

From this you can see that

β0 = -ln[(1-p)/p],   Eqn. (3)

which in turn implies that if the sample sizes are chosen artificially, then our estimate of  βin the output of R’s glm() function (or any other code for logit) will be wrong. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.

So, arguably the problem of (artificially) balanced data is worse than the unbalanced case.

The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of βi WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right intercept term, and then happily do our future classifications.

As an example, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true frequencies.

Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment.  For each t at which we wish to predict class membership, we do the following:

  • Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my regtools package.
  • Solve for the estimated ratio f(t)/g(t).
  • Plug back into (1), this time with the correct value of (1-p)/p from the external source, now providing the correct value of the LHS.

More on the Heteroscedasticity Issue

In my last post, I discussed R software, including mine, that handles heteroscedastic settings for linear and nonlinear regression models. Several readers had interesting comments and questions, which I will address here.

To review: Though most books and software assume homoscedasticity, i.e. constancy of the variance of the response variable at all levels of the predictor variables, this condition seldom if ever holds in real life. This can make a mockery of the reported standard errors, and thus render confidence intervals and tests rather meaningless. But fortunately, there is asymptotic theory that gives use valid standard errors in the heteroscedastic case, due to Eickert, White and others, for the linear case. In R, this is available via the functions vcovHC() and hccm() in the CRAN packages sandwich and carrespectively.

I mentioned that I had adapted the E-W idea to the nonlinear realm, leveraging the linear-approximation core of the Gauss-Newton method for nonlinear least-squares computation, by calling vcovHC() on the linear approximation. The result is the function nlshc() in my package regtools. (Unfortunately, I erroneously used its old name, nlsvcovhc() in my post.) One simply calls nlshc() on an object of class ‘nls’ (output of R’s nls() function) to obtain standard errors that are valid under heteroscedasticity.

Achim Zeileis, one of the authors of sandwich, then made some comments on my blog post. He pointed out that one can actually call sandwich() directly on an ‘nls’ object (again, due to the linear approximation). After reading my post, he tried that, and found that the resulting standard errors were more realistic than what one gets directly from nls(), but they still were not as good as mine, where “good” means correct confidence levels for confidence intervals and so on. He speculated that this was due to the particular variant of E-W that I use. Well, here is an update.

First, Achim’s speculation was right. I tried some of the other variants of E-W, and they didn’t do as well. Of course, there may be other factors at work as well.

Upon further investigation, I found another intriguing phenomenon. I tried nlshc() on one of the NIST datasets, as follows:

regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
bstart <- list(b1=1,b2=1, b3=1)
z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,
   start=list(b1=1,b2=1, b3=1))

To my surprise, the output was an all-NAs matrix. This did not happen with hccm(), This is strange, as vcovHC() works fine otherwise, e.g. in the simulation I posted last time.

So, for the time being, in the new nlshc() I have replaced the call to vcovHC() by a call to hccm(). I’ve also added an option for the user to choose a different E-W variant, though given what we know so far, I don’t recommend it.

A couple of commenters asked about using alternatives to nls(), which in fact I had done with nlsLM(). I had done that because I had had some convergence problems with nls(), but actually have reverted to nls() in the new version of the package. (I will also make my function an S3 method later on, which I had originally decided not to do, but have changed my mind with Achim’s encouragement :-) )

Can You Say “Heteroscedasticity” 3 Times Fast?

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can’t be true, even approximately.

Typical books discuss assessment of that assumption using residual plots and the like, but then leave it at that. Rather few books mention Eicker-White theory, which develops valid asymptotic inference for heteroscedastic data. E-W is really nice, and guess what! — it’s long been available in R, in the sandwich and car packages on CRAN.  (Note that the latter package is intended for use with a book that does cover this topic, J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011.) Then instead of using R’s standard vcov() function to obtain estimated variances and covariances of the estimates of the βi, we use vcovHC() and hccm(), respectively.

One can make a similar derivation for nonlinear regression, which is available as the function nlshc() in my regtools package. The package will be associated with my own book on regression, currently in progress. (The package is currently in progress as well, but already contains several useful functions.) The rest of this post is adapted from an example in the book.

The model I chose for this simple example was E(Y | X = t) = 1 / t’β, where the distributions of the quantities can be seen in the following simulation code:

sim <- function(n,nreps) {
 b <- 1:2
 res <- replicate(nreps,{
 x <- matrix(rexp(2*n),ncol=2)
 meany <- 1 / (x %*% b)
 y <- meany + (runif(n) - 0.5) * meany
 xy <- cbind(x,y)
 xy <- data.frame(xy)
 nlout <- nls(X3 ~ 1 / (b1*X1+b2*X2),
 data=xy,start=list(b1 = 1,b2=1)) 
 b <- coef(nlout)
 vc <- vcov(nlout)
 vchc <- nlsvcovhc(nlout)
 z1 <- (b[1] - 1) / sqrt(vc[1,1])
 z2 <- (b[1] - 1) / sqrt(vchc[1,1])
 print(mean(res[1,] < 1.28))
 print(mean(res[2,] < 1.28))

The results were striking:

> sim(250,2500)
[1] 0.6188
[1] 0.9096

Since the true value should be 0.90 (1.28 is the 0.90 quantile of N(0,1)), the Eicker-White method is doing an outstanding job here — and R’s built-in nonlinear regression function, nls() is not. The latter function’s reported standard errors are way, way off the mark.

New R Software/Methodology for Handling Missing Dat

I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu.

There is a long history of development of techniques for handling missing data. See the famous book by Little and Rubin (currently second edition, third due out in December). The main methods in use today fall into two classes:

  • Complete-cases (CC): (Also known at listwise deletion.) This approach is simple — just delete any record for which at least one of the variables has a missing (NA, in R) value.
  • Multiple imputation (MI): These methods involve estimating the conditional distribution of a missing variable from the others, and then sampling from that distribution via simulation. Multiple alternate versions of the data matrix are generated, with the NA values replaced by values that might have been the missing one.

In our work, we revisited, and broadened the scope of, another class of methods, which had been considered in the early years of missing-data research but pretty much abandoned for reasons to be explained shortly:

  • Available cases (AC): Also known as pairwise deletion.) If the statistical method involves computation involving, say, various pairs of variables, include in such a calculation any observation for which this pair is intact, regardless of whether the other variables are intact. The same holds for triples of variables and so on.

The early work on AC involved linear regression analysis. Say we are predicting a scalar Y from a vector X. The classic OLS estimator is (U’U)-1U’V, where U is the matrix of X values and V is the vector of Y values in our data. But if we center our data, that expression consists of the inverse of the sample covariance matrix of X, times the sample covariance of X and Y.

The key point is then that covariances only involve products of pairs of variables. As a simple example, say we are predicting human weight from height and age. Under AC, estimation of the covariance between weight and height can be done by using all records in the data for which the weight and height values are intact, even if age is missing. AC thus makes more thorough use of the data than does CC, and thus AC should be statistically more accurate than CC.

However, CC and AC make more stringent assumptions (concerning the mechanism underlying missingness) than does MI. Hence the popularity of MI. For R, for instance, the packages mi, mice and Amelia and others handle missing data in general,

We used Amelia as our representative MI method. Unfortunately, it is very long-running. In a PCA computation that we ran, for example, CC and AC took 0.0111 and 1.967 seconds, respectively, while MI had a run time of  92.928 seconds. And statistically, it was not performing any better than CC and AC, so we did not include it in our empirical investigations, though we did analyze it otherwise.

Our experiments involved taking real data sets, then randomly inserting NA values, thus generating many versions of the original data. One of the data sets, for instance, was from the 2008 census, consisting of all programmers and engineers in Silicon Valley. (There were about 20,000 in the PUMS sample. This data set is included in the regtools package.) The following table shows the variances of CC and AC estimates of the first regression coefficient (the means were almost identical, essentially no bias):

NA rate CC var. AC var.
0.01 0.4694873 0.1387395
0.05 2.998764 0.7655222
0.10 8.821311 1.530692

As you can see, AC had much better accuracy than CC on this real data set, and in fact was better than CC on the other 3 real data sets we tried as well.

But what about the famous MCAR assumption underlying CC and AC, which is stricter than the MAR assumption of MI methods? We argue in the paper (too involved to even summarize here) that this may be much less of an issue than has been supposed.

One contribution of our work is to extend AC to non-covariance settings, namely log-linear models.

Please try the software (in the functions lmac(), pcac() and loglinac() in the package), and let me know your thoughts.

Partools 1.1.4

Partools 1.1.4 is now on GitHub.

The main change this time is enhancement of the debugging facilities (which work not only for partools but also the cluster-based portion of R’s parallel package in general). As some of you know, I place huge importance on debugging, so much so that I wrote a book on it (The Art of Debugging with GDB, DDD, and Eclipse}, N. Matloff and P. Salzman, NSP, 2008).

But debugging parallel code is hard, especially in the parallel package. The problem is that your R code running on the cluster nodes does so without having a terminal window associated with it. I’ve had various tools for dealing with that in partools from the beginning, but in the latest version their effectiveness is greatly enhanced by adding mechanisms involving R’s dump.frames(). This was Hadley’s idea, for which I am quite grateful. I’ve had a lot of fun using the enhanced debugging tools myself.

This also inspired me to finally add a debugging vignette to the package, which I had long planned to do but hadn’t gotten around to.

I also thank Gábor Csárdi for cleaning up the DESCRIPTION file.

I have more enhancements to partools in the pipeline. One of them involves k-NN nonparametric regression, using Software Alchemy but in a different way than you might think. Actually, I’ve already done this before, in my freqparcoord package with Yingkang Xie, but I’ll do a little tweaking before adding it here. This too is something I’ve been planning to do for a while but hadn’t gotten around to. What inspired me to give it a higher priority was a paper that I recently ran across by some researchers at Stanford and UCB, which establishes nice theoretical properties for a Software Alchemy-type approach for another kind  of nonparametric regression estimation, kernel ridge regression.

partools: a Sensible R Package for Large Data Sets

As I mentioned recently, the new, greatly extended version of my partools package is now on CRAN. (The current version on CRAN is 1.1.3, whereas at the time of my previous announcement it was only 1.1.1. Note that Unix is NOT required.)

It is my contention that for most R users who work with large data,  partools — or methods like it — is a better, simpler, far more convenient approach than Hadoop and Spark.  If you are an R user and, like most Hadoop/Spark users, don’t have a mega cluster (thousands of nodes), partools is a sensible alternative to Hadoop and Spark.

I’ll introduce partools usage in this post. I encourage comments (pro or con, here or in private). In particular, for those of you attending the JSM next week, I’d be happy to discuss the package in person, and hear your comments, feature requests and so on.

Why do I refer to partools as “sensible”? Consider:

  • Hadoop and Spark are quite difficult to install and configure, especially for non-computer systems experts. By contrast, partools just uses ordinary R; there is nothing to set up.
  • Spark, currently much favored  by many over Hadoop, involves new, complex and abstract programming paradigms, even under the R interface, SparkR. By contrast, again, partools just uses ordinary R.
  • Hadoop and Spark, especially the latter, have excellent fault tolerance features. If you have a cluster consisting of thousands of nodes, the possibility of disk failure must be considered. But otherwise, the fault tolerance of Hadoop and Spark are just slowing down your computation, often radically so.  (You could also do your own fault tolerance, ranging from simple backup to sophisticated systems such as Xtreemfs.)

What Hadoop and Spark get right is to base computation on distributed files. Instead of storing data in a monolithic file x, it is stored in chunks, say x.01, x.02,…, which can greatly reduce network overhead in the computation. The partools package also adopts this philosophy.

Overview of  partools:

  • There is no “magic.”  The package merely consists of short, simple uitiliies that make use of R’s parallel package.
  • The key philosophy is Keep It Distributed (KID). Under KID, one does many distributed operations,, with a collective operation being doing occasionally, when needed.

Sample partools (PT) session (see package vignette for details, including code, output):

  • 16-core machine.
  • Flight delay data, 2008. Distributed file created previously from monolithic one  via PT’s filesplit().
  • Called PT’s fileread(), causing each cluster node to read its chunk of the big file.
  •  Called PT’s distribagg() to find max values of DepDelay, ArrDelay, Airtime. 15.952 seconds, vs. 249.634 for R’s serial aggregate().
  • Interested in Sunday evening flights.  Each node performs that filtering op, assigning to data frame sundayeve. Note that that is a distributed data frame, in keeping with KID.
  • Continue with KID, but if later we want to un-distribute that data frame, we could call PT’s distribgetrows().
  • Performed a linear regression analysis, predicting ArrDelay from DepDelay and Distance, using Software Alchemy, via PT’s calm() function. Took 18.396 seconds, vs. 76.225 for ordinary lm(). (See my new book, Parallel Computation for Data Science, for details on Software Alchemy.)
  • Did a distributed na.omit() to each chunk, using parallel‘s clusterEvalQ(). Took 2.352 seconds, compared to 9.907 it would have needed if not distributed.
  • Performed PCA. Took 8.949 seconds for PT’s caprcomp(), vs. 58.444 for the non-distributed case.
  • Calculated interquartile range for each of 12 variables, taking 2.587 seconds, compared to 29.584 for the non-distributed case.
  • Performed a more elaborate  distributed na.omit(), in time 9.293, compared to 55.032 in the serial case.

Again, see the vignette for details on the above, on how to deal with files that don’t fit into memory etc.


Get every new post delivered to your Inbox.

Join 135 other followers