New R Software/Methodology for Handling Missing Dat

I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu.

There is a long history of development of techniques for handling missing data. See the famous book by Little and Rubin (currently second edition, third due out in December). The main methods in use today fall into two classes:

  • Complete-cases (CC): (Also known at listwise deletion.) This approach is simple — just delete any record for which at least one of the variables has a missing (NA, in R) value.
  • Multiple imputation (MI): These methods involve estimating the conditional distribution of a missing variable from the others, and then sampling from that distribution via simulation. Multiple alternate versions of the data matrix are generated, with the NA values replaced by values that might have been the missing one.

In our work, we revisited, and broadened the scope of, another class of methods, which had been considered in the early years of missing-data research but pretty much abandoned for reasons to be explained shortly:

  • Available cases (AC): Also known as pairwise deletion.) If the statistical method involves computation involving, say, various pairs of variables, include in such a calculation any observation for which this pair is intact, regardless of whether the other variables are intact. The same holds for triples of variables and so on.

The early work on AC involved linear regression analysis. Say we are predicting a scalar Y from a vector X. The classic OLS estimator is (U’U)-1U’V, where U is the matrix of X values and V is the vector of Y values in our data. But if we center our data, that expression consists of the inverse of the sample covariance matrix of X, times the sample covariance of X and Y.

The key point is then that covariances only involve products of pairs of variables. As a simple example, say we are predicting human weight from height and age. Under AC, estimation of the covariance between weight and height can be done by using all records in the data for which the weight and height values are intact, even if age is missing. AC thus makes more thorough use of the data than does CC, and thus AC should be statistically more accurate than CC.

However, CC and AC make more stringent assumptions (concerning the mechanism underlying missingness) than does MI. Hence the popularity of MI. For R, for instance, the packages mi, mice and Amelia and others handle missing data in general,

We used Amelia as our representative MI method. Unfortunately, it is very long-running. In a PCA computation that we ran, for example, CC and AC took 0.0111 and 1.967 seconds, respectively, while MI had a run time of  92.928 seconds. And statistically, it was not performing any better than CC and AC, so we did not include it in our empirical investigations, though we did analyze it otherwise.

Our experiments involved taking real data sets, then randomly inserting NA values, thus generating many versions of the original data. One of the data sets, for instance, was from the 2008 census, consisting of all programmers and engineers in Silicon Valley. (There were about 20,000 in the PUMS sample. This data set is included in the regtools package.) The following table shows the variances of CC and AC estimates of the first regression coefficient (the means were almost identical, essentially no bias):

NA rate CC var. AC var.
0.01 0.4694873 0.1387395
0.05 2.998764 0.7655222
0.10 8.821311 1.530692

As you can see, AC had much better accuracy than CC on this real data set, and in fact was better than CC on the other 3 real data sets we tried as well.

But what about the famous MCAR assumption underlying CC and AC, which is stricter than the MAR assumption of MI methods? We argue in the paper (too involved to even summarize here) that this may be much less of an issue than has been supposed.

One contribution of our work is to extend AC to non-covariance settings, namely log-linear models.

Please try the software (in the functions lmac(), pcac() and loglinac() in the package), and let me know your thoughts.


6 thoughts on “New R Software/Methodology for Handling Missing Dat”

    1. The package is still in its infancy, thus no vignette for now. I am developing it in conjunction with a book I’m writing, though of course the package can be used independently of the book.

      As I add more and more functions to regtools, I will occasionally discuss them here in the blog.

  1. An interesting post and it’s good to see the topic revisited. Testing by inserting NA’s at random does not correspond to problems with missing values in practice in my experience, there are often special structures that need to be accounted for. A good first step to get an overview is to plot the patterns of missings. Try the visna function in extracat (and then get on with your modelling).

    1. Yes, the problem of structural NAs is a very important one, and difficult to deal with well. I also like the comment, “and then get on with your modeling,” a point many of us need to keep in mind 🙂 .

      Good point about visna(). And though you are being modest to mention it, let me mention for everyone that there is material on that function in Antony’s excellent new book, Graphical Data Analysis in R.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s