Some Comments on Donoho’s “50 Years of Data Science”

An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The paper should be required reading in all Stat and CS Departments. But as a CS person with deep roots in statistics, I believe there are a few points Donaho should have developed more, which I will discuss here, as well as other points on which his essay really shines.

Though no one seems to claim to know what data science is — not even on an “I know it when I see it” basis — everyone seems to agree that it is roughly a combination of statistics and computer science. Fine, but what does that mean? Let’s take the computer science aspect first.

By CS here, I mean facility with computers, and by that in turn I mean more than programming. By happenstance, I was in a conversation today with some CS colleagues as to whether material on computer networks should be required for CS majors. One of my colleagues said there were equally deserving topics, such as Hadoop. My reply was that Hadoop is SLOW (so much so that many are predicting its imminent demise), and maximizing its performance involves, inter alia, an understanding of…computer networks. Donoho doesn’t cover this point about computation (nor, it seems, do most data science curricula), limiting himself to programming languages and libraries.

But he does a fine job on the latter. I was pleased that his essay contains quite a bit of material on R, such as the work of Yihui Xie and Hadley Wickham. That a top theoretician devotes so much space in a major position paper to R is a fine tribute to the status R has attained in this profession.

(In that context, I feel compelled to note that in attending a talk at Rice in 2012 I was delighted to see Manny Parzen, 86 years old and one of the pioneers of modern statistical theory, interrupt his theoretical talk with a brief exposition on the NINE different definitions of quantile available in calls to R’s quantile() function. Bravo!)

Donoho notes, however, that the Berkeley data science curriculum uses Python instead of R. He surmises that this is due to Python handling Big Data better than R, but I suspect it has more to do with the CS people at UCB being the main ones designing the curriculum, acting on a general preference in CS for the “more elegant” language Python.

But is Python the better tool than R for Big Data? Many would say so, I think, but a good case can be made for R. For instance, to my knowledge there is nothing in Python like CRAN’s bigmemory package, giving a direct R interface to shared memory at the C++ level. (I also have a parallel computation package, Rdsm, that runs on top of bigmemory.)

Regretably, the Donoho essay contains only the briefest passing reference to parallel computation. But again, he is not alone. Shouldn’t a degree in data science, ostensibly aimed in part at Big Data, include at least some knowledge of parallel computation? I haven’t seen any that do. Note, though, that coverage of such material would again require some knowledge of computer system infrastructure, and thus being at odds with the “a little of this, a little of that, but nothing in depth” philosophy taken so far in data science curricula.

One topic I was surprised to see the essay omit was the fact that so much data today is not in the nice “rectangular” — observations in rows, variables in equal numbers of columns — form that most methodology assumes. Ironically, Donoho highlights Hadley Wickham’s plyr package, as rectangular as can be. Arguably, data science students ought to be exposed more to sophisticated use of R’s tapply(), for instance.

Now turning to the stat aspect of Data Science, a key theme in the essay is, to borrow from Marie Davidian, aren’t WE (statistics people) Data Science? Donoho does an excellent job here of saying the answer is Yes (or if not completely Yes, close enough so that the answer could be Yes with a little work). I particularly liked this gem:

It is striking how, when I review a presentation on today’s data science, in which statistics is superficially given pretty short shrift, I can’t avoid noticing that the underlying tools, examples, and ideas which are being taught as data science were all literally invented by someone trained in Ph.D. statistics, and in many cases the actual software being used was developed by someone with an MA or Ph.D. in statistics. The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can’t be hidden in the teaching, research, and exercise of Data Science.

Yes! Not only does it succinctly show that there is indeed value to theory, but also it illustrates that point that many statisticians are not computer wimps after all. Who needs data science? :-)

I believe that Donoho, citing Leo Breiman, is too quick to concede the prediction field to Machine Learning. As we all know, prediction has been part of Statistics since its inception, literally for centuries. Granted, modern math stat has an exquisitely developed theory of estimation, but I have seen too many Machine Learning people, ones I otherwise respect highly, make the absurd statement, “ML is different from statistics, because we do prediction.”

Indeed, one of Donoho’s most salient points is that having MORE methods available for prediction is not the same as doing BETTER prediction. Indeed, he shows the results of some experiments he conducted with Jiashun Jin on some standard real data sets, in which a very simple predictor is compared to various “fancy” ones:

Boosting, Random Forests and so on are dramatically more complex and have correspondingly higher charisma in the Machine Learning community. But against a series of pre-existing benchmarks developed in the Machine Learning community, the charismatic methods do not outperform the homeliest of procedures…

This would certainly be a shock to most students in ML courses — and to some of their instructors.

Maybe the “R people” (i.e. Stat Departments) have as much to contribute to data science as the “Python people” (CS) after all.

The Generalized Method of Moments and the gmm package

An almost-as-famous alternative to the famous Maximum Likelihood Estimation is the Method of Moments. MM has always been a favorite of mine because it often requires fewer distributional assumptions than MLE, and also because MM is much easier to explain than MLE to students and consulting clients. CRAN has a package gmm that does MM, actually the Generalized Method of Moments, and in this post I’ll explain how to use it (on the elementary level, at least).

Our data set will be bodyfat, which is included in the mfp package, with measurements on 252 men. The first column of this data frame, brozek, is the percentage of body fat, which when converted to a proportion is in (0,1). That makes it a candidate for fitting a beta distribution.

Denoting the parameters of that family by α and β, the mean and variance are α / (α + β) and α β / ((α + β)2 (α + β + 1)), respectively. MM is a model of simplicity — we match these to the sample mean and variance of our data, and then solve for α and β. Of course, the solution part may not be so simple, due to nonlinearities, but gmm worries about all that for us. Yet another tribute to the vast variety of packages available on CRAN!

In our elementary usage here, the call form is


where data is our data in matrix or vector form, momentftn specifies the moments, and start is our initial guess for the iterative solution process. Let’s specify momentftn:

g <- function(th,x) {
  t1 <- th[1]
  t2 <- th[2]
  t12 <- t1 + t2
  meanb <- t1 / t12
  m1 <- meanb - x
  m2 <- t1*t2 / (t12^2 * (t12+1)) - (x - meanb)^2
  f <- cbind(m1,m2)

This function equates population moments to sample ones, by specifying expressions that gmm() is to set to 0. The argument th here (“theta”) will be the MM estimates (at any given iteration) of the population parameters, in this case of α and β.

The function is required to specify quantities whose averages are to be set to 0. So, in the line

m1 <- meanb – x

we are saying that we want the average of the right-hand side to be 0, where x is our data. That has the effect of setting our current iteration’s estimate of α / (α + β) to our sample mean — exactly what MM is supposed to do. The line assigning to m2 then does the same thing for variance.

So, let’s try all this out on the body fat data:

> gmm(g,x,c(alpha=0.1,beta=0.1))

Objective function value: 2.559645e-10 

 alpha beta 
 4.6714 19.9969 

Convergence code = 0 
> hist(bodyfat$brozek/100,xlim=c(0,1),
> curve(dbeta(x,4.67,20.00),add=TRUE)

The result looks like this (apologies for the lack of refinement in this quick graph, cutting off part of the top):

At least visually, it seems to be a pretty good fit.

For standard errors etc., a method for the generic function vcov() is provided:

> gmmout <- gmm(g,x,c(alpha=0.1,beta=0.1))
> vcov(gmmout)
          alpha beta
alpha 0.2809361 0.9606354
beta 0.9606354 3.9266874

Happy GMM-ing!

Lots more posts coming, when I have time.

The Method of Boosting

One of the techniques that has caused the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining, e.g. by reweighting, of estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability.

Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term off-the-shelf meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.

In this post I will

  • Briefly introduce the topic.
  • Give a view of boosting that may not be well known.
  • Give a surprising example.

As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.


The key point, almost always missed in technical discussions, is that boosting is really about bias reduction. Take the linear model, our example in this posting.

A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.

To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, Statistical Learning from a Regression Perspective, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.

Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):

  1.  Fit the OLS model, naming the resulting coefficient vector b0. Calculate the residuals and their sum of squares, and set i =0.
  2.  Update i to i+1. Fit a weighted least-squares model, using as weights the absolute residuals, naming the result bi. Calculate the new residuals and sum of squares.
  3. If i is less than the desired number of iterations k, go to Step 2.

In the end, take your final coefficient vector to be a weighted average of all the bi, with the weights being inversely proportional to the sums of squared residuals.

Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.

Choice of weapons:

R offers a nice variety of packages for boosting. We’ll use the mboost package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.

Of course, like any self-respecting R package, mboost offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.

The data:

Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.

This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.

After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.

R-squared, whether a population value or the sample estimate reported by lm(), is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.

We’ll use the functions from my regtools package.

> curr <- read.table('EXC.ASC',header=TRUE)
> curr1 <- curr
> curr1[,-5] <- scale(curr1[,-5])
> fout1 <- lm(Yen ~ .,data=curr1)
> library(regtools)
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE)
> kout <- knnest(curr1[,5],xdata,25)
> ypredknn <- knnpred(kout,xdata$x)
> cor(ypredknn,curr1[,5])^2
[1] 0.9817131

This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.

So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.

> library(mboost)
> trnidxs <- sample(1:761,500)
> predidxs <- setdiff(1:761,trnidxs)
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
> mean(abs(predy-mbpred))
[1] 14.03786
> mean(abs(predy-lmpred))
[1] 13.20589

Well, lo and behold, boosting actually did worse than OLS! Clearly we can’t generalize from this, and as mentioned, many analysts have reported big gains from boosting. And though Breiman was one of the giants of this field, the example here shows that boosting is not ready for off-the-shelf usage. On the contrary, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations of boosting seemed to be converging, only to have them suddenly diverge.

If one is willing to go slightly past “off the shelf,” one can set the number of iterations, using boost_control() or mstop(). And if one is happy to go completely past “off the shelf,” the mboost package has numerous other options to explore.

Once again: There are no “silver bullets” in this field.




OVA vs. AVA in Classification Problems, via regtools

OVA and AVA? Huh?

These stand for One vs. All and All vs. All, in classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package.

As some of you know, I’m developing the latter in conjunction with a book I’m writing on regression and classification. The package, of course, is usable and helpful independently of the book, though the material in this post will be drawn largely from the book.

In the verterbral data there are m = 3 classes: Normal, Disk Hernia and Spondylolisthesis. The predictors are, as described on the UCI site, “six biomechanical attributes derived from the shape and orientation of the pelvis.”

Let m denote the number of classes. Consider two approaches we might take to predicting the status of the vertebral column, based on logistic regression:

  • One vs. All (OVA): Here we would fit 3 logit models to our training data, predicting each of the 3 classes, one at a time, from our 6 variables. The regtools function ovalogtrn() does this for us, yielding a 7×3 matrix, which is then used for future predictions. For a given new data point, we guess the unknown class to be whichever one has maximal probability, given the new data point; the regtools function ovalogpred() handles the details for us here.
  • All vs. All (AVA): Here we look at all possible pairs of classes. There will again be 3 of them in this case, though in general the number of pairs will be m (m-1) / 2, with that many columns in our output matrix, as opposed to just m for OVA. At any rate, for each pair we restrict our training data to just the points corresponding to one of the two classes in the pair, then run a logit analysis predicting, say, the first class of the pair.  The regtools functions avalogtrn() and avalogpred() do the work for us.

Clearly, AVA involves a lot of computation. For fixed number of predictor variables p, here is a rough time estimate. For a logit model, the computation will be proportional to the number of cases n (due to computing various sums over all cases). Say our training data is approximately balanced in terms of sizes of the classes, so that the data corresponding to class i has about n/m cases in it, Then the computation for one pair will be O(n/m), but there will be O(m2) pairs, so the total amount of computation will be O(mn) –potentially much larger than the O(n) used by OVA.

Well, then, do we benefit from that extra computation? At least at first glance, AVA would not seem to have much to offer. For instance, since each of its models uses much less than our full data, the resulting estimated coefficients will likely be less accurate than what we calculate under OVA. And if m is large, we will have so many pairs that at least some will likely be especially inaccurate. And yet some researchers claim they find AVA to work better, due to imperfections in the model used.

Let’s try it out on the vertebral column data (warning messages, signaling probabilities near 0 or 1, not shown):

> vert <- read.table('Vertebrae/column_3C.dat',
> vert$V7 <- as.numeric(vert$V7) - 1
> trnidxs <- sample(1:310,225)
> predidxs <- setdiff(1:310,trnidxs)
> ovout <- ovalogtrn(3,vert[trnidxs,])
> predy <- ovalogpred(ovout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,])
[1] 0.8823529
> avout <- avalogtrn(3,vert[trnidxs,])
> predy <-  avalogpred(3,avout,vert[predidxs,1:6])
> mean(predy == vert[predidxs,7])
[1] 0.8588235

The function ovalogtrn() requires the response (class) variable to be coded 0,1,…,m-1, hence the call to as.numeric(),

At any rate, not much difference, if any, between OVA and AVA in this example. However, the selling point of AVA is supposed to be that it may be effective when the model we are using is not approximately valid.

A good candidate for such a model is the logit appled to the letter recognition data. (I discovered this when the logit turned out to do much less well than k-Nearest Neighbors, and in retrospect it seems plausible, given the nature of the predictors.) The difference between OVA and AVA here was dramatic:

> library(regtools) 
Loading required package: FNN 
> library(mlbench) 
> data(LetterRecognition) 
> lr <- LetterRecognition 
> lr[,1] <- as.numeric(lr[,1]) - 1 
> # training and test sets 
> lrtrn <- lr[1:14000,] 
> lrtest <- lr[14001:20000,] 
> ologout <- ovalogtrn(26,lrtrn[,c(2:17,1)]) 
> ypred <- ovalogpred(ologout,lrtest[,-1]) 
> mean(ypred == lrtest[,1]) 
[1] 0.7193333 
> alogout <- avalogtrn(26,lrtrn[,c(2:17,1)])
> ypred <- avalogpred(26,alogout,lrtest[,-1])
> mean(ypred == lrtest[,1])
[1] 0.8355

So, apparently AVA fixed a poor model. Of course, it’s better to make a good model in the first place. :-)

In fact, it turns out that adding quadratic terms to the predictors (not shown) helps a lot. Thus I don’t suggest using AVA as your go-to method. But it’s there in regtools if you want to try it.

Back to the BLAS Issue

A few days ago, I wrote here about how some researchers, such Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique — random effects models — for a new, new application — recommender systems. In addition to describing their approach to that problem, I also used this setting as an example of my partools package, in particular the parallel computation method I call Software Alchemy (SA). I illustrated SA on some movie ratings data, and got a nice speedup of the lmer() function in the package lme4. The run time for SA was a little over 3 minutes, compared to nearing 20 minutes without.

However, yesterday Doug Bates, one of the developers of lme4, wrote a comment on my blog, reporting that the run time for the same problem on his desktop machine was about 3 minutes, compared to my 20. This was quite intriguing, and I replied that I would investigate. I also mentioned that I would make a new post out of the matter, as I suspected it would be of general interest. Hence the post you are now reading.

I re-ran my code on the same machine I had used earlier, a PC running Linux in our student lab. I’ll skip the specs here, but suffice it to say it is an ordinary machine, about a year old. At any rate, I got basically the same run times as before.

It then occurred to me that Doug would probably have installed a fancy BLAS — the Basic Linear Algebra Subroutines that run R’s matrix ops — on his machine, whereas our student machine would just be running stock R, and therein lay the source of the big time discrepancy. You may recall my posting here on OpenBLAS, one of the fast alternatives to the BLAS version that comes with R.

So, I switched to a machine on which I had installed OpenBLAS. This machine actually is slower than the other one in terms of clock speed, but it does have 16 cores, and a hyperthreading degree of 2, so that 32 threads might profitably run simultaneously. (The first machine was quadcore with hyperthreading degree 2, but as noted, I didn’t have OpenBLAS installed there.) So, this second machine would make a fine testbed for assessing the performance of lme4 under OpenBLAS.

Sure enough, on this nominally slower machine, the problem ran in only about 8 minutes, not 20. And when I ran the top command from another shell window, I saw via the “% CPU” column that indeed many cores were at work, with the number fluctuating in the range 10-32. Remember, running k cores doesn’t necessarily mean a speedup of k, and often we get much less than that, but you can see that running a good BLAS can work wonders for the speed.

Note that, as mentioned briefly in my last post, SA can achieve superlinear speedup in some instances — MORE than speedup k for k cores. As explained in my book, this occurs when the time complexity for n data points is more than O(n). In my lme4 example, there was about a 6X speedup for just 4 cores.




Partools, Recommender Systems and More

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems.  (See also a recent paper by P.O. Perry.)

The typical example of a recommender system involves movie review data, such as that in Netflix and MovieLens, with various users rating various movies. We have a matrix, with rows representing users and columns for movies, but it is a very sparse matrix, because most users don’t rate most movies. So, we’d like to guess the missing elements, a process sometimes termed the matrix completion problem.

There are many methods for this, such as matrix factorization, as exemplified in the R package NMF. I much prefer the approach taken by Art and Katelyn, as it is more statistical and thus potentially can provide more insight into the underlying processes.

I was inspired by their research to start a project that takes a new approach to random effects models in general. I’ll explain that later, but let’s first address the major computational problems that can occur with random effects models when applied to big data. Note carefully that not only is this a problem with long run times, but also with memory limitations. For example, Pennell and Dunson wrote, in Fitting Semiparametric Random Effects Models to Large Data Sets, Biostatistics, 2007, “SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.”

One of the features of my partools package is what I call Software Alchemy, with the term alluding to the fact that the procedure turns a statistical algorithm into a statistically equivalent version that is easily parallelized. The idea is simple: Break i.i.d. data into chunks, apply the original algorithm to each chunk, then average the results. It’s easy to prove that the statistical accuracy of this method matches that of the original one, and one not only gets a speedup but also may solve the memory problem, since each chunk is smaller. See my paper on this, to appear soon in the Journal of Statistical Software, but also outlined in my book on parallel computation in data science.  (Chunking has been tried in random effects contexts before; see references in the Perry paper above.)

Here is an example of all this, applied to the MovieLens data, 1-million records version. I’ll use Art and Katelyn’s model here, which writes the rating by user i of movie j as

Yij = μ + ai + bj + εij

where μ is an unknown constant and ai + bj + εij are independent, mean-0 variables (not necessarily normal) with variances σa2, σb2 and σε2. One can also add covariates, but we’ll keep things simple here. In addition to estimating the variances, one can estimate the ai and bj, and thus predict the missing Yij. (See the b component in the output of lmer(), to be discussed below. It lists the estimated ai and bj, in that order.)

Art and Katelyn (as well as Perry) use the Method of Moments rather than the standard methods for random effects models, so that (a) they reduce computation, by deriving explicit solutions and (b) are not tied to assumptions of normality. I use MM in my project too, but here I’ll use the famous R package for random effects models, lme4.  I ran the experiments here on a quad-core machine, using an R parallel cluster of that size.

I first ran without chunking:
> system.time(lmerout <- lmer(V3 ~ (1|V1) +  (1|V2),ml1m))
user system elapsed
1105.129 3.692 1108.946
> getME(lmerout,'theta')
V1.(Intercept) V2.(Intercept)
0.3994240 0.6732281

Not good — almost 20 minutes! So, I turned to partools.The main Software Alchemy function is cabase(). For a given statistical algorithm, one needs to specify the function that implements the algorithm and a function to extract the estimated parameters from the output of the algorithm. There is already a wrapper for lm() in the package, so I just modified it for lmer():


calmer <- function (cls, lmerargs) 
 ovf <- function(u) {
    tmp <- paste("lmer(", lmerargs, ")", collapse = "")
 estf <- function(lmerout) {
 cabase(cls, ovf, estf)

I had earlier used partools function distribsplit() to distribute my data frame m1m to the cluster nodes, in chunks of the same name. So, the call to cabase() does the estimation on each chunk, collects the results, and averages them. Here is what happened:

> system.time(print(calmer(cls,'V3 ~ (1|V1) + (1|V2),ml1m')))
 V2.(Intercept) V1.(Intercept)
[1,] 0.6935344 0.3921150
[2,] 0.6623718 0.3979066
[3,] 0.6505459 0.3967739
[4,] 0.6825985 0.4103681

V2.(Intercept) V1.(Intercept)
 0.6722626 0.3992909

 user system elapsed
 0.812 0.080 189.744

We got an almost 6-fold improvement! And note that we used only 4 cores; this is the superlinear effect that I discuss in my Software Alchemy paper. Just as important, we got essentially the same estimates as from the non-chunked computation. (Please note: The chunked version can be shown to have the same asymptotic variance as the original one; neither is an “approximation” to the other.)

But wait a minute. Didn’t I say that Software Alchemy is intended for i.i.d. data? I did. In the paper I said that the theory could easily be extended to the case of independent but non identically-distributed data, but it would be difficult to state general conditions for this.

Random effects models generally do NOT assume the data are identically distributed. In Art and Katelyn’s paper, for instance, they define indicator variables zij representing that user i rated movie j, and treat these as constants. So, for example, the row counts — numbers of movies rated by each user — are treated as constants.

It’s my contention that such quantities should be treated as random, just like other user behavior. (We haven’t brought in variables such as movie genre here, but if we did, a similar statement would hold.) And once we do that, then voila!, we now have i.i.d. data, and Software Alchemy applies.

In my paper on arXiv,  I also show why it doesn’t matter: The MM estimators are going to turn out essentially indentical, regardless of assuming fixed or random quantities of this type.

Moreover, the advantage of doing things this way goes beyond simply being able to use Software Alchemy.  Those quantities may be of statistical interest in their own right. For instance, consider a study of children within families. In the classical approach, the number of kids in a family in a random effects model would be treated as fixed. But research has shown that family size is negatively correlated with income, and the number of kids should be an important variable — a random variable — in the analysis.

Now, back to MM. The analysis will compute a lot of quantities involving within-user variances and so on. If so, it will help to group together rows or columns of our input data frame, e.g. group by user and group by movie. How might we do this using par tools?

The R function split() comes to mind, but there is an issue of combining the outputs of split(),  which are R lists, from the various cluster nodes. Note in particular that some user, for example, may have data only at one of the nodes. But this is exactly the situation that the partools function addlists() is designed to handle!

That helps, but still we have to worry about overhead incurred by shipping so much data back and forth between the worker nodes and manager node in the cluster. Also, what about memory? Again, we should try (may need) to avoid holding the entire  data set in memory at one time.  The  partools function filesplit() helps in that regard, but this continues to be a computational challenge. It may be that the only good solutions involve C.

A New Method for Statistical Disclosure Limitation, I

The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study.

In this posting, I’ll briefly explain what SDL is, and then describe a new method that Pat Tendick and I are proposing. Our paper is available as arxiv:1510.04406 and R code to implement the method is available on GitHub. See the paper for details.

This is a very difficult problem, one that arguably has not been fully solved, in spite of decades of work by some really sharp people. Some of the common methods are: adding mean-0 noise to each variable; finding pairs of similar records and then swapping their values of the sensitive variables; and (in the case in which all variables are categorical), suppressing cells that contain just 1 or a few cases.

As an example of the noise addition method, consider a patient data set that includes the variables Age and Income. Suppose a nefarious user of the data happens to have external knowledge that James is the oldest patient in the study. The Bad Guy can then issue a query asking for the income of the oldest patient (not mentioning James), thus revealing James’ salary. But if the public version of the data has had noise added, James’s listed income will not be his real one, and he may well not be the oldest listed patient anymore anyway.

Given the  importance of this topic — JSM 2015 had 3 separate sessions devoted to it — it is surprising that rather little public-domain software is available. The only R package I know of is sdcMicro on CRAN (which by the way includes an excellent vignette from which you can learn a lot about SDL).  NISS has the Java-based WebSwap (from whose docs you can also learn about SDL).

Aside from the availability of software, one big concern with many SDL methods is that the multivariate structure of the data may be distorted in the modification process. This is crucial, since most statistical analyses are multivariate in nature, e.g. regression, PCA etc., and thus a major distortion in the multivariate structure can result in seriously misleading estimates.

In the noise addition method, this can be achieved by setting the noise covariance matrix to that of the original data, but for the other methods maintaining the proper multivariate structure is a challenge.

While arguably noise addition works well for data consisting only of continuous variables, and data swapping and cell suppression are often acceptable for the purely-categorical case, the mixed continuous-categorical setting is tough.

Our new method achieves both of the above goals. It (a) is applicable to any kind of data, including the mixed continuous-categorical case, and (b) maintains the correct multivariate structure. Rather counterintuitively, our method achieves (b) while actually treating the variables as (conditionally) independent.

The method has several tuning parameters. In some modern statistical methods, tuning parameters are a real pain, but in SDL, the more tuning parameters the better! The database administrator needs to have as many ways as possible to develop a public form of the database that has both good statistical accuracy and good privacy protection.

As an example, I took some Census data for the year 2000 (5% PUMS), involving programmers in Silicon Valley. In order to simulate an employee database, I sampled 5000 records, keeping the variables WageIncome, Age, Gender, WeeksWorked, MSDegree and PhD. See our paper for details, but here is a quick overview.

First, to see that goal (b) above has been maintained reasonably well, I ran a linear regression analysis, predicting WageIncome from the other variables. I did this twice, once for the original data and once for the modified set, for a given combination of values of the tuning parameters. Here are the estimated regression coefficients:

data Age Gender WeeksWorked MS PhD
orig. 447.2 -9591.7 1286.4 17333.0 21291.3
modif. 466.1 -8423.2 1270.7 18593.9 22161.4


This is not bad. Each pair of coefficients is within one original standard error of the other (not shown). The database administrator could try lots of other combinations of the tuning parameters, and likely get even closer. But what about privacy?

In the original data set, there was exactly one female worker with age under 31:

> p1[p1$sex==2 & p1$phd==1 & p1$age < 31,]
          age sex wkswrkd ms phd wageinc
7997 30.79517 2 52 0 1 100000

Which such workers, if any, are listed in the modified data?

> p1pc  p1pc[p1pc$sex==2 & p1pc$phd==1 &         p1pc$age < 31,]
           age sex wkswrkd ms phd wageinc
12522 30.5725   2      52  0   1   50000

There is only one person listed in the released data of the given description (female, PhD, age under 31). But she is listed as having an income of $50,000 rather than $100,000. In fact, it is a different person, worker number 12522, not 7997. (Of course, ID numbers would be suppressed.)

So what happened to worker number 7997?
> which(rownames(p1p) == 7997)
[1] 3236
> p1p[3236,]
age sex wkswrkd ms phd wageinc
7997 31.9746 1 52 0 1 100000

Ah, she became a man! That certainly hides her. Under another luck of the draw, her record may have become all NA values.

In this way, the database administrator can set up a number of statistical analysis test cases, and a number of records at high risk of identification, and then try various combinations of the tuning parameters in order to obtain a modified data set that achieves a desired balance between statistical fidelity and privacy.


Get every new post delivered to your Inbox.

Join 149 other followers