Back to the BLAS Issue

A few days ago, I wrote here about how some researchers, such Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique — random effects models — for a new, new application — recommender systems. In addition to describing their approach to that problem, I also used this setting as an example of my partools package, in particular the parallel computation method I call Software Alchemy (SA). I illustrated SA on some movie ratings data, and got a nice speedup of the lmer() function in the package lme4. The run time for SA was a little over 3 minutes, compared to nearing 20 minutes without.

However, yesterday Doug Bates, one of the developers of lme4, wrote a comment on my blog, reporting that the run time for the same problem on his desktop machine was about 3 minutes, compared to my 20. This was quite intriguing, and I replied that I would investigate. I also mentioned that I would make a new post out of the matter, as I suspected it would be of general interest. Hence the post you are now reading.

I re-ran my code on the same machine I had used earlier, a PC running Linux in our student lab. I’ll skip the specs here, but suffice it to say it is an ordinary machine, about a year old. At any rate, I got basically the same run times as before.

It then occurred to me that Doug would probably have installed a fancy BLAS — the Basic Linear Algebra Subroutines that run R’s matrix ops — on his machine, whereas our student machine would just be running stock R, and therein lay the source of the big time discrepancy. You may recall my posting here on OpenBLAS, one of the fast alternatives to the BLAS version that comes with R.

So, I switched to a machine on which I had installed OpenBLAS. This machine actually is slower than the other one in terms of clock speed, but it does have 16 cores, and a hyperthreading degree of 2, so that 32 threads might profitably run simultaneously. (The first machine was quadcore with hyperthreading degree 2, but as noted, I didn’t have OpenBLAS installed there.) So, this second machine would make a fine testbed for assessing the performance of lme4 under OpenBLAS.

Sure enough, on this nominally slower machine, the problem ran in only about 8 minutes, not 20. And when I ran the top command from another shell window, I saw via the “% CPU” column that indeed many cores were at work, with the number fluctuating in the range 10-32. Remember, running k cores doesn’t necessarily mean a speedup of k, and often we get much less than that, but you can see that running a good BLAS can work wonders for the speed.

Note that, as mentioned briefly in my last post, SA can achieve superlinear speedup in some instances — MORE than speedup k for k cores. As explained in my book, this occurs when the time complexity for n data points is more than O(n). In my lme4 example, there was about a 6X speedup for just 4 cores.





Partools, Recommender Systems and More

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems.  (See also a recent paper by P.O. Perry.)

The typical example of a recommender system involves movie review data, such as that in Netflix and MovieLens, with various users rating various movies. We have a matrix, with rows representing users and columns for movies, but it is a very sparse matrix, because most users don’t rate most movies. So, we’d like to guess the missing elements, a process sometimes termed the matrix completion problem.

There are many methods for this, such as matrix factorization, as exemplified in the R package NMF. I much prefer the approach taken by Art and Katelyn, as it is more statistical and thus potentially can provide more insight into the underlying processes.

I was inspired by their research to start a project that takes a new approach to random effects models in general. I’ll explain that later, but let’s first address the major computational problems that can occur with random effects models when applied to big data. Note carefully that not only is this a problem with long run times, but also with memory limitations. For example, Pennell and Dunson wrote, in Fitting Semiparametric Random Effects Models to Large Data Sets, Biostatistics, 2007, “SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.”

One of the features of my partools package is what I call Software Alchemy, with the term alluding to the fact that the procedure turns a statistical algorithm into a statistically equivalent version that is easily parallelized. The idea is simple: Break i.i.d. data into chunks, apply the original algorithm to each chunk, then average the results. It’s easy to prove that the statistical accuracy of this method matches that of the original one, and one not only gets a speedup but also may solve the memory problem, since each chunk is smaller. See my paper on this, to appear soon in the Journal of Statistical Software, but also outlined in my book on parallel computation in data science.  (Chunking has been tried in random effects contexts before; see references in the Perry paper above.)

Here is an example of all this, applied to the MovieLens data, 1-million records version. I’ll use Art and Katelyn’s model here, which writes the rating by user i of movie j as

Yij = μ + ai + bj + εij

where μ is an unknown constant and ai + bj + εij are independent, mean-0 variables (not necessarily normal) with variances σa2, σb2 and σε2. One can also add covariates, but we’ll keep things simple here. In addition to estimating the variances, one can estimate the ai and bj, and thus predict the missing Yij. (See the b component in the output of lmer(), to be discussed below. It lists the estimated ai and bj, in that order.)

Art and Katelyn (as well as Perry) use the Method of Moments rather than the standard methods for random effects models, so that (a) they reduce computation, by deriving explicit solutions and (b) are not tied to assumptions of normality. I use MM in my project too, but here I’ll use the famous R package for random effects models, lme4.  I ran the experiments here on a quad-core machine, using an R parallel cluster of that size.

I first ran without chunking:
> system.time(lmerout <- lmer(V3 ~ (1|V1) +  (1|V2),ml1m))
user system elapsed
1105.129 3.692 1108.946
> getME(lmerout,'theta')
V1.(Intercept) V2.(Intercept)
0.3994240 0.6732281

Not good — almost 20 minutes! So, I turned to partools.The main Software Alchemy function is cabase(). For a given statistical algorithm, one needs to specify the function that implements the algorithm and a function to extract the estimated parameters from the output of the algorithm. There is already a wrapper for lm() in the package, so I just modified it for lmer():


calmer <- function (cls, lmerargs) 
 ovf <- function(u) {
    tmp <- paste("lmer(", lmerargs, ")", collapse = "")
 estf <- function(lmerout) {
 cabase(cls, ovf, estf)

I had earlier used partools function distribsplit() to distribute my data frame m1m to the cluster nodes, in chunks of the same name. So, the call to cabase() does the estimation on each chunk, collects the results, and averages them. Here is what happened:

> system.time(print(calmer(cls,'V3 ~ (1|V1) + (1|V2),ml1m')))
 V2.(Intercept) V1.(Intercept)
[1,] 0.6935344 0.3921150
[2,] 0.6623718 0.3979066
[3,] 0.6505459 0.3967739
[4,] 0.6825985 0.4103681

V2.(Intercept) V1.(Intercept)
 0.6722626 0.3992909

 user system elapsed
 0.812 0.080 189.744

We got an almost 6-fold improvement! And note that we used only 4 cores; this is the superlinear effect that I discuss in my Software Alchemy paper. Just as important, we got essentially the same estimates as from the non-chunked computation. (Please note: The chunked version can be shown to have the same asymptotic variance as the original one; neither is an “approximation” to the other.)

But wait a minute. Didn’t I say that Software Alchemy is intended for i.i.d. data? I did. In the paper I said that the theory could easily be extended to the case of independent but non identically-distributed data, but it would be difficult to state general conditions for this.

Random effects models generally do NOT assume the data are identically distributed. In Art and Katelyn’s paper, for instance, they define indicator variables zij representing that user i rated movie j, and treat these as constants. So, for example, the row counts — numbers of movies rated by each user — are treated as constants.

It’s my contention that such quantities should be treated as random, just like other user behavior. (We haven’t brought in variables such as movie genre here, but if we did, a similar statement would hold.) And once we do that, then voila!, we now have i.i.d. data, and Software Alchemy applies.

In my paper on arXiv,  I also show why it doesn’t matter: The MM estimators are going to turn out essentially indentical, regardless of assuming fixed or random quantities of this type.

Moreover, the advantage of doing things this way goes beyond simply being able to use Software Alchemy.  Those quantities may be of statistical interest in their own right. For instance, consider a study of children within families. In the classical approach, the number of kids in a family in a random effects model would be treated as fixed. But research has shown that family size is negatively correlated with income, and the number of kids should be an important variable — a random variable — in the analysis.

Now, back to MM. The analysis will compute a lot of quantities involving within-user variances and so on. If so, it will help to group together rows or columns of our input data frame, e.g. group by user and group by movie. How might we do this using par tools?

The R function split() comes to mind, but there is an issue of combining the outputs of split(),  which are R lists, from the various cluster nodes. Note in particular that some user, for example, may have data only at one of the nodes. But this is exactly the situation that the partools function addlists() is designed to handle!

That helps, but still we have to worry about overhead incurred by shipping so much data back and forth between the worker nodes and manager node in the cluster. Also, what about memory? Again, we should try (may need) to avoid holding the entire  data set in memory at one time.  The  partools function filesplit() helps in that regard, but this continues to be a computational challenge. It may be that the only good solutions involve C.