Partools, Recommender Systems and More

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems.  (See also a recent paper by P.O. Perry.)

The typical example of a recommender system involves movie review data, such as that in Netflix and MovieLens, with various users rating various movies. We have a matrix, with rows representing users and columns for movies, but it is a very sparse matrix, because most users don’t rate most movies. So, we’d like to guess the missing elements, a process sometimes termed the matrix completion problem.

There are many methods for this, such as matrix factorization, as exemplified in the R package NMF. I much prefer the approach taken by Art and Katelyn, as it is more statistical and thus potentially can provide more insight into the underlying processes.

I was inspired by their research to start a project that takes a new approach to random effects models in general. I’ll explain that later, but let’s first address the major computational problems that can occur with random effects models when applied to big data. Note carefully that not only is this a problem with long run times, but also with memory limitations. For example, Pennell and Dunson wrote, in Fitting Semiparametric Random Effects Models to Large Data Sets, Biostatistics, 2007, “SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.”

One of the features of my partools package is what I call Software Alchemy, with the term alluding to the fact that the procedure turns a statistical algorithm into a statistically equivalent version that is easily parallelized. The idea is simple: Break i.i.d. data into chunks, apply the original algorithm to each chunk, then average the results. It’s easy to prove that the statistical accuracy of this method matches that of the original one, and one not only gets a speedup but also may solve the memory problem, since each chunk is smaller. See my paper on this, to appear soon in the Journal of Statistical Software, but also outlined in my book on parallel computation in data science.  (Chunking has been tried in random effects contexts before; see references in the Perry paper above.)

Here is an example of all this, applied to the MovieLens data, 1-million records version. I’ll use Art and Katelyn’s model here, which writes the rating by user i of movie j as

Yij = μ + ai + bj + εij

where μ is an unknown constant and ai + bj + εij are independent, mean-0 variables (not necessarily normal) with variances σa2, σb2 and σε2. One can also add covariates, but we’ll keep things simple here. In addition to estimating the variances, one can estimate the ai and bj, and thus predict the missing Yij. (See the b component in the output of lmer(), to be discussed below. It lists the estimated ai and bj, in that order.)

Art and Katelyn (as well as Perry) use the Method of Moments rather than the standard methods for random effects models, so that (a) they reduce computation, by deriving explicit solutions and (b) are not tied to assumptions of normality. I use MM in my project too, but here I’ll use the famous R package for random effects models, lme4.  I ran the experiments here on a quad-core machine, using an R parallel cluster of that size.

I first ran without chunking:
> system.time(lmerout <- lmer(V3 ~ (1|V1) +  (1|V2),ml1m))
user system elapsed
1105.129 3.692 1108.946
> getME(lmerout,'theta')
V1.(Intercept) V2.(Intercept)
0.3994240 0.6732281

Not good — almost 20 minutes! So, I turned to partools.The main Software Alchemy function is cabase(). For a given statistical algorithm, one needs to specify the function that implements the algorithm and a function to extract the estimated parameters from the output of the algorithm. There is already a wrapper for lm() in the package, so I just modified it for lmer():

 

calmer <- function (cls, lmerargs) 
{
 ovf <- function(u) {
    tmp <- paste("lmer(", lmerargs, ")", collapse = "")
 docmd(tmp)
 }
 estf <- function(lmerout) {
    getME(lmerout,'theta')
 }
 cabase(cls, ovf, estf)
}

I had earlier used partools function distribsplit() to distribute my data frame m1m to the cluster nodes, in chunks of the same name. So, the call to cabase() does the estimation on each chunk, collects the results, and averages them. Here is what happened:

> system.time(print(calmer(cls,'V3 ~ (1|V1) + (1|V2),ml1m')))
$thts
 V2.(Intercept) V1.(Intercept)
[1,] 0.6935344 0.3921150
[2,] 0.6623718 0.3979066
[3,] 0.6505459 0.3967739
[4,] 0.6825985 0.4103681

$tht
V2.(Intercept) V1.(Intercept)
 0.6722626 0.3992909

 user system elapsed
 0.812 0.080 189.744

We got an almost 6-fold improvement! And note that we used only 4 cores; this is the superlinear effect that I discuss in my Software Alchemy paper. Just as important, we got essentially the same estimates as from the non-chunked computation. (Please note: The chunked version can be shown to have the same asymptotic variance as the original one; neither is an “approximation” to the other.)

But wait a minute. Didn’t I say that Software Alchemy is intended for i.i.d. data? I did. In the paper I said that the theory could easily be extended to the case of independent but non identically-distributed data, but it would be difficult to state general conditions for this.

Random effects models generally do NOT assume the data are identically distributed. In Art and Katelyn’s paper, for instance, they define indicator variables zij representing that user i rated movie j, and treat these as constants. So, for example, the row counts — numbers of movies rated by each user — are treated as constants.

It’s my contention that such quantities should be treated as random, just like other user behavior. (We haven’t brought in variables such as movie genre here, but if we did, a similar statement would hold.) And once we do that, then voila!, we now have i.i.d. data, and Software Alchemy applies.

In my paper on arXiv,  I also show why it doesn’t matter: The MM estimators are going to turn out essentially indentical, regardless of assuming fixed or random quantities of this type.

Moreover, the advantage of doing things this way goes beyond simply being able to use Software Alchemy.  Those quantities may be of statistical interest in their own right. For instance, consider a study of children within families. In the classical approach, the number of kids in a family in a random effects model would be treated as fixed. But research has shown that family size is negatively correlated with income, and the number of kids should be an important variable — a random variable — in the analysis.

Now, back to MM. The analysis will compute a lot of quantities involving within-user variances and so on. If so, it will help to group together rows or columns of our input data frame, e.g. group by user and group by movie. How might we do this using par tools?

The R function split() comes to mind, but there is an issue of combining the outputs of split(),  which are R lists, from the various cluster nodes. Note in particular that some user, for example, may have data only at one of the nodes. But this is exactly the situation that the partools function addlists() is designed to handle!

That helps, but still we have to worry about overhead incurred by shipping so much data back and forth between the worker nodes and manager node in the cluster. Also, what about memory? Again, we should try (may need) to avoid holding the entire  data set in memory at one time.  The  partools function filesplit() helps in that regard, but this continues to be a computational challenge. It may be that the only good solutions involve C.

Advertisements

8 thoughts on “Partools, Recommender Systems and More”

  1. Hello Prof. Matloff,

    Great approach both from theoretical and computational aspects to resolve memory intensive issue and speed up the computation significantly by data parallel.

    Would you mind provide a reproducible case for us so that I can try it in my own machine for performance optimizations?

    Thanks

    –Patric

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s