Partools 1.1.4

Partools 1.1.4 is now on GitHub.

The main change this time is enhancement of the debugging facilities (which work not only for partools but also for the cluster-based portion of R’s parallel package in general). As some of you know, I place huge importance on debugging, so much so that I wrote a book on it (The Art of Debugging with GDB, DDD, and Eclipse, N. Matloff and P. Salzman, NSP, 2008).

But debugging parallel code is hard, especially with the parallel package. The problem is that your R code runs on the cluster nodes without a terminal window attached, so you can’t simply drop into the interactive browser. I’ve had various tools for dealing with that in partools from the beginning, but in the latest version their effectiveness is greatly enhanced by new mechanisms built on R’s dump.frames(). This was Hadley’s idea, for which I am quite grateful. I’ve had a lot of fun using the enhanced debugging tools myself.
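To give the flavor of the approach, here is a minimal sketch in plain base R (my own illustration, assuming localhost PSOCK workers that share the manager’s working directory; partools’ actual mechanism is more elaborate): have each worker dump its call frames to a file when an error occurs, then load the dump on the manager and browse it post-mortem with debugger().

```r
library(parallel)

cls <- makeCluster(2)

# Workers have no terminal, so instead of browser() we arrange for each
# worker to dump its call frames to a PID-tagged .rda file on error.
harness <- function(expr) {
  withCallingHandlers(expr,
    error = function(e)
      dump.frames(paste0("errdump", Sys.getpid()), to.file = TRUE))
}
clusterExport(cls, "harness")

# Run something that fails on the workers.
try(clusterEvalQ(cls, harness(log("abc"))), silent = TRUE)

stopCluster(cls)

# Back on the manager: load one worker's dump and inspect it.
f <- list.files(pattern = "^errdump[0-9]+\\.rda$")[1]
load(f)   # creates an object named like the file, minus .rda
# debugger(get(sub("\\.rda$", "", f)))   # interactive browsing; not run here
```

The key point is that the error context survives the death of the worker’s evaluation, so you can poke around the frames at your leisure on the manager.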

This also inspired me to finally add a debugging vignette to the package, which I had long planned to do but hadn’t gotten around to.

I also thank Gábor Csárdi for cleaning up the DESCRIPTION file.

I have more enhancements to partools in the pipeline. One of them involves k-NN nonparametric regression, using Software Alchemy but in a different way than you might think. Actually, I’ve already done this before, in my freqparcoord package with Yingkang Xie, but I’ll do a little tweaking before adding it here. This too is something I had long planned but hadn’t gotten around to. What moved it up my priority list was a paper I recently ran across by researchers at Stanford and UCB, which establishes nice theoretical properties for a Software Alchemy-type approach to another kind of nonparametric regression estimation, kernel ridge regression.


partools: a Sensible R Package for Large Data Sets

As I mentioned recently, the new, greatly extended version of my partools package is now on CRAN. (The current version on CRAN is 1.1.3, whereas at the time of my previous announcement it was only 1.1.1. Note that Unix is NOT required.)

It is my contention that for most R users who work with large data, partools — or methods like it — is a better, simpler, far more convenient approach than Hadoop and Spark. If you are an R user and, like most Hadoop/Spark users, don’t have a mega cluster (thousands of nodes), partools is a sensible alternative.

I’ll introduce partools usage in this post. I encourage comments (pro or con, here or in private). In particular, for those of you attending the JSM next week, I’d be happy to discuss the package in person, and hear your comments, feature requests and so on.

Why do I refer to partools as “sensible”? Consider:

  • Hadoop and Spark are quite difficult to install and configure, especially for non-computer systems experts. By contrast, partools just uses ordinary R; there is nothing to set up.
  • Spark, currently much favored by many over Hadoop, involves new, complex and abstract programming paradigms, even under the R interface, SparkR. By contrast, again, partools just uses ordinary R.
  • Hadoop and Spark, especially the latter, have excellent fault tolerance features. If you have a cluster consisting of thousands of nodes, the possibility of disk failure must be considered. But otherwise, the fault tolerance of Hadoop and Spark is just slowing down your computation, often radically so. (You could also roll your own fault tolerance, ranging from simple backup to sophisticated systems such as XtreemFS.)

What Hadoop and Spark get right is to base computation on distributed files. Instead of being stored in one monolithic file x, the data is stored in chunks, say x.01, x.02, …, which can greatly reduce network overhead during the computation. The partools package adopts the same philosophy.
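partools’ filesplit() creates such chunked files for you. As an illustration of the idea only (my own sketch, not the package’s code), here is how one might split a headered CSV file into near-equal chunks:

```r
# Sketch: split a headered text file into chunks fname.01, fname.02, ...
# Each chunk gets its own copy of the header line.
chunkfile <- function(fname, nchunks) {
  lines <- readLines(fname)
  hdr <- lines[1]
  body <- lines[-1]
  idx <- parallel::splitIndices(length(body), nchunks)  # near-equal splits
  for (i in seq_len(nchunks))
    writeLines(c(hdr, body[idx[[i]]]), sprintf("%s.%02d", fname, i))
}

# Tiny demo: 10 data rows split into 2 chunks, x.01 and x.02.
writeLines(c("a,b", paste(1:10, 11:20, sep = ",")), "x")
chunkfile("x", 2)
```

Each node then reads only its own chunk, so subsequent distributed computation moves far less data across the network.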

Overview of partools:

  • There is no “magic.” The package merely consists of short, simple utilities that make use of R’s parallel package.
  • The key philosophy is Keep It Distributed (KID). Under KID, one performs many distributed operations, with a collective operation being done occasionally, when needed.
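Here is what KID looks like in bare parallel terms (a sketch using base R’s parallel package, with mtcars standing in for a large data set; partools wraps patterns like this in its utilities):

```r
library(parallel)
cls <- makeCluster(2)

# Distribute: each node gets its own chunk of the data, kept as a
# global data frame 'mychunk' on that node (the KID philosophy).
chunks <- split(mtcars, rep(1:2, length.out = nrow(mtcars)))
invisible(clusterApply(cls, chunks,
  function(ch) assign("mychunk", ch, envir = .GlobalEnv)))

# Distributed operation: each node filters its own chunk, in place.
invisible(clusterEvalQ(cls, mychunk <- mychunk[mychunk$mpg > 20, ]))

# Occasional collective operation: gather the chunks only when needed.
result <- do.call(rbind, clusterEvalQ(cls, mychunk))
stopCluster(cls)
```

Note that the filtered data never leaves the nodes until we explicitly ask for it; that single gathering step is the “occasional” collective operation.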

Sample partools (PT) session (see package vignette for details, including code, output):

  • 16-core machine.
  • Flight delay data, 2008. Distributed file created previously from the monolithic one via PT’s filesplit().
  • Called PT’s fileread(), causing each cluster node to read its chunk of the big file.
  • Called PT’s distribagg() to find the max values of DepDelay, ArrDelay and AirTime: 15.952 seconds, vs. 249.634 for R’s serial aggregate().
  • Interested in Sunday evening flights. Each node performs that filtering operation, assigning the result to a data frame sundayeve. Note that this is a distributed data frame, in keeping with KID.
  • Continue with KID, but if later we want to un-distribute that data frame, we could call PT’s distribgetrows().
  • Performed a linear regression analysis, predicting ArrDelay from DepDelay and Distance, using Software Alchemy, via PT’s calm() function. Took 18.396 seconds, vs. 76.225 for ordinary lm(). (See my new book, Parallel Computation for Data Science, for details on Software Alchemy.)
  • Did a distributed na.omit() on each chunk, using parallel’s clusterEvalQ(). Took 2.352 seconds, compared to the 9.907 it would have needed if not distributed.
  • Performed PCA. Took 8.949 seconds for PT’s caprcomp(), vs. 58.444 for the non-distributed case.
  • Calculated interquartile range for each of 12 variables, taking 2.587 seconds, compared to 29.584 for the non-distributed case.
  • Performed a more elaborate distributed na.omit(), taking 9.293 seconds, compared to 55.032 in the serial case.
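The Software Alchemy idea behind calm() can be sketched in a few lines (my own illustration, assuming equal-sized chunks; see the package for calm() itself, and the book for the theory): fit the model independently on each chunk, then average the per-chunk estimates.

```r
library(parallel)
cls <- makeCluster(2)

# Distribute equal-sized chunks of the data to the nodes.
chunks <- split(mtcars, rep(1:2, length.out = nrow(mtcars)))
invisible(clusterApply(cls, chunks,
  function(ch) assign("mychunk", ch, envir = .GlobalEnv)))

# Software Alchemy: each node fits the regression on its own chunk...
coefs <- clusterEvalQ(cls, coef(lm(mpg ~ wt + hp, data = mychunk)))

# ...and the manager averages the per-chunk coefficient vectors.
avg <- Reduce(`+`, coefs) / length(coefs)
stopCluster(cls)
```

The chunk fits run in parallel with no communication between them; only the short coefficient vectors travel back to the manager, which is why the speedups in the session above are possible.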

Again, see the vignette for details on the above, including how to deal with files that don’t fit into memory, etc.