Snowdoop/partools Update

I’ve put together an updated version of my partools package, including Snowdoop, an alternative to MapReduce algorithms.  You can download it here, version 1.0.1.

To review:  The idea of Snowdoop is to create your own file chunking, rather than having something like Hadoop do it for you, and then using ordinary R coding to perform parallel operations.  This avoids the need to deal with new constructs and complicated configuration issues with Hadoop and R interfaces to it.

Major changes are as follows:

  • There is a k-means clustering example of Snowdoop in the examples/ directory.  Among other things, it illustrates the fact that with the Snowdoop approach, one automatically achieves a “caching” effect lacking in Hadoop, trivially by default.
  • There is a filesort() function, to sort a distributed file, keeping the result in memory in distributed form.  I don’t know yet how efficient it will be relative to Hadoop.
  • There are various new short utility functions, such as filesplit().

Still not on Github yet, but Yihui should be happy that I converted the Snowdoop vignette to use knitr. 🙂

All of this is still preliminary, of course.  It remains to be seen to what scale this approach will work well.

More Snowdoop Coming

In spite of the banter between Yihui and me, I’m glad to hear that he may be interested in Snowdoop, as are some others.  I’m quite busy this week (finishing writing my Parallel Computation for Data Science book, and still have a lot of Fall Quarter grading to do 🙂 ), but you’ll definitely be hearing more from me on Snowdoop and partools, including the following:

  • Vignettes for Snowdoop and the debugging tool.
  • Code snippets for splitting and coalescing files, including dealing with header records.
  • Code snippet implementing a distributed version of subset().

And yes, I’ll likely break down and put it on Github. 🙂  [I’m not old-fashioned, just nuisance-averse. 🙂 ] Watch this space for news, next installment maybe 3-4 days from now.

New Package: partools

I mentioned last week that I would be putting together a package, based in part on my posts on Snowdoop.  I’ve now done so, in a package partools, with the name alluding to the fact that they are intended for use with the cluster-based part of R’s parallel package.  The main ingredients are:

  • Various code snippets to faciltate parallel coding.
  • A debugging tool for parallel coding.
  • The Snowdoop functions I posted earlier.
  • Code for my “Software Alchemy” method.

Still in primitive form, can stand some fleshing out, but please give it a try.  I’ll be submitting to CRAN soon.

Snowdoop, Part II

In my last post, I questioned whether the fancy Big Data processing tools such as Hadoop and Spark are really necessary for us R users.  My argument was that (a) these tools tend to be difficult to install and configure, especially for non-geeks; (b) the tools require learning new computation paradigms and function calls; and (c) one should be able to generally do just as well with plain ol’ R.  I gave a small example of the idea, and promised that more would be forthcoming.  I’ll present one in this posting.

I called my approach Snowdoop for fun, and will stick with that name.  I hastened to explain at the time that although some very short support routines could be turned into a package (see below), Snowdoop is more a concept than a real package.  It’s just a simple idea for attacking problems that are normally handled through Hadoop and the like.

The example I gave last time involved the “Hello World” of Hadoop-dom, a word count.  However, mine simply counted the total number of words in a document, rather than the usual app in which is it reported how many times each individual word appears.  I’ll present the latter case here.

Here is the code:

# each node executes this function 
wordcensus <- function(basename,ndigs) {
 fname <- filechunkname(basename,ndigs)
 words <- scan(fname,what="")
 tapply(words,words,length, simplify=FALSE)

# manager 
fullwordcount <- function(cls,basename,ndigs) {
 counts <- clusterCall(cls,wordcensus,basename,ndigs)
 addlistssum <- function(lst1,lst2)

And here are the library functions:

# give each node in the cluster cls an ID number 
assignids <- function(cls) {
 # note that myid will be global
 function(i) myid <<- i)

# determine the file name for the chunk to be handled by node myid
filechunkname <- function(basename,ndigs) {
 tmp <- basename
 n0s <- ndigs - nchar(as.character(myid))

# "add" 2 lists, applying the operation 'add' to elements in
# common,
# copying non-null others
addlists <- function(lst1,lst2,add) {
 lst <- list()
 for (nm in union(names(lst1),names(lst2))) {
 if (is.null(lst1[[nm]])) lst[[nm]] <- lst2[[nm]] else
 if (is.null(lst2[[nm]])) lst[[nm]] <- lst1[[nm]] else
 lst[[nm]] <- add(lst1[[nm]],lst2[[nm]])

All pure R!  No Java, no configuration.  Indeed, it’s worthwhile comparing to the word count example in sparkr, the R interface to Spark.  There we see calls to sparkr functions such as flatMap(), reduceByKey() and collect().  Well, guess what!  The reduceByKey() function is pretty much the same as R’s tried and true apply().  The collect() function is more or less our Snowdoop library function addlists().  So, again, there is no need to resort to Spark, Hadoop, Java and so on.

And as noted last time, in Snowdoop, we can easily keep objects persistent in memory between cluster calls, like Spark but unlike Hadoop.  Consider k-means clustering, for instance.  Each node would keep its data chunk in memory across the iterations (say using R’s <<- operator upon read-in).  The distance computation at each iteration could be used with CRAN’s pdist library, finding distances from the node’s chunk to the current set of centroids.

Again, while the support routines, e.g. addlists() etc. above, plus a few not shown here, could be collected into a package for convenience, Snowdoop is more a method than a formal package.

So, is there a price to be paid for all this convenience and simplicity?  As noted last time, Snowdoop doesn’t have the fault tolerance redundancy of Hadoop/Spark.  Conceivably there may be a performance penalty in applications in which the Hadoop distributed shuffle-sort is key to the algorithm.  Also, I’m not sure anyone has ever tried R’s parallel library with a cluster of hundreds of nodes or more.

But the convenience factor makes Snowdoop highly attractive.  For example, try plugging “rJava install” into Google, and you’ll see that many people have trouble with this package, which is needed for sparkr (especially if the user doesn’t have root privileges on his machine).