Snowdoop, Part II

In my last post, I questioned whether the fancy Big Data processing tools such as Hadoop and Spark are really necessary for us R users.  My argument was that (a) these tools tend to be difficult to install and configure, especially for non-geeks; (b) the tools require learning new computation paradigms and function calls; and (c) one should be able to generally do just as well with plain ol’ R.  I gave a small example of the idea, and promised that more would be forthcoming.  I’ll present one in this posting.

I called my approach Snowdoop for fun, and will stick with that name.  I hastened to explain at the time that although some very short support routines could be turned into a package (see below), Snowdoop is more a concept than a real package.  It’s just a simple idea for attacking problems that are normally handled through Hadoop and the like.

The example I gave last time involved the “Hello World” of Hadoop-dom, a word count.  However, mine simply counted the total number of words in a document, rather than the usual app in which is it reported how many times each individual word appears.  I’ll present the latter case here.

Here is the code:


# each node executes this function 
wordcensus <- function(basename,ndigs) {
 fname <- filechunkname(basename,ndigs)
 words <- scan(fname,what="")
 tapply(words,words,length, simplify=FALSE)
}

# manager 
fullwordcount <- function(cls,basename,ndigs) {
 assignids(cls)
 clusterExport(cls,"filechunkname")
 clusterExport(cls,"addlists")
 counts <- clusterCall(cls,wordcensus,basename,ndigs)
 addlistssum <- function(lst1,lst2)
   addlists(lst1,lst2,sum)
 Reduce(addlistssum,counts)
}

And here are the library functions:


# give each node in the cluster cls an ID number 
assignids <- function(cls) {
 # note that myid will be global
 clusterApply(cls,1:length(cls),
 function(i) myid <<- i)
}

# determine the file name for the chunk to be handled by node myid
filechunkname <- function(basename,ndigs) {
 tmp <- basename
 n0s <- ndigs - nchar(as.character(myid))
 paste(basename,".",rep('0',n0s),myid,sep="",colllapse="")
} 

# "add" 2 lists, applying the operation 'add' to elements in
# common,
# copying non-null others
addlists <- function(lst1,lst2,add) {
 lst <- list()
 for (nm in union(names(lst1),names(lst2))) {
 if (is.null(lst1[[nm]])) lst[[nm]] <- lst2[[nm]] else
 if (is.null(lst2[[nm]])) lst[[nm]] <- lst1[[nm]] else
 lst[[nm]] <- add(lst1[[nm]],lst2[[nm]])
 }
 lst
}

All pure R!  No Java, no configuration.  Indeed, it’s worthwhile comparing to the word count example in sparkr, the R interface to Spark.  There we see calls to sparkr functions such as flatMap(), reduceByKey() and collect().  Well, guess what!  The reduceByKey() function is pretty much the same as R’s tried and true apply().  The collect() function is more or less our Snowdoop library function addlists().  So, again, there is no need to resort to Spark, Hadoop, Java and so on.

And as noted last time, in Snowdoop, we can easily keep objects persistent in memory between cluster calls, like Spark but unlike Hadoop.  Consider k-means clustering, for instance.  Each node would keep its data chunk in memory across the iterations (say using R’s <<- operator upon read-in).  The distance computation at each iteration could be used with CRAN’s pdist library, finding distances from the node’s chunk to the current set of centroids.

Again, while the support routines, e.g. addlists() etc. above, plus a few not shown here, could be collected into a package for convenience, Snowdoop is more a method than a formal package.

So, is there a price to be paid for all this convenience and simplicity?  As noted last time, Snowdoop doesn’t have the fault tolerance redundancy of Hadoop/Spark.  Conceivably there may be a performance penalty in applications in which the Hadoop distributed shuffle-sort is key to the algorithm.  Also, I’m not sure anyone has ever tried R’s parallel library with a cluster of hundreds of nodes or more.

But the convenience factor makes Snowdoop highly attractive.  For example, try plugging “rJava install” into Google, and you’ll see that many people have trouble with this package, which is needed for sparkr (especially if the user doesn’t have root privileges on his machine).

Advertisements

4 thoughts on “Snowdoop, Part II”

    1. Well, as you saw, I went to great lengths to avoid really calling it a “package.” And the main routines I have so far are what I put in the blog post.

      However, I am preparing a package to put on CRAN that contains both the little Snowdoop tools and several other tools not directly related to Snowdoop. Watch this space for details, maybe in a couple of weeks.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s