How About a “Snowdoop” Package?

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages such as rmr being developed to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, SparkR.  Spark is much faster than Hadoop, sometimes dramatically so, due to its strong caching ability and a wider variety of available operations.  Recently distributedR has been released as well, again with the goal of running R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  Below I’ll propose a much simpler alternative, and am very curious to see what people think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though they are coding in R.

Here is a possible alternative:  Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears).

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
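
(If you want to run the example yourself, one simple way to create the two chunk files is shown below; the text is just the sign-on message above, and of course any two small text files would do.)

# write the two chunks to disk; the file names follow the
# basename.chunknumber convention assumed by getwords() below
writeLines(c(
   "R is free software and comes with ABSOLUTELY NO WARRANTY.",
   "You are welcome to redistribute it under certain conditions.",
   "Type 'license()' or 'licence()' for distribution details.",
   "",
   " Natural language support but running in an English locale"),
   "words.1")
writeLines(c(
   "R is a collaborative project with many contributors.",
   "Type 'contributors()' for more information and",
   "'citation()' on how to cite R or R packages in publications.",
   "",
   "Type 'demo()' for some demos, 'help()' for on-line help, or",
   "'help.start()' for an HTML browser interface to help.",
   "Type 'q()' to quit R."),
   "words.2")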

Here is our code:

# give each node in the cluster cls an ID number 
assignids <- function(cls) {    
   clusterApply(cls,1:length(cls), 
      function(i) myid <<- i) 
} 

# each node executes this function 
getwords <- function(basename) { 
   fname <- paste(basename,".",myid,sep="")
   words <- scan(fname,what="") 
   length(words) 
} 

# manager 
wordcount <- function(cls,basename) { 
   assignids(cls) 
   clusterExport(cls,"getwords") 
   counts <- clusterCall(cls,getwords,basename)
   sum(unlist(counts)) 
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83

This couldn’t be simpler.  Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (keep the output of scan() on each node with R’s <<- operator; a variant of getwords() doing this is sketched just after this list)
  • no configuration or platform worries
  • ordinary R programming, no “foreign” concepts

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
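
For instance, a chunked-file creation routine might look roughly like this (a sketch only; the name filesplit, the hypothetical input file and the simple line-based splitting are just one possible convention):

# split a text file into nch chunks, written as basename.1,
# basename.2, ..., with contiguous, roughly equal-size pieces
filesplit <- function(infile,basename,nch) {
   lns <- readLines(infile)
   chunk <- ceiling(seq_along(lns) / (length(lns)/nch))
   for (i in 1:nch)
      writeLines(lns[chunk == i],paste(basename,".",i,sep=""))
}

# example call (allwords.txt hypothetical):  filesplit("allwords.txt","words",2)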

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?

9 thoughts on “How About a “Snowdoop” Package?”

  1. I made a similar observation:
    ME: “isn’t Hadoop just the ‘split-apply-combine’ paradigm?”
    Friend: Yes; however, Hadoop implies there are physical clusters to distribute the split-apply-combine workflow.

    I’ve used parallel and snow before, but only for multiple cores on a single machine; I’ve never used a cluster of separate machines. If it can handle that, then I agree that Snowdoop sounds like it could replace Hadoop. I’m guessing it can, since many of the packages’ functions have ‘cluster’ in their names?

    1. Yes, Hadoop uses a split-apply-combine paradigm, but with the crucial added operation of a sort in between.

      The Snow/parallel packages work fine on clusters, or in principle for disparate machines spread around the world. What I proposed for Snowdoop in that simple example assumes a common file system for the machines, though that could be dealt with too.
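
      (For instance, the only change needed to go from the cores of one machine to several machines is in how the cluster is created; the host names below are of course just placeholders.)

      c2 <- makeCluster(2)                    # 2 workers on the local machine
      c2 <- makeCluster(c("node1","node2"))   # 1 worker on each of 2 machines, launched via ssh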

      Snowdoop is just a simple idea, not fleshed out, so it’s not ready to “replace” anything. But that word “simple” is key — no configuration headaches, no Java, no writing R code for “foreign” concepts, etc., all potentially big advantages.
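
      As a hint of how the usual Hadoop example (a count for each distinct word) might be done in this setting, each node could tabulate its own chunk and the manager could then merge the tables; the grouping that Hadoop's sort phase provides is handled by tapply() on the manager. A rough sketch only (the function names here are made up, and assignids() is as in the post):

      # worker: tabulate the words in this node's chunk
      getwordtable <- function(basename) {
         fname <- paste(basename,".",myid,sep="")
         table(scan(fname,what=""))
      }

      # manager: combine the per-chunk tables, summing the counts per word
      wordtable <- function(cls,basename) {
         assignids(cls)
         clusterExport(cls,"getwordtable")
         tbls <- clusterCall(cls,getwordtable,basename)
         cts <- unlist(tbls)          # named vector; the names are the words
         tapply(cts,names(cts),sum)   # total count for each distinct word
      }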

  2. Great idea! I’ve worked with fairly big datasets in the past and at one point also took a brief look at Hadoop – before I shrugged and turned away again 😉 I think that with a somewhat more fleshed-out solution based on your proposal, the freedom of R programming could very well be leveraged into an implementation of a split-apply/map-reduce paradigm that proves *very* useful for rapid prototyping of the things you would *actually* like to do with your data (as opposed to settling for Hadoop’s way of doing things).

    1. As you say, the biggest advantage is staying in R, as opposed to R interfaces to non-R things such as flatMap() in SparkR.

      I’ll post more examples in the next few days.

  3. I can’t seem to run your code, and I can’t figure out what you are doing. What is the relationship between words.1, words.2, and “words”? I get this error:
    Error in checkForRemoteErrors(lapply(cl, recvResult)) :
    2 nodes produced errors; first error: cannot open the connection
