How About a “Snowdoop” Package?

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, SparkR.  Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations.  Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  I’ll propose a much simpler alternative below, and am very curious to see what others think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases on interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once they do get these systems set up, they may be required to design algorithms around world views quite different from R’s, even though they are coding in R.

Here is a possible alternative:  Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and, just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears).

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Here is our code:

 
# give each node in the cluster cls an ID number 
assignids <- function(cls) {    
   clusterApply(cls,1:length(cls), 
      function(i) myid <<- i) 
} 

# each node executes this function 
getwords <- function(basename) { 
   fname <- paste(basename,".",myid,sep="")
   words <- scan(fname,what="") 
   length(words) 
} 

# manager 
wordcount <- function(cls,basename) { 
   assignids(cls) 
   clusterExport(cls,"getwords") 
   counts <- clusterCall(cls,getwords,basename)
   sum(unlist(counts)) 
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83

This couldn’t be simpler.  Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (use the R <<- operator with the output of scan() above; see the sketch after this list)
  • no configuration or platform worries
  • ordinary R programming, no “foreign” concepts
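
To flesh out the second bullet a bit: a worker can read its chunk once, store it in a global via <<-, and then serve any number of later requests from memory.  Here is a rough sketch of what that might look like (the functions cachewords(), countcached() and wordcountcached() are hypothetical names of my own, not part of any existing package):

# each worker reads its chunk of the file once, caching it in a global
cachewords <- function(basename) {
   fname <- paste(basename,".",myid,sep="")
   mywords <<- scan(fname,what="")   # cached in the worker's global environment
   0   # return something small, not the whole vector
}

# later operations use the cached copy, e.g. counting the words
countcached <- function() length(mywords)

# manager
wordcountcached <- function(cls,basename) {
   assignids(cls)
   clusterExport(cls,c("cachewords","countcached"))
   clusterCall(cls,cachewords,basename)       # read and cache, once per worker
   sum(unlist(clusterCall(cls,countcached)))  # reuses the cached data
}

The point is that the second clusterCall() never touches the disk; the data stays resident on the workers between calls.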

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
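
For instance, here is a rough sketch of what a routine for creating a chunked file might look like (filesplit() is a hypothetical name of my own; a real version would want options for header lines, binary formats and so on):

# split the lines of infile into nchunks roughly equal contiguous pieces,
# written to basename.1, basename.2, ..., ready for code like the above
filesplit <- function(infile,basename,nchunks) {
   lns <- readLines(infile)
   grp <- ceiling(seq_along(lns) * nchunks / length(lns))  # chunk number of each line
   for (i in 1:nchunks)
      writeLines(lns[grp == i],paste(basename,".",i,sep=""))
}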

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?

Count Your BLAS-ings

One nice thing about open-source software is that users often have a lot of choices.  Such is the case with R, for instance the thousands of contributed packages available on CRAN.  My focus here is on BLAS, the core of matrix operations in R, where again there are interesting choices available to users who wish to take advantage of them.

The Basic Linear Algebra Subprograms (BLAS) have been around for many decades, used throughout science and engineering, in various implementations.  A fairly efficient BLAS implementation is included with R, but for those with heavy linear algebra needs, several open-source alternatives are available, such as ATLAS, ACML and OpenBLAS, as well as the commercial Intel MKL.  (Recently Revolution Analytics announced an open version of their R platform that includes the MKL.)

Here I will discuss OpenBLAS, a library currently attractive to many, due to its open-source nature and ability to make use of the multicore machines that are so common today.  I’ll focus on numerical accuracy.

This should not be considered a detailed tutorial on OpenBLAS, but here is a “hand-waving” overview of its usage and installation.  Usage couldn’t be simpler, actually; you just continue business as usual, with OpenBLAS transparently doing what the base-R BLAS has always done for you.  For more advanced usage, you might try to tweak things by setting the number of threads.
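
For example, OpenBLAS consults the OPENBLAS_NUM_THREADS environment variable when the library is loaded, which for R means at startup, so the variable must be set in the shell before launching R.  A minimal check (the value 8 below is purely for illustration):

# set in the shell before starting R, e.g.   OPENBLAS_NUM_THREADS=8 R
# then, from within R, confirm what was requested:
Sys.getenv("OPENBLAS_NUM_THREADS")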

Installation is only a bit more elaborate, provided you are comfortable building R from source.  At the configure stage, I ran

configure --prefix=/home/matloff/MyR311 --enable-BLAS-shlib

After running the usual make and make install, I then needed to create a symbolic link from libRblas.so to the OpenBLAS library.  Since I’ve been doing timing comparisons for my book, I’ve made shell aliases to run either the stock BLAS or OpenBLAS; a more sophisticated approach would have been to use update-alternatives.

No doubt about it, OpenBLAS is fast, and many timing comparisons for R are to be found on the Web, including the Revo link above.  But what about numerical accuracy?  After seeing Mike Hannon’s recent post on R-help, along with Brian Ripley’s reply, I became curious and searched the Web for information on this aspect, but came up empty-handed.  So I will present here the results of some simple experiments I’ve done.

First, though, a disclaimer:  Although I know the basics of numerical analysis, I am not an expert in any sense, including the sense of being an expert on the various BLASes.  If anyone out there has more to add, that would be highly appreciated.

OpenBLAS derives its speed not just from making use of multiple cores, but also from various tweaks of the code, yielding a very fine degree of optimization.  One can thus envision a development team (which, by the way, took over the old Goto BLAS project) so obsessed with speed that they might cut some corners regarding numerical accuracy.  Thus the latter is a subject of legitimate concern.

For my little test here, I chose to compute eigenvalues, using R’s eigen() function.  I generated p x p covariance matrices with unit variances (1s on the diagonal, ρ everywhere off the diagonal) for my test:

# p x p matrix with 1s on the diagonal and rho everywhere off the diagonal
covrho <- function(p,rho) {
   m <- diag(p)
   m[row(m) != col(m)] <- rho
   m
}

I tried this with various values of p and ρ; here I’ll show the results for p = 2500 and ρ = 0.95.  The machine used has 16 cores, with a hyperthreading degree of 2; OpenBLAS likely used 32 threads.
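
For concreteness, each timing run took roughly the following form, repeated once under the stock BLAS and once under OpenBLAS (a sketch; the object names are my own):

m <- covrho(2500,0.95)                 # build the test matrix
print(system.time(eout <- eigen(m)))   # elapsed time is the figure quoted below
eout$values[1]                         # principal eigenvalue, for the accuracy check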

With the standard R BLAS, the elapsed time was 57.407 seconds.  Under OpenBLAS, that time was reduced to 12.101 seconds.

But interestingly, the first eigenvalue was found to be 2375.05 in both cases.  (This was the exact value in eout$values[1], where eout was the return value from eigen().)

Changing ρ to 0.995, I got reported principal eigenvalues of 2487.505 in both cases.  (Timings were roughly as before.)
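
There is actually a closed form to check these numbers against: our test matrix is (1-ρ)I + ρJ, with J the all-1s matrix, so its principal eigenvalue is 1 + (p-1)ρ.  Both BLAS versions thus agree with the exact answer, at least to the precision printed:

1 + (2500-1)*0.95    # 2375.05
1 + (2500-1)*0.995   # 2487.505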

As another example, I also tried inverting this last matrix, using solve().  Both versions of the BLAS gave 199.92 as the [1,1] element of the inverse.  Interestingly, though, the time discrepancy was much wider here: 53.955 seconds versus 0.933.
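
Again there is a closed form to compare against: by the Sherman-Morrison formula, the [1,1] element of the inverse is 1/(1-ρ) - ρ/((1-ρ)(1+(p-1)ρ)), which both BLAS versions matched to the precision printed:

p <- 2500; rho <- 0.995
1/(1-rho) - rho/((1-rho)*(1+(p-1)*rho))   # about 199.92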

It is a little odd that the numbers come out with so few decimal places.  I wonder whether R is deliberately doing some rounding, based on estimates of accuracy.  In any event, I would generally caution against looking at too many decimal places, no matter how good the accuracy is, since typically the input data itself is not so accurate.
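
One quick check on that point: R’s print() shows only 7 significant digits by default, so to rule out mere print-time rounding one can ask for more, e.g.

print(eout$values[1],digits=17)   # show (essentially) full double precision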

So, it seems, at first glance, that OpenBLAS is doing fine.  But Brian has an excellent point about the value of sticking with the tried-and-true, in this case meaning R’s default BLAS implementation.

I invite you to try your own accuracy comparisons, and post them here.