More Snowdoop Coming

In spite of the banter between Yihui and me, I’m glad to hear that he may be interested in Snowdoop, as are some others.  I’m quite busy this week (finishing writing my Parallel Computation for Data Science book, and still have a lot of Fall Quarter grading to do :-) ), but you’ll definitely be hearing more from me on Snowdoop and partools, including the following:

  • Vignettes for Snowdoop and the debugging tool.
  • Code snippets for splitting and coalescing files, including dealing with header records.
  • Code snippet implementing a distributed version of subset().

And yes, I’ll likely break down and put it on GitHub. :-)  [I’m not old-fashioned, just nuisance-averse. :-) ] Watch this space for news; the next installment should come maybe 3-4 days from now.

New Package: partools

I mentioned last week that I would be putting together a package, based in part on my posts on Snowdoop.  I’ve now done so, in a package named partools, the name alluding to the fact that its tools are intended for use with the cluster-based part of R’s parallel package.  The main ingredients are:

  • Various code snippets to facilitate parallel coding.
  • A debugging tool for parallel coding.
  • The Snowdoop functions I posted earlier.
  • Code for my “Software Alchemy” method.

The package is still in primitive form and can stand some fleshing out, but please give it a try.  I’ll be submitting it to CRAN soon.

Snowdoop, Part II

In my last post, I questioned whether the fancy Big Data processing tools such as Hadoop and Spark are really necessary for us R users.  My argument was that (a) these tools tend to be difficult to install and configure, especially for non-geeks; (b) the tools require learning new computation paradigms and function calls; and (c) one should be able to generally do just as well with plain ol’ R.  I gave a small example of the idea, and promised that more would be forthcoming.  I’ll present one in this posting.

I called my approach Snowdoop for fun, and will stick with that name.  I hastened to explain at the time that although some very short support routines could be turned into a package (see below), Snowdoop is more a concept than a real package.  It’s just a simple idea for attacking problems that are normally handled through Hadoop and the like.

The example I gave last time involved the “Hello World” of Hadoop-dom, a word count.  However, mine simply counted the total number of words in a document, rather than the usual app, in which it is reported how many times each individual word appears.  I’ll present the latter case here.

Here is the code:


# each node executes this function, tallying the words in its own file chunk
wordcensus <- function(basename,ndigs) {
   fname <- filechunkname(basename,ndigs)
   words <- scan(fname,what="")
   # count the number of times each distinct word appears in this chunk
   tapply(words,words,length,simplify=FALSE)
}

# manager: run the census on each node, then merge the per-chunk counts
fullwordcount <- function(cls,basename,ndigs) {
   assignids(cls)
   clusterExport(cls,"filechunkname")
   clusterExport(cls,"addlists")
   counts <- clusterCall(cls,wordcensus,basename,ndigs)
   # "add" the per-node lists, summing the counts of words in common
   addlistssum <- function(lst1,lst2)
      addlists(lst1,lst2,sum)
   Reduce(addlistssum,counts)
}
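
By the way, here is the rough kind of call sequence I have in mind (the file names are hypothetical; it assumes all the functions in this post are defined at the top level of the manager’s R session, and that the worker processes can read the chunk files x.1 and x.2):

library(parallel)
c2 <- makeCluster(2)
fullwordcount(c2,"x",1)  # ndigs is 1, since the chunk suffixes are 1 digit long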

And here are the library functions:


# give each node in the cluster cls an ID number
assignids <- function(cls) {
   # note that myid will be a global variable on each worker
   clusterApply(cls,1:length(cls),
      function(i) myid <<- i)
}

# determine the file name for the chunk to be handled by node myid,
# e.g. "x.02" for basename "x", ndigs 2, myid 2
filechunkname <- function(basename,ndigs) {
   n0s <- ndigs - nchar(as.character(myid))
   zerostring <- paste(rep('0',n0s),collapse="")
   paste(basename,".",zerostring,myid,sep="")
}

# "add" 2 lists, applying the operation 'add' to elements in
# common,
# copying non-null others
addlists <- function(lst1,lst2,add) {
 lst <- list()
 for (nm in union(names(lst1),names(lst2))) {
 if (is.null(lst1[[nm]])) lst[[nm]] <- lst2[[nm]] else
 if (is.null(lst2[[nm]])) lst[[nm]] <- lst1[[nm]] else
 lst[[nm]] <- add(lst1[[nm]],lst2[[nm]])
 }
 lst
}

All pure R!  No Java, no configuration.  Indeed, it’s worthwhile comparing to the word count example in sparkr, the R interface to Spark.  There we see calls to sparkr functions such as flatMap(), reduceByKey() and collect().  Well, guess what!  The reduceByKey() function is pretty much the same as R’s tried and true tapply().  The collect() function is more or less our Snowdoop library function addlists().  So, again, there is no need to resort to Spark, Hadoop, Java and so on.

And as noted last time, in Snowdoop, we can easily keep objects persistent in memory between cluster calls, like Spark but unlike Hadoop.  Consider k-means clustering, for instance.  Each node would keep its data chunk in memory across the iterations (say by using R’s <<- operator upon read-in).  The distance computation at each iteration could then be done with CRAN’s pdist package, finding the distances from the node’s chunk to the current set of centroids.
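
To make that concrete, here is a rough sketch of what one iteration might look like.  This is just an illustration of the persistence idea, not tested package code; the names kmsetup(), kmiter() and mychunk are ones I’m making up for this example:

# run once: each node reads its own chunk and keeps it in memory via <<-
kmsetup <- function(basename,ndigs) {
   mychunk <<- as.matrix(read.table(filechunkname(basename,ndigs)))
}

# run at each iteration: for this node's chunk, find the sum and count of
# the data points closest to each of the current centroids
kmiter <- function(centroids) {
   require(pdist)
   dsts <- as.matrix(pdist(mychunk,centroids))  # chunk rows x centroids
   closest <- apply(dsts,1,which.min)
   k <- nrow(centroids)
   sums <- matrix(0,nrow=k,ncol=ncol(mychunk))
   for (j in 1:k) {
      rows <- which(closest == j)
      if (length(rows) > 0) sums[j,] <- colSums(mychunk[rows,,drop=FALSE])
   }
   list(sums=sums,cnts=tabulate(closest,nbins=k))
}

# manager side, per iteration, with cls set up as in the word count example:
#    res <- clusterCall(cls,kmiter,centroids)
#    totsums <- Reduce(`+`,lapply(res,function(r) r$sums))
#    totcnts <- Reduce(`+`,lapply(res,function(r) r$cnts))
#    centroids <- totsums / totcnts  # new centroids (ignoring empty clusters)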

Again, while the support routines, e.g. addlists() etc. above, plus a few not shown here, could be collected into a package for convenience, Snowdoop is more a method than a formal package.

So, is there a price to be paid for all this convenience and simplicity?  As noted last time, Snowdoop doesn’t have the fault tolerance redundancy of Hadoop/Spark.  Conceivably there may be a performance penalty in applications in which the Hadoop distributed shuffle-sort is key to the algorithm.  Also, I’m not sure anyone has ever tried R’s parallel library with a cluster of hundreds of nodes or more.

But the convenience factor makes Snowdoop highly attractive.  For example, try plugging “rJava install” into Google, and you’ll see that many people have trouble with this package, which is needed for sparkr (especially if the user doesn’t have root privileges on his machine).

How About a “Snowdoop” Package?

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, sparkr.  Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations.  Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  I’ll propose a much simpler alternative below, and am very curious to see what people think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though they are coding in R.

Here is a possible alternative:  Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears.)

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Here is our code:

 
# give each node in the cluster cls an ID number 
assignids <- function(cls) {    
   clusterApply(cls,1:length(cls), 
      function(i) myid <<- i) 
} 

# each node executes this function 
getwords <- function(basename) { 
   fname <- paste(basename,".",myid,sep="")
   words <- scan(fname,what="") 
   length(words) 
} 

# manager 
wordcount <- function(cls,basename) { 
   assignids(cls) 
   clusterExport(cls,"getwords") 
   counts <- clusterCall(cls,getwords,basename)
   sum(unlist(counts)) 
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83

This couldn’t be simpler.  Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (use the R <<- operator with the output of scan() above)
  • no configuration or platform worries
  • ordinary R programming, no “foreign” concepts

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
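
For instance, a chunked-file creation routine could be as simple as the following sketch (my own quick take, with filesplit() just a name I’m using for illustration); it splits a text file into nch chunks of roughly equal numbers of lines, named in the form expected by getwords() above:

# split the file basename into nch chunks, named basename.1, basename.2, ...,
# with roughly equal numbers of lines in each
filesplit <- function(nch,basename) {
   lines <- readLines(basename)
   grp <- cut(seq_along(lines),nch,labels=FALSE)  # assign each line to a chunk
   for (i in 1:nch)
      writeLines(lines[grp == i],paste(basename,".",i,sep=""))
}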

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?

Count Your BLAS-ings

One nice thing about open-source software is that users often have a lot of choices.  Such is the case with R, for instance the thousands of contributed packages available on CRAN.  My focus here is on BLAS, the core of matrix operations in R, where again there are interesting choices available to users who wish to take advantage of them.

The Basic Linear Algebra Subprograms (BLAS) have been around for many decades, used throughout science and engineering, in various implementations.  A fairly efficient BLAS implementation is included with R, but for those with heavy linear algebra needs, several open-source alternatives are available, such as ATLAS, ACML and OpenBLAS, as well as the commercial Intel MKL.  (Recently Revolution Analytics announced an open version of their R platform that includes the MKL.)

Here I will discuss  OpenBLAS, a library currently attractive to many, due to its open-source nature and ability to make use of the multicore machines that are so common today.   I’ll focus on numerical accuracy.

This should not be considered a detailed tutorial on OpenBLAS, but here is a “hand-waving” overview of its usage and installation. Usage couldn’t be simpler, actually; you just continue business as usual,  with OpenBLAS transparently doing what base-R BLAS has always done for you.   Under more advanced usage, you might try to tweak things by setting the number of cores.

Installation is only a bit more elaborate, if you are comfortable building R from source.  At the configure stage, I ran

configure --prefix=/home/matloff/MyR311 --enable-BLAS-shlib

I then ran the usual make and make install, after which I needed to make libRblas.so a symbolic link to the OpenBLAS library.  Since I’ve been doing timing comparisons for my book, I’ve made shell aliases to run either stock BLAS or OpenBLAS; a more sophisticated approach would have been to use update-alternatives.

No doubt about it, OpenBLAS is fast, and many timing comparisons for R are to be found on the Web, including the Revo link above.  But what about numerical accuracy?  After seeing Mike Hannon’s recent post on R-help, along with Brian Ripley’s reply, I became curious, searched the Web for information on this aspect, and came up empty-handed.  So I will present here the results of some simple experiments of my own.

First, though, a disclaimer:  Although I know the basics of numerical analysis, I am not an expert in any sense, including the sense of being an expert on the various BLASes.  If anyone out there has more to add, that would be highly appreciated.

OpenBLAS derives its speed not just from making use of multiple cores, but also from various tweaks of the code, yielding a very fine degree of optimization.  One can thus envision a development team (which, by the way, took over the old Goto BLAS project) so obsessed with speed that they might cut some corners regarding numerical accuracy.  Thus the latter is a subject of legitimate concern.

For my little test here, I chose to compute eigenvalues, using R’s eigen() function.  I generated p x p unit covariance matrices (1s on the diagonal, ρ everywhere off the diagonal) for my test:

covrho <- function(p,rho) {
 m <- diag(p)
 m[row(m) != col(m)] <- rho
 m
}

I tried this with various values of p and ρ; here I’ll show the results for 2500 and 0.95, respectively.  The machine used has 16 cores, plus a hyperthreading degree of 2; OpenBLAS likely used 32 threads.

With the standard R BLAS, the elapsed time was 57.407 seconds.  Under OpenBLAS, that time was reduced to 12.101 seconds.

But interestingly, the first eigenvalue was found to be 2375.05 in both cases.  (This was the exact value in eout$values[1], where eout was the return value from eigen().)

Changing ρ to 0.995, I got reported principal eigenvalues of 2487.505 in both cases.  (Timings were roughly as before.)

As another example, I also tried finding the matrix inverse for this last matrix, using solve().   Both versions of BLAS gave 199.92 as the [1,1] element of the inverse.  Interestingly, though, there was a wider time discrepancy, 53.955 seconds versus 0.933.
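
For reference, the experiments were essentially of the following form, run once under each BLAS (the timings and values in the comments are the ones reported above):

p <- 2500
m <- covrho(p,0.95)
system.time(eout <- eigen(m))     # elapsed: 57.407 stock BLAS, 12.101 OpenBLAS
eout$values[1]                    # 2375.05 in both cases
m995 <- covrho(p,0.995)
system.time(minv <- solve(m995))  # elapsed: 53.955 versus 0.933
minv[1,1]                         # 199.92 under both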

It is a little odd that the numbers come out with so few decimal places.  I wonder whether R is deliberately doing some rounding, based on estimates of accuracy.  In any event, I would generally caution against looking at too many decimal places, no matter how good the accuracy is, since typically the input data itself is not so accurate.

So, it seems, at first glance, that OpenBLAS is doing fine.  But Brian has an excellent point about the value of sticking with the tried-and-true, in this case meaning R’s default BLAS implementation.

I invite you to try your own accuracy comparisons, and post them here.

Why Are We Still Teaching t-Tests?

My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot.  One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula.  Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.

I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.”   A precise mathematical understanding of the concepts is crucial to good applications.  But stat curricula are not realistic.

I’ll use Student t-tests to illustrate.  (This is material from my open-source book on probability and statistics.)  The t-test is an exemplar of the curricular ills, in three separate senses:

  • Significance testing has long been known to be under-informative at best, and highly misleading at worst.  Yet it is the core of almost any applied stat course.  Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
  • We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution — when we know full well that there is no such animal.  All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family).  [Clarification, added 9/17:  I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem.  Same for regression.  See my book, and the small sketch following this list.]
  • Going hand-in-hand with the t-test is the sample variance.  The classic quantity s^2 is an unbiased estimate of the population variance σ^2, with s^2 defined as 1/(n-1) times the sum of squares of our data relative to the sample mean.  The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n.  Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s^2, is a BIASED (downward) estimate of σ.  So that n-1 factor is much ado about nothing.

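Here, for concreteness, is the sort of CLT-based interval I have in mind in place of the t-based one; this is just a minimal sketch, and cltci() is a name I’m making up here:

# approximate confidence interval (level conf) for a population mean,
# based directly on the Central Limit Theorem, with no normality assumption
cltci <- function(x,conf=0.95) {
   n <- length(x)
   s <- sqrt(mean((x-mean(x))^2))  # divide by n, per the discussion above
   zval <- qnorm(1-(1-conf)/2)
   mean(x) + c(-1,1)*zval*s/sqrt(n)
}
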
Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.

Sadly, the R language largely caters to this old-fashioned, unwarranted thinking.  The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
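
To illustrate that shock:

x <- 1:10               # X uniformly distributed on 1,2,...,10
var(x)                  # 9.166667, due to the 1/(n-1) factor
mean((x-mean(x))^2)     # 8.25, the actual population variance of X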

Much more importantly, R’s statistical procedures are centered far too much on significance testing.  Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf.  Or consider log-linear models:  The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors.  (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
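
As an example of what I mean regarding ks.test(), one can compute such a band oneself, say from the Dvoretzky-Kiefer-Wolfowitz inequality.  Here is a quick sketch, with cdfband() being my own made-up name, not part of R:

# (1-alpha) confidence band for the true cdf, via the
# Dvoretzky-Kiefer-Wolfowitz inequality
cdfband <- function(x,alpha=0.05) {
   n <- length(x)
   eps <- sqrt(log(2/alpha)/(2*n))
   Fhat <- ecdf(x)
   xs <- sort(x)
   data.frame(x=xs,lower=pmax(Fhat(xs)-eps,0),upper=pmin(Fhat(xs)+eps,1))
}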

I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output.  A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made.  To be sure, R in fact is modern in many ways, but there are some problems nevertheless.

In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students.  Well, any thoughtful student can see the folly of claiming the t-test to be “exact.”  And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions.  If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?

Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics.  Unfortunately, I’ve never been able to find it.  But a good cleanup along those lines of the way statistics is taught is long overdue.

Good for TI, Good for Schools, Bad for Kids, Bad for Stat

In my last post, I agreed with Prof. Xiao-Li Meng that Advanced Placement (AP) Statistics courses turn off many students to the statistics field, by being structured in a manner that makes for a boring class.  I cited as one of the problems the fact that the course officially requires TI calculators.  This is a sad waste of resources, as the machines are expensive while R is free, and R is capable of doing things that are much more engaging for kids.

Interestingly, this week the Washington Post ran an article on the monopoly that TI calculators have in the schools.  This was picked up by a Slashdot poster, who connected it to my blog post on AP Stat.  The Post article has some interesting implications.

As the article notes, it’s not just an issue of calculators vs. R.  It’s an issue of calculators in general vs. the TI calculator.  Whether by shrewd business strategy or just luck, TI has attained a structural monopoly.  The textbooks and standardized exams make use of TI calculators, which forces all the teachers to use that particular brand.

Further reinforcing that monopoly are the kickbacks, er, donations to the schools.  When my daughter was in junior high school and was told by the school to buy a TI calculator, I noticed at the store that Casio calculators were both cheaper and had more capabilities.  I asked the teacher about this, and she explained that TI makes donations to the schools.

All this shows why Ms. Chow, the Casio rep quoted in the article, is facing an uphill battle in trying to get schools to use her brand. But there is also something very troubling about Chow’s comment, “That is one thing we do struggle with, teachers worried about how long it is going to take them to learn [Casio products].”  Math teachers would have trouble learning to use a calculator?  MATH teachers?!  I am usually NOT one to bash the U.S. school system, but if many math teachers are this technically challenged, one must question whether they should be teaching math in the first place.  This also goes to the point in my last blog post that kids generally are not getting college-level instruction in the nominally college-level AP Stat courses.

Chow’s comment also relates to my speculation that, if there were a serious proposal to switch from TI to R, the biggest source of resistance would be the AP Stat teachers themselves.  Yet I contend that even they would find that it is easy to learn R to the level needed, meaning being able to do what they currently do on TIs—and to go further, such as analyzing large data sets that engage kids, producing nice color graphics.  This is not hard at all; the teachers don’t need to become programmers.

The Post article also brings up the issue of logistics.  How would teachers give in-class tests in an R-based AP Stat curriculum?  How would the national AP Stat exam handle this?

Those who dismiss using R for AP Stat on such logistical grounds may be shocked to know that the AP Computer Science exam is not conducted with a live programmable computer at hand either. It’s all on paper, with the form of the questions being designed so that a computer is not needed.  (See the sample test here.)  My point is that, if even a test that is specifically about programming can be given without a live computer present, certainly the AP Stat course doesn’t need one either.  For that matter, most questions on the AP Stat exam  concentrate on concepts, not computation, anyway, which is the way it should be.

The teachers should demand a stop to this calculator scam, and demand that the textbooks, AP Stat exam etc. be based on R (or some other free software) rather than on expensive calculators. The kids would benefit, and so would the field of statistics.
