Statistics: Losing Ground to CS, Losing Image Among Students

The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years.  They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

  • The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
  • Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t We Data Science?”

Good, the ASA is taking action, I thought.  But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics:  Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become.  Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R enthusiast.

CS vs. Statistics

Let’s consider the CS issue first.  Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly.  To many of us, though, this is just  “old wine in new bottles,” with the “wine” being Statistics.  But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps.  I’ve spent most of my career in the Computer Science Dept. at the University of California, Davis, but I began my career in Statistics at that institution.  My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology.  I was one of the seven charter members of the Department of Statistics.   Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature.  With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups.  (A friend who read a draft of this post joked it should be titled “J’accuse”  but of course this is not my intention.)   However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric:  What is poor Statistics to do?

Well then, how did CS come to annex the Stat field?  The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI).  Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data.  No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days.  Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects.  Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another.  Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas.  This is dramatically demonstrated by statements like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics.  ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that.  The problem is not that CS people are doing Statistics, but rather that they are doing it poorly:  Generally the quality of CS work in Stat is weak.  It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented.  Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:

  • CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals.  This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work.  The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
  • Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is heavy pressure to bring in lots of research funding, and produce lots of PhD students.  Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand.  This is made even worse by the rapid change in the fashionable research topic du jour, making it difficult to go into a topic in any real depth.  Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
  • There is rampant “reinventing the wheel.”  The above-mentioned  lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature.  This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of.  For instance, consider a paper on the use of mixed labeled and unlabeled training data in classification.  (I’ll omit names.)   One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
  • Again for historical reasons, CS research is largely empirical/experimental in nature.  This causes what in my view is one of the most serious problems plaguing CS research in Stat–lack of rigor.  Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine.  But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on.  For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; the paper really stressed this point, yet actually, one can add quadratic terms and so on to model this.
  • This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.  Most empirical work in CS doesn’t have any models to worry about.  That’s entirely  appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work.  A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments  in the world.  She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution.  She couldn’t answer–she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
  • Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword.   On the one hand, this is a  huge plus, leading to highly impressive feats such as recognizing faces in a crowd.  But this mentality leads to  an oversimplified view of things,  with everything being viewed as a paradigm shift.  Neural networks epitomize this problem.  Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification.   (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.)  Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem–not because fundamentally new technology has been invented.

All this matters–a LOT.  In my opinion, the above factors result in highly lamentable opportunity costs.   Clearly, I’m not saying that people in CS should stay out of Stat research.  But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them.   This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

Making Statistics Attractive to Students

This of course is an age-old problem in Stat.  Let’s face it–the very word statistics sounds hopelessly dull.  But I would argue that a more modern development is making the problem a lot worse–the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat.  He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.”  That says it all, doesn’t it?  And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students.  No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics.  It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter.  A typical example is that a student complained to me that even though he had attended a top-quality high school in the heart of Silicon Valley, his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s², the sample variance.  But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on.  AP courses are ostensibly college level, but the students are not getting college-level instruction.  The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle.  The machines are expensive, and after all we are living in an age in which R is free!  Moreover, the calculators lack the dazzling graphics and the ability to analyze nontrivial data sets that R provides–exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually  can be fixed reasonably simply.  If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program.   Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and multiplatform, with outstanding graphical capabilities.  There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R and Peter Dalgaard’s Introductory Statistics with R.  But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry.  Examples of interest to high schoolers should be used, say this engaging analysis of OkCupid data.

This is not a complete solution by any means.  There still is the issue of AP Stat being taught by people who lack depth in the field, and so on.  And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right?  Switching to R would be doable–and should be done.

A Matrix Powers Package, and Some General Edifying Material on R

Here I will introduce matpow, a package to flexibly and conveniently compute matrix powers.  But even if you are not interested in matrices, I think many of you will find that this post contains much general material on R that you’ll find useful.  Indeed, most of this post will be about general R issues, not so much about matrices per se.  So, bear with me.

Why matrix powers?  

Sadly, most university beginning linear algebra courses say very little about why the material is important.  Worse, if some motivation is presented, it tends to be physics-oriented.  Indeed, the current trend in U.S. universities is to fold the introductory linear algebra curriculum into the differential equations course.

Most readers of this blog know better, as they are (to various extents) aware of the major roles that matrices play in statistics, e.g. in linear models and principal components analysis.  But did you know that matrices play a major role in the analysis of social networks?  We’re not in Physicsland anymore, Toto.

Matrix powers are useful here.  For instance, suppose we wish to determine whether a network (or graph) is connected, meaning that every node leads to every other node in one or more steps.  (The famous Six Degrees of Separation notion comes from this context.)  It turns out that this can be determined by taking powers of the adjacency matrix A of the network, whose (i,j) element is 1 if there is a one-step link from i to j, 0 otherwise.  (For technical reasons, we will set the diagonal of A to all 1s.)  If some power of A has all its elements nonzero, then the graph is connected; it’s disconnected if this situation is never reached.  (One need calculate only up to the power n-1, where n is the number of rows and columns of the matrix.)  Moreover, eigenanalysis of A yields other important information about the network, and one way to compute eigenvectors of a matrix involves finding its powers too.
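
To make this concrete, here is a plain-R sketch of the connectivity check, without matpow (the function name and details are mine, just for illustration; matpow handles this more flexibly, as we’ll see):

isconnected <- function(a) {
   diag(a) <- 1                       # set the diagonal to all 1s, as discussed
   n <- nrow(a)
   prd <- a                           # current power of a, starting with power 1
   for (r in 1:(n-1)) {
      if (all(prd > 0)) return(TRUE)  # every node now reaches every other node
      prd <- prd %*% a                # move on to the next power
   }
   FALSE                              # zeros remain through power n-1: disconnected
}

Applied to the small network used in the examples later in this post, it returns TRUE, agreeing with matpow.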

But why have a package to compute the powers?

Isn’t this already straightforward in R?  For instance, the third power of a matrix m is computed via the code

m %*% m %*% m

and higher powers can be coded with a simple loop.  But we needed to be much more flexible and versatile than this in developing matpow.

1.   Generality:   Our matpow package accommodates various classes of matrices.

There are many of these in the R world.  In addition to the basic “matrix” class, there is also the “Matrix” class (from the Matrix package), the “big.matrix” class (from bigmemory) and so on.  Syntax can vary; with bigmemory, for instance, one must use brackets in assignment, e.g.

m1[,]

And most important, not all of these use %*% as their multiplication operator.  For example, in gputools one uses the function gpuMatMult() for this purpose.

2.  Accommodation of parallel computation:  Today’s matrices can be huge, so parallel computation would be nice.  Our matpow package does not itself include facilities for parallel computation, but it is designed to integrate well with external parallel methods, as mentioned above for gputools, an R package that interfaces to GPUs.

3.  Capability to include callback functions, to perform application-specific operations after each new power is calculated.

For instance, consider the social network example again.  As we compute higher and higher powers of A, we want to check for an “all non zeroes” condition after each iteration; if the condition is found to hold, there is no point in doing any more iterations.  We can have our callback function check for this.

And there is more.  The (i,j) element in the rth power of A can be shown to be the number of ways to go from node i to node j in r steps. If that element is now positive but was 0 in power r-1 of A, we now know that the shortest path from i to j consists of r steps.  So, we can have our callback function watch for changes from zero to nonzero, and record them, and thus have the overall internode distance matrix in the end.
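
Continuing the sketch, here is how that distance idea could be coded directly in plain R (again just an illustration of the math; the matpow callback does this for you via the mindist option):

shortestdists <- function(a) {
   diag(a) <- 1
   n <- nrow(a)
   dists <- matrix(NA,n,n)
   dists[a > 0] <- 1                       # reachable in one step (diagonal included)
   prd <- a
   r <- 1
   while (any(is.na(dists)) && r < n-1) {
      prd <- prd %*% a                     # next power of a
      r <- r + 1
      dists[prd > 0 & is.na(dists)] <- r   # first power at which i reaches j
   }
   dists                                   # any remaining NAs mark unreachable pairs
}

On the example network below, this reproduces the ev$dists matrix shown in the matpow output.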

A few quick examples:

Here is a quick tour, including a look at the squaring option.  With squaring, we square the input matrix, then square the result and so on, thus requiring only a logarithmic number of steps to reach our desired power.  This is much faster if we don’t need the intermediate powers.

> m <- rbind(1:2,3:4)
> ev <- matpow(m,16)  # find m to 16th power
> ev$prod1  # here it is
             [,1]        [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i # last iteration was 15th
[1] 15
> ev <- matpow(m,16,squaring=TRUE)
> ev$prod1  # same result
             [,1]        [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i  # but with only 4 iterations
[1] 4

# test network connectivity
> m <-    rbind(c(1,0,0,1),c(1,0,1,1),c(0,1,0,0),c(0,0,1,1))
> ev <- matpow(m,callback=cgraph,mindist=T)
> ev$connected
[1] TRUE
> ev$dists
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    1    1    1    1
[3,]    2    1    1    2
[4,]    3    2    1    1

 

How can we deal with different matrix classes/multiplication syntaxes?

As noted, one problem we needed to deal with in developing matpow was how to accommodate diverse matrix classes and multiplication syntaxes.  We solved that problem by using R’s eval() and parse() functions.  Consider the following toy example:

> x <- 28
> s <- "x <- 16"
> eval(parse(text=s))
> x
[1] 16

Of course, here it is just a very roundabout way to set x to 16.   But for matpow, it’s just what we need.  We want our multiplication command in string form to be something like prod <- m %*% m in the “matrix” case,  prod[,] <- m[,] %*% m[,] in the “big.matrix” (bigmemory) case, and so on.

The (built-in or user-supplied) argument genmulcmd (“generate multiplication command”) to matpow() handles this.  For instance, here is the built-in one for gputools:

 genmulcmd.gputools <- function (a, b, c) 
   paste(c, " <- gpuMatMult(", a, ",", b, ")")

We then can plug the string into eval() and parse().  In this manner, we can switch matrix types by changing not even a full line of code, just an argument to matpow().
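
For concreteness, here is what such a command generator might look like for the ordinary “matrix” class, together with how the resulting string gets executed (a sketch; the one actually built into matpow may differ in detail):

# hypothetical generator for the plain "matrix" class: build "c <- a %*% b"
genmulcmd.matrix <- function(a, b, c)
   paste(c, "<-", a, "%*%", b)

m <- rbind(1:2,3:4)
cmd <- genmulcmd.matrix("m","m","prd")
cmd                      # the string "prd <- m %*% m"
eval(parse(text=cmd))    # executes it; prd now holds the square of m
prd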

How can we implement callbacks?

Recall our example matpow() call:

matpow(m,callback=cgraph,mindist=T)

Here we are specifying that there is a callback function (the default is NULL), and it is to be cgraph().  This happens to be built-in for the package, but it could be user-written.

The key issue is how data will be communicated between matpow() and the callback–in BOTH directions, i.e. we need the callback to write information back to matpow().   This is not easy in R, which as a functional language tries to avoid side effects.

Consider for example this:

> x <- sample(1:20,8)
> x
[1] 5 12 9 3 2 6 16 18
> sort(x)
[1] 2 3 5 6 9 12 16 18
> x
[1] 5 12 9 3 2 6 16 18

The point is that x didn’t change, i.e. there was no side effect on the argument x.  If we do want x to change, we must write

> x <- sort(x)

But environments are different.  The matpow() function maintains an environment ev, which it passes to the callback as an argument.  Though ev is like an R list, and its components are accessed via the familiar $ operator, the difference is that the callback can change those components (a side effect), thus communicating information back to matpow().

For instance, look at this code from cgraph():

if (all(prd > 0)) { 
   ev$connected <- TRUE 
   ev$stop <- TRUE 
}

We’ve discovered that the network is connected, so we set the proper two components of ev.    Back in matpow(), the code will sense that ev$stop is TRUE, and cease the iteration process.  Since matpow() uses ev as its return value, the user then can see that the network is connected.
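
If the environment mechanism is new to you, here is a tiny standalone illustration of the side effect at work (nothing matpow-specific here):

f <- function(e) e$x <- 99   # modify a component of the environment passed in
ev2 <- new.env()
ev2$x <- 1
f(ev2)
ev2$x                        # now 99; the change persisted, unlike with an ordinary argument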

Other issues:

Again, keep in mind that matpow() will use whatever form of matrix multiplication you give it.  If you give it a parallel form of multiplication, then matpow()’s computation will be parallel.  Or if your R is configured with a multicore-capable BLAS, such as OpenBLAS, you again will be computing in parallel.

Though we have gotten good performance in this context with gputools, it should be noted that gpuMatMult() returns the product to the R caller.  This causes unnecessary copying, which on GPUs can sap speed.  Code that maintains the product on the GPU from iteration to iteration would be best.

Where to get matpow():

We plan submission to CRAN, but for now download it from here. Unzip the package, then run R CMD INSTALL on the directory that is created as a result of unzipping.

New freqparcoord Example

In my JSM talk this morning, I spoke about work done by Yingkang Xie and myself, on a novel approach to the parallel coordinates method of visualization.  I’ve made several posts to this blog in the past on freqparcoord, our implementation of our method.

My talk this morning used some recently-available NYC taxi data.  You may find the discoveries made on this data by freqparcoord of interest.  See my slides from the talk.

Code Snippet: Extracting a Subsample from a Large File

Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.”  With a very large file, it is often much easier and faster–and really, just as good–to just work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply.  And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible.  In other words, functions like read.table() are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea.  (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, due to lack of knowledge of the total number of records, I’m also assuming that extracting every kth record is sufficiently “random” for you.

So, here is the code (downloadable from here):

subsamfile <- function(infile,outfile,k,header=T) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {  # copy the header line through unchanged
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum <- 0  # records read so far
   numout <- 0  # records written so far
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) { # end of file?
         close(ci)
         close(co)
         return(numout)
      }
      recnum <- recnum + 1
      if (recnum %% k == 0) {  # keep every kth record
         numout <- numout + 1
         writeLines(inrec,co)
      }
   }
}

Very straightforward code.  We use file() to open the input and output files, and read in the input file one line at a time, by specifying the argument n = 1 in each call to readLines().  Each inputted record is a character string.  To sense the end-of-file condition on the input file, we test whether the input record has length 0.  (Any record, even an empty one, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting n = 1.)

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()).  This may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like cullrecs <- sample(1:ntotrecs,m,replace=FALSE) where m is the desired number of records to extract, and then whenever recnum matches the next element of cullrecs, we’d write that record to outfile.
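
For those who want it spelled out, here is a rough sketch of that alternative, assuming ntotrecs has already been obtained, say from wc -l (the function name and details are mine, for illustration):

subsamrand <- function(infile,outfile,ntotrecs,m,header=T) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) writeLines(readLines(ci,n=1),co)
   cullrecs <- sort(sample(1:ntotrecs,m,replace=FALSE))  # records to extract
   recnum <- 0
   while (length(cullrecs) > 0) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) break   # end of file
      recnum <- recnum + 1
      if (recnum == cullrecs[1]) {    # this is one of the chosen records
         writeLines(inrec,co)
         cullrecs <- cullrecs[-1]
      }
   }
   close(ci)
   close(co)
   m - length(cullrecs)               # number of records actually written
}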

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.

A Handy Trick for Remote Graphics

I often create plots that require quite a bit of computation.  Ideally I would run this on what I’ll call Machine A, which is a very fast machine, but I am often far away, on Machine B.  So, I’d like to run my computation on A but display the result where I am, at B.

For the platforms I use (Linux, Mac), I can simply use X11 forwarding, by typing ssh -Y at B to log into A.  You may not have that capability, so I’ll give you a simple method here to accomplish the same thing without X11.  And as a bonus, if you’ve never used R sockets, you can learn something about them along the way.  (You will need the ability to run a server at A, i.e. to accept an incoming connection there.  But just try the commands below to check whether you can do this.)  Here’s what to do:

1.  In an R session at A, run


> con <- socketConnection(port=11000,server=T,
blocking=T,open="a+b",timeout=600)

This starts the server, but the above command will hang until we run a corresponding command at B:


> con <- socketConnection(host=quoted_address_of_A,port=11000,
server=F,blocking=T,open="a+b",timeout=600)

At this point you will have your R prompts at both A and B, so you’re all set.

Now run your computation at A. For instance, I might use my freqparcoord package (see several of my earlier blog postings), which returns the graph in a ggplot2 object. I feed that object into R’s serialize() function, which encodes it and then sends it through my socket con:


> serialize(freqparcoord(tx50,10,c(6,9,11:14)),con)

Then at B, I receive it:


> unserialize(con)

That reads the message from con, decodes it, and prints the result. The latter action results in the screen display.
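
By the way, nothing here is specific to freqparcoord; any R object can be shipped this way.  For instance, at A I could send a regression summary,

> serialize(summary(lm(mpg ~ wt,data=mtcars)),con)

and at B the same unserialize(con) call would print it.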

You can keep doing this as long as you like. Note that I did set a 10-minute timeout period, so if you will have long thoughts, frequent phone interruptions etc., you may wish to set a longer period. You can close the connection by typing


> close(con)

Rth: a Flexible Parallel Computation Package for R

I’ve been mentioning here that I’ll be discussing a new package, Rth, developed by me and Drew Schmidt, the latter of pbdR fame.  It’s now ready for use!  In this post, I’ll explain what goals Rth has, and how to use it.

Platform Flexibility

The key feature of Rth is in the word flexible in the title of this post, which refers to the fact that Rth can be used on two different kinds of platforms for parallel computation:  multicore systems and Graphics Processing Units (GPUs).  You all know about the former–it’s hard to buy a PC these days that is not at least dual-core–and many of you know about the latter.  If your PC or laptop has a somewhat high-end graphics card, this enables extremely fast computation on certain kinds of problems.  So, whether you have, say, a quad-core PC or a good NVIDIA graphics card, you can run Rth for fast computation, again for certain types of applications.  And both multicore and GPUs are available in the Amazon EC2 cloud service.

Rth Quick Start

Our Rth home page tells you the GitHub site at which you can obtain the package, and how to install it.  (We plan to place it on CRAN later on.)  Usage is simple, as in this example:

> library(Rth)
Loading required package: Rcpp
> x <- runif(10)
> x
[1] 0.21832266 0.04970642 0.39759941 0.27867082 0.01540710 0.15906994
[7] 0.65361604 0.95695404 0.25700848 0.94633625
> sort(x)
[1] 0.01540710 0.04970642 0.15906994 0.21832266 0.25700848 0.27867082
[7] 0.39759941 0.65361604 0.94633625 0.95695404
> rthsort(x)
[1] 0.01540710 0.04970642 0.15906994 0.21832266 0.25700848 0.27867082
[7] 0.39759941 0.65361604 0.94633625 0.95695404

Performance

So, let’s see how fast we can sort 50000000 U(0,1) numbers, i.e. after setting x <- runif(50000000).  We’ll try R’s built-in sort (with the default method, Quicksort), and then try Rth with 2 cores and then 4.

> system.time(sort(x))
   user  system elapsed
 18.866   0.209  19.144
> system.time(rthsort(x,nthreads=2))
   user  system elapsed
  5.763   0.739   3.949
> system.time(rthsort(x,nthreads=4))
   user  system elapsed
  8.798   1.114   3.724

I ran this on a 32-core machine, so I could have tried even more threads, though typically one reaches a point at which increasing the number of cores actually slows things down.

The cognoscenti out there will notice immediately that we obtained a speedup of far more than 2 while using only 2 cores.  This obviously is due to use of different algorithms.  In this instance, the difference arises from a different sorting algorithm being used in Thrust, a software system on top of which Rth runs.  (See the Rth home page for details on Thrust.)

Rth is an example of what I call Pretty Good Parallelism (an allusion to Pretty Good Privacy).   For certain applications it can get you good speedup on two different kinds of common platforms (multicore, GPU). Like most parallel computation systems, it works best on very regular, “embarrassingly parallel” problems.  For very irregular, complex apps, one may need to resort to very detailed C code to get a good speedup.

Platforms

As mentioned, the code runs on top of Thrust, which runs on Linux, Mac and Windows OSs. Also, it uses Rcpp, which is cross-platform as well.

In other words, Rth should run under all three OSs.  However, so far it has been tested only on Linux and Mac platforms.  It should work fine on Windows, but neither of us has ready access to such a machine, so it hasn’t been tested there yet.  

Necessary Programming Background

As seen above, the Rth functions are just R code, hence usable by anyone familiar with R.  No knowledge of Thrust, C++, GPU etc. is required.

However, you may wish to write your own Rth functions.  In fact, we hope you can contribute to the package!  For this you need a good knowledge of C++, which is what Thrust is written in.  

What Functions Are Available, And What Might Be Available?

Currently the really fast operations available in Rth are:  sort/order/rank; distance computation; histograms; and contingency tables.  These can be used as foundations for developing other functions.  For example, the parallel distance computation can be used to write code for parallel k-means clustering, or for kernel-based nonparametric multivariate density estimation.  Some planned new functions are listed on the home page.

Conclusion

Give Rth a try!  Let us know about your experiences with it, and again, code contributions would be highly welcome.

I plan to devote some of my future blog posts here to other topics in parallel computation.  Much of the material will come from my forthcoming book, Parallel Computation for Data Science.

R beats Python! R beats Julia! Anyone else wanna challenge R?

Before I left for China a few weeks ago, I said my next post would be on our Rth parallel R package. It’s not quite ready yet, so today I’ll post one of the topics I spoke on last night at the Berkeley R Language Beginners Study Group. Thanks to the group for inviting me, and thanks to Allan Miller for suggesting I address this topic.

A couple of years ago, the Julia language was announced, and by releasing some rather unfair timing comparisons, the Julia group offended some in the R community. Later, some in the Python world decided that the right tool for data science ought to be Python (supplemented by NumPy etc.). Claims started appearing on the Web that R’s king-of-the-hill status in data science would soon evaporate, with R being replaced by one of these other languages, if not something else.

I chose the lighthearted title of this post as a hint that I am not doctrinaire on this topic. R is great, but if something else comes along that’s better, I’ll welcome it. But the fact is that I don’t see that happening, as I will explain in this post.

Actually, I’m a big fan of Python. I’ve been using it (and teaching with it) for years. It’s exceptionally clean and elegant, so much nicer than Perl. And for those who feel that object-orientation is a necessity (I’m not such a person), Python’s OOP structures are again clean and elegant. I’ve got a series of tutorials on the language, if you are seeking a quick, painless introduction.

I know less about Julia, but what I’ve seen looks impressive, and the fact that prominent statistician and R expert Doug Bates has embraced it should carry significant weight with anyone.

Nevertheless, I don’t believe that Python or Julia will become “the new R” anytime soon, or ever. Here’s why:

First, R is written by statisticians, for statisticians.

It matters. An Argentinian chef, say, who wants to make Japanese sushi may get all the ingredients right, but likely it just won’t work out quite the same. Similarly, a Pythonista could certainly cook up some code for some statistical procedure by reading a statistics book, but it wouldn’t be quite the same. It would likely be missing some things of interest to the practicing statistician. And R is Statistically Correct.

For the same reason, I don’t see Python or Julia building up a huge code repository comparable to CRAN. Not only does R have a gigantic head start, but also there is the point that statistics simply is not Python’s or Julia’s central mission; the incentives to get that big in data science just aren’t there, I believe.

(This is not to say that CRAN doesn’t need improvement. It needs much better indexing, and maybe even a Yelp-style consumer review facility.)

Now, what about the speed issue? As mentioned, the speed comparisons with R (and with other languages) offered by the Julia people were widely regarded as unfair, as they did not take advantage of R’s speedy vectorization features. Let’s take a look at another example that has been presented in the R-vs.-Julia debate.

Last year I attended a talk in our Bay Area R Users Group, given by a highly insightful husband/wife team. Their main example was simulation of a random walk.

In their trial run, Julia was much faster than R. But I objected, because a random walk is just a cumulative sum of steps.  Thus one can generate the entire process in R with two vectorized calls, one to generate the steps and then a call to cumsum(), e.g.

> rw <- function(nsteps) {
+    steps <- sample(c(-1,1),nsteps,
+    replace=TRUE)
+    cumsum(steps)
+ }
> rw(100)
  [1]  1  2  3  2  3  2  1  0  1  0 -1 -2 -1  0  1  0 -1  0 -1 -2 -3 -2 -1  0  1
 [26]  0  1  2  1  2  3  2  1  2  3  2  1  2  1  0  1  0  1  0  1  2  3  4  5  4
 [51]  3  2  1  0 -1 -2 -1  0 -1  0  1  0  1  0  1  0 -1  0  1  0 -1 -2 -3 -4 -3
 [76] -4 -3 -4 -3 -2 -3 -2 -3 -2 -3 -4 -3 -4 -3 -2 -1  0 -1  0 -1 -2 -1 -2 -1 -2

So for example, in the simulation, at the 76th step we were at position -4.

This vectorized R code turned out to be much faster than the Julia code–more than 1000 times faster, in fact, in the case of simulating 1000000 steps.  For 100000000 steps, Julia actually is much faster than R, but the point is that the claims made about Julia’s speed advantage are really overblown.
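
By the way, one can get a feel for the vectorization effect without leaving R, by timing a naive loop version against rw() above (a sketch; timings will of course vary by machine):

rwloop <- function(nsteps) {       # same walk, but one step per loop iteration
   pos <- numeric(nsteps)
   cur <- 0
   for (i in 1:nsteps) {
      cur <- cur + sample(c(-1,1),1)
      pos[i] <- cur
   }
   pos
}
system.time(rwloop(1000000))
system.time(rw(1000000))           # the vectorized version wins by a wide margin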

For most people, I believe the biggest speed issue is for large data manipulation rather than computation. But recent R packages such as data.table and dplyr take care of that quite efficiently. And for serial computation, Rcpp and its related packages ease C/C++ integration.

Note my qualifier “serial” in that last sentence. For real speed, parallel computation is essential. And I would argue that here R dominates Python and Julia, at least at present.

Python supports threading, the basis of multicore computation. But its type of threading is not actually parallel; only one thread/core can be active at a time. This has been the subject of huge controversy over the years, so Guido van Rossum, inventor of the language, added a multiprocessing module. But it’s rather clunky to use, and my experience with it has not been good. My impression of Julia’s parallel computation facilities so far, admittedly limited, is similar.

R, by contrast, features a rich variety of packages available for parallel computing. (Again, I’ll discuss Rth in my next post.) True, there is also some of that for Python, e.g. interfaces of Python to MPI. But I believe it is fair to say that for parallel computing, R beats Python and Julia.

Finally, in our Bay Area R meeting last week, one speaker made the audacious statement, “R is not a full programming language.” Says who?! As I mentioned earlier, I’m a longtime Python fan, but these days I even do all my non-stat coding in R, for apps that I would have used Python for in the past. For example, over the years I had developed a number of Python scripts to automate the administration of the classes I teach. But last year, when I wanted to make some modifications to them, I decided to rewrite them in R from scratch, so as to make future modifications easier for me.

Every language has its stellar points. I’m told, for example, that for those who do a lot of text processing, Python’s regular expression facilities are more extensive than R’s. The use of one of the R-Python bridge packages may be useful here, and indeed interlanguage connections may become more common as time goes on. But in my view, it’s very unlikely that Python or Julia will become more popular than R among data scientists.

So, take THAT, Python and Julia! :-)
