Rth: a Flexible Parallel Computation Package for R

I've been mentioning here that I'll be discussing a new package, Rth, developed by Drew Schmidt (of pbdR fame) and me.  It's now ready for use!  In this post, I'll explain Rth's goals and how to use it.

Platform Flexibility

The key feature of Rth is the word flexible in the title of this post: Rth can be used on two different kinds of platforms for parallel computation, multicore systems and Graphics Processing Units (GPUs).  You all know about the former (it's hard to buy a PC these days that is not at least dual-core), and many of you know about the latter: if your PC or laptop has a somewhat high-end graphics card, it enables extremely fast computation on certain kinds of problems.  So, whether you have, say, a quad-core PC or a good NVIDIA graphics card, you can run Rth for fast computation on certain types of applications.  Both multicore machines and GPUs are available in the Amazon EC2 cloud service.

Rth Quick Start

Our Rth home page tells you the GitHub site at which you can obtain the package, and how to install it.  (We plan to place it on CRAN later on.)  Usage is simple, as in this example:

> library(Rth)
Loading required package: Rcpp
> x <- runif(10)
> x
[1] 0.21832266 0.04970642 0.39759941 0.27867082 0.01540710 0.15906994
[7] 0.65361604 0.95695404 0.25700848 0.94633625
> sort(x)
[1] 0.01540710 0.04970642 0.15906994 0.21832266 0.25700848 0.27867082
[7] 0.39759941 0.65361604 0.94633625 0.95695404
> rthsort(x)
[1] 0.01540710 0.04970642 0.15906994 0.21832266 0.25700848 0.27867082
[7] 0.39759941 0.65361604 0.94633625 0.95695404

Performance

So, let's see how fast we can sort 50,000,000 U(0,1) numbers.  We'll try R's built-in sort (with the default method, Quicksort), and then try Rth with 2 threads and then with 4.
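(The quick-start session above created x with only 10 values; the timings below assume x was regenerated with 50,000,000 values, presumably via something like the following, a step not shown in the original session.)

> x <- runif(50000000)  # 50,000,000 U(0,1) values (step assumed, not shown above)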

> system.time(sort(x))
   user  system elapsed
 18.866   0.209  19.144
> system.time(rthsort(x,nthreads=2))
   user  system elapsed
  5.763   0.739   3.949
> system.time(rthsort(x,nthreads=4))
   user  system elapsed
  8.798   1.114   3.724

I ran this on a 32-core machine, so I could have tried even more threads, though typically one reaches a point at which increasing the number of cores actually slows things down.

The cognoscenti out there will notice immediately that we obtained a speedup of far more than 2 while using only 2 cores.  This obviously is due to the use of different algorithms.  In this instance, the difference arises from the sorting algorithm used in Thrust, the software system on top of which Rth runs.  (See the Rth home page for details on Thrust.)
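(One way to separate the algorithm effect from the parallelism effect, a suggestion of mine rather than a timing from the original runs, is to call rthsort() with a single thread; whatever speedup remains over sort() is then due to Thrust's algorithm alone.)

> system.time(rthsort(x,nthreads=1))  # single-thread run (suggested check, not in the original session)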

Rth is an example of what I call Pretty Good Parallelism (an allusion to Pretty Good Privacy).   For certain applications it can get you good speedup on two different kinds of common platforms (multicore, GPU). Like most parallel computation systems, it works best on very regular, “embarrassingly parallel” problems.  For very irregular, complex apps, one may need to resort to very detailed C code to get a good speedup.

Platforms

As mentioned, the code runs on top of Thrust, which runs on the Linux, Mac and Windows OSs.  Rth also uses Rcpp, which is likewise cross-platform.

In other words, Rth should run under all three OSs.  However, so far it has been tested only on Linux and Mac platforms.  It should work fine on Windows, but neither of us has ready access to such a machine, so it hasn’t been tested there yet.  

Necessary Programming Background

As seen above, the Rth functions are just R code, hence usable by anyone familiar with R.  No knowledge of Thrust, C++, GPU etc. is required.

However, you may wish to write your own Rth functions.  In fact, we hope you can contribute to the package!  For this you need a good knowledge of C++, which is what Thrust is written in.  

What Functions Are Available, And What Might Be Available?

Currently the really fast operations available in Rth are:  sort/order/rank; distance computation; histograms; and contingency tables.  These can be used as foundations for developing other functions.  For example, the parallel distance computation can be used to write code for parallel k-means clustering, or for kernel-based nonparametric multivariate density estimation.  Some planned new functions are listed on the home page.
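For instance, here is a rough sketch, mine rather than part of the package, of how the parallel distance computation might feed a kernel density estimator.  I'm assuming a function rthdist() that behaves like R's dist(), returning all pairwise Euclidean distances; check the Rth home page for the interface actually provided.

# Rough sketch: Gaussian kernel density estimates at each data point,
# built on a parallel pairwise-distance computation.  rthdist() is assumed
# here to behave like R's dist(); the actual Rth interface may differ.
kdens <- function(x, h) {
   x <- as.matrix(x)
   d <- as.matrix(rthdist(x))             # n x n distances, computed in parallel
   w <- exp(-0.5 * (d / h)^2)             # Gaussian kernel weight for each pair
   p <- ncol(x)                           # dimension, for the normalizing constant
   rowMeans(w) / (h^p * (2 * pi)^(p/2))   # estimated density at each data point
}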

Conclusion

Give Rth a try!  Let us know about your experiences with it, and again, code contributions would be highly welcome.

I plan to devote some of my future blog posts here to other topics in parallel computation.  Much of the material will come from my forthcoming book, Parallel Computation for Data Science.

13 thoughts on “Rth: a Flexible Parallel Computation Package for R”

      1. Didn't compile (arch - i386). The issue is already on GitHub =) https://github.com/Rth-org/Rth/issues/2

        Maybe it is something with how the libs are loading?
        g++ -m32 -I"C:/PROGRA~1/R/R-31~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -mtune=core2 -c rs.cpp -o rs.o
        rs.cpp:7:34: fatal error: thrust/device_vector.h: No such file or directory
        compilation terminated.

      2. I’m not sure how to do that. I tried to do so with:

        R CMD INSTALL --configure-args="PKG_CPPFLAGS='${PKG_CPPFLAGS} ../inst/include'" Rth-master

        and

        R CMD INSTALL --configure-args="PKG_CPPFLAGS='-I../inst/include'" Rth-master

        And keep getting the same error.

    1. There are a couple of issues here.

      First, as explained on our Rth home page, there have been odd interactions between two or more of R, Rcpp, Thrust and TBB. Some of these appear to be due to R still not fully supporting long vectors at present, but oddly, they seem to arise especially with thrust::for_each().

      Ironically, one of the advantages of Rth's contingency table function rthtable() over R's table() is that the Rth version allows having more than 2^31 cells; yet rthtable() does use thrust::for_each().

      Second, there would be the issue of how to specify the body of an Rth foreach loop. Doing it in interpreted R would be too slow, and implementing a translator would be fantastic (literally; it's a fantasy right now for a lot of people who write various R packages). It sure would be nice. That leaves having the user write C/C++ code and then linking it to Thrust. Should be doable.

      On our planned functions list in the home page, we mention doing something like rthapply(), but again these issues will arise.

      So, for the time being, what I see happening is slowly adding various specific application functions, rather than general tools.

    1. Very common in the parallel computing world. After a certain point, adding threads becomes counterproductive, as overhead starts to dominate speed gains.

  1. “At a certain point” being three … and so much for massive parallelism?

    🙂

    I just wondered about this case: four cores available, yet four threads don't help, while two did help. Is one thread reserved for the supervisor? CAN you spawn more threads than you have physical cores? What would the performance be for three cores?

    Thanks.

  2. “Too many cooks spoil the broth.” After a certain point, having too many threads means they just get in each other’s way. No, one thread is not being used for the OS; the latter runs after every timeslice ends, and schedules a new thread. One can certainly spawn more threads than cores, and sometimes this can be profitable, as it means the machine will have useful work to do when there is a cache miss or page fault for some thread.
