Threading in R?

I was pleased to see today’s post, “(A Very) Experimental Threading in R,” by Lukasz Bartnik, as this is a long-standing interest of mine. My own effort in this direction has been my package Rdsm.

The notion of threading, for those who may not have this background, refers to several instances of a program, in this case several instances of R, sharing global variables but otherwise running independently. As Bartnik points out, this can make I/O programming easier and clearer; see my Python tutorial, Chapter 4, for a network sockets example, in which the code must handle data arriving from any one of many sockets, without knowing in advance which socket will be next. Devoting a separate thread to each socket, and storing the incoming data in a shared variable, solves the problem neatly (and more conveniently than using nonblocking I/O).
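To make the sockets idea concrete, here is a minimal sketch in Python. It is not the example from the tutorial, and the names (received, reader, run_demo) are made up for illustration. It assumes socket.socketpair() as a stand-in for real network connections, with one thread blocking on each socket and appending to a shared list.

```python
import socket
import threading

received = []                      # shared variable, visible to every thread
received_lock = threading.Lock()   # guards updates to the shared list

def reader(name, sock):
    """Block on a single socket; store its complete message in the shared list."""
    chunks = []
    with sock:
        while True:
            data = sock.recv(1024)
            if not data:           # empty read means the peer closed the connection
                break
            chunks.append(data)
    with received_lock:
        received.append((name, b"".join(chunks).decode()))

def run_demo():
    received.clear()
    # Each socketpair() yields two connected sockets, standing in for a network peer.
    pairs = [socket.socketpair() for _ in range(3)]
    threads = []
    for i, (ours, _) in enumerate(pairs):
        t = threading.Thread(target=reader, args=(f"sock{i}", ours))
        t.start()
        threads.append(t)
    # The peers send in an arbitrary order; no reader needs to know who is next.
    for i in (2, 0, 1):
        theirs = pairs[i][1]
        theirs.sendall(f"hello from {i}".encode())
        theirs.close()
    for t in threads:
        t.join()
    return sorted(received)
```

Each reader simply blocks on its own socket, so there is no polling loop and no need to guess which socket will deliver data next. The lock is there only because several threads update the same list.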

R examples of various sorts are given in the Rdsm package. And as Bartnik also points out, perhaps the most common situation where his or my package might be used is running an R “background job.”

In terms of speedups through parallelization, the results are mixed. See my book on parallel computation for data science. Bartnik’s package conceivably could have lower overhead, making parallelization speedup from threaded R more feasible.

For me, a major piece of unfinished business regarding Rdsm is the use of backing storage, i.e. storing the shared variables on disk. This could help with problems having large memory needs, and may be useful for distributed computation. Rdsm runs on top of bigmemory, which does allow use of backing storage. However, this seems to require a file system that makes changes written by one process immediately visible to other processes, which did not seem to be the case on ordinary Linux systems, for instance.
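For readers unfamiliar with backing storage, the general mechanism is a memory mapping tied to a file. Here is a minimal sketch in Python, using the standard mmap module rather than bigmemory. The function name and file layout are hypothetical, and it assumes a local file system on a typical Linux machine. It runs in a single process, so it does not reproduce the Rdsm setup or the cross-process visibility problem described above.

```python
import mmap
import os
import struct
import tempfile

def backed_variable_demo():
    # Hypothetical backing file holding one 64-bit integer, initially 0.
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        os.write(fd, struct.pack("<q", 0))
        os.close(fd)
        with open(path, "r+b") as f1, open(path, "r+b") as f2:
            # Two independent mappings of the same file (shared by default).
            writer = mmap.mmap(f1.fileno(), 8)
            reader = mmap.mmap(f2.fileno(), 8)
            writer[:8] = struct.pack("<q", 42)            # update via one mapping
            seen = struct.unpack("<q", reader[:8])[0]     # read via the other
            writer.flush()                                # push the change to the file
            writer.close()
            reader.close()
        with open(path, "rb") as f:
            on_disk = struct.unpack("<q", f.read(8))[0]   # value stored in the file
        return seen, on_disk
    finally:
        os.remove(path)
```

The shared variable here lives in the file, and the mappings are just views of it. Whether such updates become visible promptly to other processes, or to other machines on a network file system, is exactly the open question.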

As to truly threaded R, my understanding (could need an update) is that the R Core team has vowed “Never!” Too many technical issues.


4 thoughts on “Threading in R?”

  1. Hey, I’m happy there’s feedback on this work – so thanks for commenting. I’m kind of guessing this will turn out to be much harder than I can anticipate at this point but I thought it’s still worth trying – and maybe there is something useful coming from this effort.

    Anyway, I’ve been trying to find hard evidence on why threads in R are deemed impossible but nothing specific so far. Since you mention R Core Team’s position on this topic – do you know of documents/presentations/sources I could look at to learn more?

    Thanks again for noticing.

    1. I think you could choose any of the team at random and find that he has strong feelings about this subject. 🙂 Maybe someone will see this and comment, but if not, you might start with Luke Tierney, who I seem to recall (maybe not) saying that R will never be threaded. You probably should also read the various discussions of what is and is not “safe” in accessing R variables from within C/C++ code. I suspect that some of those claims are dubious, but I don’t have enough expertise on the internals to say.
