Threading in R?

I was pleased to see today’s post, “(A Very) Experimental Threading in R,” by Lukasz Bartnik, as this is a long-standing interest of mine. My own effort in this direction has been my package Rdsm.

The notion of threading, for those who may not have this background, refers to several instances of a program, in this case, several instances of R, sharing global variables but otherwise running independently. As Bartnik points out, this can make I/O programming easier and clearer; see my Python tutorial, Chapter 4, for a network sockets example, in which the code must deal with situations in which data may come from one of many sockets, but without foreknowledge of which socket will be next. By having a separate process devoted to each socket, but storing the incoming data in a shared variable, the problem is neatly solved (and more conveniently than using nonblocking I/O).

R examples of various sorts are given in the Rdsm package. And as Bartnik also points out, perhaps the most common situation where his or my package might be used is running an R “background job.”

In terms of speedups through parallelization, the results are mixed. See my book on parallel computation for data science. Bartnik’s package conceivably could have lower overhead, making parallelization speedup from threaded R more feasible.

For me, a major piece of unfinished business regarding Rdsm is the use of backing storage, i.e. storing the shared variables on disk. This could help with problems having large memory needs, and may be useful for distributed computation. Rdsm runs on top of bigmemory, which does allow use of backing storage. However, this seems to require a file system that immediately propagates file changes made by one process to visibility by other processes, which didn’t seem to work on ordinary Linux systems, for instance.

As to truly threaded R, my understanding (could need an update) is that the R Core team has vowed “Never!” Too many technical issues.

Advertisements

Using CART: Implementation Matters

In preparing the following example for my forthcoming book, I was startled by how different two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the rpart package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:


> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871

OK, very similar.

Then I tried the well-known Letters Recognition data set from UCI. This too is a classification problem, one class for each of the capital letters ‘A’ through ‘Z’, with 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, rpart did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799

Of course, potential deficiences in CART led the original developers of CART to the notion of random forests, so I gave that a try.

> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).

I speculated that the cause might be that the response variables here (26 of them) are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it is an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.

On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552

Hmm…Not sure what to make of this, but it certainly is thought-provoking.

By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.