Using CART: Implementation Matters

In preparing the following example for my forthcoming book, I was startled by how differently two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations and only 6 predictor variables, so overfitting should not be a problem. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the rpart package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:

> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871

OK, very similar.

Then I tried the well-known Letter Recognition data set from UCI. This too is a classification problem, with one class for each of the capital letters ‘A’ through ‘Z’ and 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, rpart did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799

Of course, potential deficiencies in CART are what led its original developers to the notion of random forests, so I gave that a try.

> library(randomForest)
> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).

I speculated that the cause might be that the relation between the response variable (26 classes here) and the predictors is non-monotonic. I put that theory to two of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it was an issue. And while it is true that nonmonotonicity should eventually be handled by a predictor being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.
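One can probe that hunch on synthetic data. Here is a rough sketch (the data, seed, and cp values are my own illustrative choices, not anything from the letters analysis): a two-class problem in which class membership alternates as X increases, so that no small number of splits separates the classes. Comparing a tree grown with rpart’s default cp to one grown with a near-zero cp shows how much the default stopping rule, rather than the splitting mechanism itself, holds the tree back.

```r
# Illustrative probe of the nonmonotonicity theory (synthetic data).
library(rpart)
set.seed(9999)
x <- runif(2000)
y <- as.factor(sin(25 * x) > 0)   # class alternates several times over (0,1)
d <- data.frame(x = x, y = y)

# Same model, two stopping regimes: default cp = 0.01 versus a near-zero
# cp that lets the tree keep splitting on x.
fitdef <- rpart(y ~ x, data = d, method = 'class')
fitlow <- rpart(y ~ x, data = d, method = 'class', cp = 0.00005)
mean(predict(fitdef, type = 'class') == d$y)  # training accuracy, default stopping
mean(predict(fitlow, type = 'class') == d$y)  # training accuracy, deep tree
```

If the default tree’s accuracy lags well behind the deep tree’s on data like this, early stopping is a plausible culprit in the letters example too.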

On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552

Hmm… not sure what to make of this, but it certainly is thought-provoking.

By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.


6 thoughts on “Using CART: Implementation Matters”

  1. My recollection is that `rpart` has an excessively high `cp` complexity parameter by default for real-world problems, and that you get better results if you reduce it substantially and cross-validate.

    1. Excellent suggestion. I tried with cp = 0.00005 and got 88% accuracy. This increases my suspicion that the problem is nonmonotonic relations between Y and Xi.

      As I mentioned in my posting, I try to stick to default values in the book; otherwise, I’d have a book for each package! But this case shows the danger in that.
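      For concreteness, the tuning suggested above looks roughly like this (a sketch only: `lr` is assumed to be the letters data frame from the post, and the pruning step is standard rpart practice rather than what was actually run). Grow a large tree with a tiny cp, then prune back to the cp that minimizes the cross-validated error that rpart records in its CP table.

```r
# Grow deep, then prune to the cp minimizing the cross-validated
# error (the xerror column of rpart's internal CP table).
library(rpart)
rplr <- rpart(lettr ~ ., data = lr, method = 'class', cp = 0.00005)
printcp(rplr)  # CP table: rel error and cross-validated xerror per subtree
bestcp <- rplr$cptable[which.min(rplr$cptable[, 'xerror']), 'CP']
rppruned <- prune(rplr, cp = bestcp)
mean(predict(rppruned, type = 'class') == lr$lettr)  # training accuracy of pruned tree
```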

  2. In my experience these algos are all very differently susceptible to over- and under-fitting on a training sample.

    I understand (and share!) your instinct to go with the algo defaults, but the caret package makes it very easy to tune across a grid of parameter values. It’ll even give you a sensible default grid for every tuneable parameter.

    Splitting the data into train and test, tuning the parameter values, and then comparing the different algos on the test set might also be informative about their inherent strengths and weaknesses on these datasets.
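    A minimal sketch of that caret workflow (the fold count and cp grid are illustrative, and `lr` is assumed to be the letters data frame from the post): hold out a test set, cross-validate rpart’s cp over a grid on the training portion, then score the tuned model on the held-out data.

```r
library(caret)
set.seed(2017)

# Hold out 20% of the data as a test set.
idx <- createDataPartition(lr$lettr, p = 0.8, list = FALSE)
trn <- lr[idx, ]
tst <- lr[-idx, ]

# Cross-validate cp over an illustrative grid; caret refits the
# winning model on all of trn.
fit <- train(lettr ~ ., data = trn, method = 'rpart',
             trControl = trainControl(method = 'cv', number = 5),
             tuneGrid = data.frame(cp = c(0.01, 0.001, 1e-04, 1e-05)))
fit$bestTune  # cp value chosen by cross-validation

# Accuracy on the untouched test set.
mean(predict(fit, newdata = tst) == tst$lettr)
```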

  3. I completely agree with @harlanharris. `rpart()` is *only* to be used with somewhat careful choice of the tuning parameter. I’ve been using it in teaching for many years now, and always taught *how* to correctly tune it. I’m sorry Matt, but only using default argument settings is not acceptable with modern methods. Some have built-in cross-validation etc., others don’t. The very famous `glmnet` package cannot even be used sensibly if you don’t cross-validate.

    1. Thanks, Martin. Actually, I do use cross-validation extensively in my book. But I point out that it too can go wrong. It is subject to the “multiple inference” problem, so that some values of the parameter (e.g. cp here) appear great but actually are artifacts. If one does k-fold cross-validation, there is also the issue of choosing the value of k. And the corresponding theoretical foundations are shaky, i.e. statistical consistency is obtained only under certain conditions that are themselves shaky in a practical sense. In other words, there is no panacea, as we all know. — Norm (not Matt)
