Using CART: Implementation Matters

In preparing the following example for my forthcoming book, I was startled by how different two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the rpart package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:

> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871

OK, very similar.

Then I tried the well-known Letters Recognition data set from UCI. This too is a classification problem, one class for each of the capital letters ‘A’ through ‘Z’, with 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, rpart did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799

Of course, potential deficiences in CART led the original developers of CART to the notion of random forests, so I gave that a try.

> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).

I speculated that the cause might be that the response variables here (26 of them) are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it is an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.

On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552

Hmm…Not sure what to make of this, but it certainly is thought-provoking.

By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.

My regtools Package Is Now on CRAN

In my posts to this blog (less frequent than I would like, hopefully more frequent in the future), I’ve often mentioned my R package regtools, which contains a number of functions useful for regression and classification. None of them duplicate what is available in the excellent packages on CRAN, so I will dare characterize regtools as innovative. 🙂 In spite of the smiley emoticon in that last sentence, I hope you feel the same way.

You can download the package from CRAN, and yes, Joe R., it includes a vignette. 🙂 It begins with an example of one of the main features of the package, the use of nonparametric regression estimation to assess the fit of parametric regression models. I think/hope you’ll find it interesting.

Manny Parzen Used R!

Prof. Manny Parzen, a pioneer of modern statistics, passed away in February, aged 87. I should have commented back then, but it’s still worth saying something today. I happened to be thinking of him this morning.

I did not know Manny personally. This makes it odd that I refer to him by his first name, but I do so in the same spirit that people who are fans of Bernie Sanders but don’t know him personally, including myself, refer to him as Bernie. This right away tells you something about Manny.

It also says something about Manny that he stayed active long after retirement age — and that he was actually using R! That is remarkable for a theoretician of his era, and indeed maybe even for the present era.

I discovered this at the 43rd Symposium on the Interface of
Computing Science and Statistics
in 2012, where there was a special session honoring him. A couple of his former PhD students spoke, including my old professor, Don Ylvisaker, and Manny himself gave a talk.

Manny was pretty frail at the time, using a walker, yet mentally sharp as ever, and as I mentioned, still quite active in research. In fact, the young man sitting next to me turned out to be a current PhD student working under Manny.

At any rate, what jolted me was that in the midst of his talk, on the subtle theoretical properties of histograms and the like, he suddenly started talking about R! Specifically, he brought in R’s quantile() function, and went into a comparison of the 9 different options quantile() offers the user for defining quantiles. This really impressed me, and illustrates how much R has become a central tool.

By the way, after the talk I went up and introduced myself, and told him a joke that he had told many years ago when he gave a talk in the UC Davis Stat Dept. I won’t waste bandwidth here by going into the details of the joke, but it concerned students who guess randomly on exams. He laughed heartily at his own joke, and said, “It’s still true!”

StatET IDE for R

I personally do not use Integrated Development environments (IDEs) for R, or for that matter for any programming language. From my point of view, they take up too much precious real estate on the screen, and most important, they generally do not allow me to use my own text editor and my own abbreviations and macros. Since I want to have a uniform editing environment no matter whether I am programming, typing e-mail or whatever, IDEs are not for me. I just use Vim, with my own abbreviation collection.

So for example, I’ve never used RStudio (gasp!). I know many people who swear by it (and maybe a few who swear at it 🙂 ), but it’s just not for me. ESS (Emacs Speaks Statistics) is great for the hard core R people, and in fact I occasionally use it — I have an Emacs tutorial, by the way — but not too much.

If you do want to use an IDE for R, and want something different, I’ve always admired (from afar) StatET, for which there is a nice introduction in progress. It does take some configuration, and Eclipse requires some learning (I have a tutorial for it too), but StatET has always impressed me as very well-designed.

Disclaimer: I do not know whether the StatET article’s claims about RStudio are correct. My point is that the article is a nice introduction to StatET.

New Release of partools Package

My new release of partools is now on CRAN.

The package is aimed at doing parallel data science in what I call an “un-MapReduce” manner. It takes the point of view that MapReduce-based frameworks such as Hadoop and Spark are fine for the types of applications their designers had in mind, namely rather simple SQL actions, but have fundamental handicaps that prevent them from performing well on many, if not most, of the types of computation that typical users need for large data sets and/or highly compute-bound applications. The distributed file/object nature of those MapReduce systems is retained, but the confining MapReduce computational paradigm is avoided.

The package now contains about 30 functions, ranging from infrastructure support to summary and aggregation to statistical/machine learning applications. See the vignette for a fairly detailed introduction. Two new capabilities that I wish to highlight are:

  • Aggregation and related operations on objects of class “data.table”.
  • Parallel computation for some modern statistical/machine learning algorithms (they are statistics to me, but you may call them machine learning if you prefer).

The core of that second highlighted set of functions makes use of what I call Software Alchemy, which I have explained in previous blog posts. See for instance the example on random forests in the vignette.

Happy Paralleling. 🙂

Bad Coder, Bad Coder!

My title here is in the sense of “Bad dog, bad dog!”, a scolding I sometimes see dog owners use to tame their pets, and is also an allusion to Bad Reporter, a sometimes hilarious and always irreverent political comic strip in the San Francisco Chronicle. And my title is intended to convey the point that I think that “good programming practice” rules are sometimes taken overly seriously.

I’ll comment on two aspects here:

  • Variable names, spacing etc.
  • “Side effects.”

The second is the more unsettling of the two, but the first is interesting because Hadley Wickham is involved, which always gets people’s attention. 🙂

What prompted this is that there is currently a discussion in the ASA Statistical Learning and Data Mining Section e-mail list on Hadley’s coding style guidelines.  Some of you may know that Google has also had its own R style guidelines for some years now.

Much as I admire Hadley and Google, I really think this is much ado about nothing. I do have my own personal style (if you are curious, just look at my code), but seriously, folks, I don’t think it really matters. As long as a programmer is consistent  and has given careful thought as to how to be clear — not only clear to others who read the code, but also clear to the programmer herself when she reads the code cold two months from now — it’s fine.

I would like to see better style in programmers’ English, though. Please STOP using the terms “curly braces” and “square brackets”! Braces ARE curly, by definition, and brackets ARE square, right? Or, at least be consistent in your redundancy, speaking of “round parentheses,” “dotty periods,” “long dashes,” “big capitals,” “elevated overbars” and so on. 🙂

And now, for the hard stuff, side effects. R is a functional language, meaning among other things that no variable is changed upon executing a function, other than assignment of the return value. In other words,

y <- f(x)

will change y but neither x nor any other variables, e.g. globals, will be changed.

For many in the R community, this restriction is followed with religious-like fervor. They consider it safe, clear and easy to debug.

Yes, yes, I understand the arguments, and the No Side Effecters have a point. But it is much too dogmatic, in my opinion, and often causes one to incur BIG performance penalties.

I am not here to praise Caeser, but I’m not here to bury him either. 🙂 If you want to subscribe to that anti-side effects philosophy, fine. My only point is that, like it or not, R is slowly moving away from the banning of side effects. I would cite a few examples:

  • Reference classes. R has always had some ways to create side effects for the antisocial 🙂 , one notable example being environments. Reference classes carry that loophole one step further.
  • The bigmemory package. Yes, it was developed as a workaround to R’s 32-bit memory constraints (somewhat but not completely resolved in more recent versions of R), but I’m told that one of the reasons for the package’s popularity is its ability change data in-place in memory. This is basically a side effects issue, as traditionally the assignment to y above has created a new, separate copy of y in memory, often a big performance killer. (Again, this too has been ameliorated somewhat in recent versions of R.)
  • The data.table package. Same comments here as for the “insidious” use of bigmemory above, but also consider the operation> setkey(mtc1,cyl).Ever notice that no assignment is made? If you rave about data.table, you should keep in mind what factors underly that lightning speed.

I won’t get into the global variables issue here, other than to say again that I think the arguments against them often tend to be more ideological than logical.

In summary, I believe that careful thought, rather than an obsession with rules, is your ticket to good — clear and efficient — code.


Latest on the Julia Language (vs. R)

I’ve written before about the Julia language. As someone who is very active in the R community, I am biased of course, and have been (and remain) a skeptic about Julia. But I would like to report on a wonderful talk I attended today at Stanford. To my surprise and delight, the speaker, Viral Shah of Julia Computing Inc, focused on the “computer science-y” details, i.e. the internals and the philosophy, quite interesting and certainly very impressive.

I had not previously known, for instance, how integral the notion of typing was in Julia, e.g. integer vs. float, and the very extensive thought processses in the Julia group that led to this emphasis. And it was fun to see the various cool Julia features that appeal to a systems guy like me, e.g. easy viewing of the assembly language implementation  of a Julia function.

I was particularly interested in one crucial aspect that separates R from other languages that are popular in data science applications — NA values. I asked the speaker about that during the talk, only to find that he had anticipated this question and had devoted space in his slides to it. After covering that topic, he added that this had caused considerable debate within the Julia team as to how to handle it, which turned out to be something of a compromise.

Well, then, given this latest report on Julia (new releases coming soon), what is MY latest? How do I view it now?

As I’ve said here before, the fact that such an eminent researcher and R developer, Doug Bates of the University of Wisconsin, has shifted his efforts from R to Julia is enough for me to hold Julia in high regard, sight unseen. I had browsed through some Julia material in the past, and had seen enough to confirm that this is a language to be reckoned with. Today’s talk definitely raised my opinion of the language even further. But…

I am both a computer scientist and a statistician. Though only my early career was in a Department of Statistics (I was one of the founders of the UC Davis Stat. Dept.), I have done statistics throughout my career. And my hybrid status plays a key role in how I view Julia.

As a computer scientist, especially one who likes to view things at the systems levels, Julia is fabulous. But as a statistician, speed is only one of many crucial aspects of the software that I write and use. The role of NA values in R is indispensable, I say, not something to be compromised. And even more importantly, what I call the “helper” infrastructure of R is something I would be highly loathe to part with, things like naming of vector elements and matrix rows for instance. Such things have led to elegant solutions to many problems in software that I write.

And though undoubtedly (and hopefully) more top statisticians like Doug Bates will become active Julia contributors, the salient fact about R, as I always say, is that R is written for statisticians by statisticians. It matters. I believe that R will remain the language of choice in statistics for a long time to come.

And so, though my hat is off to Viral Shah, I don’t think Julia is about to “go viral” in the stat world in the foreseeable future. 🙂