May | 2014 | Mad (Data) Scientist

Before I left for China a few weeks ago, I said my next post would be on our Rth parallel R package. It’s not quite ready yet, so today I’ll post one of the topics I spoke on last night at the Berkeley R Language Beginners Study Group. Thanks to the group for inviting me, and thanks to Allan Miller for suggesting I address this topic.

A couple of years ago, the Julia language was announced, and by releasing some rather unfair timing comparisons, the Julia group offended some in the R community. Later, some in the Python world decided that the right tool for data science ought to be Python (supplemented by NumPy etc.). Claims started appearing on the Web that R’s king-of-the-hill status in data science would soon evaporate, with R being replaced by one of these other languages, if not something else.

I chose the lighthearted title of this post as a hint that I am not doctrinaire on this topic. R is great, but if something else comes along that’s better, I’ll welcome it. But the fact is that I don’t see that happening, as I will explain in this post.

Actually, I’m a big fan of Python. I’ve been using it (and teaching with it) for years. It’s exceptionally clean and elegant, so much nicer than Perl. And for those who feel that object-orientation is a necessity (I’m not such a person), Python’s OOP structures are again clean and elegant. I’ve got a series of tutorials on the language, if you are seeking a quick, painless introduction.

I know less about Julia, but what I’ve seen looks impressive, and the fact that prominent statistician and R expert Doug Bates has embraced it should carry significant weight with anyone.

Nevertheless, I don’t believe that Python or Julia will become “the new R” anytime soon, or ever. Here’s why:

First, R is written by statisticians, for statisticians.

It matters. An Argentinian chef, say, who wants to make Japanese sushi may get all the ingredients right, but likely it just won’t work out quite the same. Similarly, a Pythonista could certainly cook up some code for some statistical procedure by reading a statistics book, but it wouldn’t be quite same. It would likely be missing some things of interest to the practicing statistician. And R is Statistically Correct.

For the same reason, I don’t see Python or Julia building up a huge code repository comparable to CRAN. Not only does R have a gigantic head start, but also there is the point that statistics simply is not Python’s or Julia’s central mission; the incentives to get that big in data science just aren’t there, I believe.

(This is not to say that CRAN doesn’t need improvement. It needs much better indexing, and maybe even a Yelp-style consumer review facility.)

Now, what about the speed issue? As mentioned, the speed comparisons with R (and with other languages) offered by the Julia people were widely regarded as unfair, as they did not take advantage of R’s speedy vectorization features. Let’s take a look at another example that has been presented in the R-vs.-Julia debate.

Last year I attended a talk in our Bay Area R Users Group, given by a highly insightful husband/wife team. Their main example was simulation of random walk.

In their trial run, Julia was much faster than R. But I objected, because random walk is a sum. Thus one can generate the entire process in R as vector calls, one to generate the steps and then a call to cumsum(), e.g.

> rw <- function(nsteps) {
+    steps <- sample(c(-1,1),nsteps,
+    replace=TRUE)
+    cumsum(steps)
+ }
> rw(100)
  [1]  1  2  3  2  3  2  1  0  1  0 -1 -2 -1  0  1  0 -1  0 -1 -2 -3 -2 -1  0  1
 [26]  0  1  2  1  2  3  2  1  2  3  2  1  2  1  0  1  0  1  0  1  2  3  4  5  4
 [51]  3  2  1  0 -1 -2 -1  0 -1  0  1  0  1  0  1  0 -1  0  1  0 -1 -2 -3 -4 -3
 [76] -4 -3 -4 -3 -2 -3 -2 -3 -2 -3 -4 -3 -4 -3 -2 -1  0 -1  0 -1 -2 -1 -2 -1 -2

So for example, in the simulation, at the 76th step we were at position -4.

This vectorized R code turned out to be much faster than the Julia code–more than 1000 times faster, in fact, in the case of simulation 1000000 steps. For 100000000 steps, Julia actually is much faster than R, but the point is that the claims made about Julia’s speed advantage are really overblown.

For most people, I believe the biggest speed issue is for large data manipulation rather than computation. But recent R packages such as data.table and dplyr take care of that quite efficiently. And for serial computation, Rcpp and its related packages ease C/C++ integration.

Note my qualifier “serial” in that last sentence. For real speed, parallel computation is essential. And I would argue that here R dominates Python and Julia, at least at present.

Python supports threading, the basis of multicore computation. But its type of threading is not actually parallel; only one thread/core can be active at a time. This has been the subject of huge controversy over the years, so Guido Van Rossum, inventor of the language, added a multiprocessing module. But it’s rather clunky to use, and my experience with it has not been good. My impression of Julia’s parallel computation facilities so far, admittedly limited, is similar.

R, by contrast, features a rich variety of packages available for parallel computing. (Again, I’ll discuss Rth in my next post.) True, there is also some of that for Python, e.g. interfaces of Python to MPI. But I believe it is fair to say that for parallel computing, R beats Python and Julia.

Finally, in our Bay Area R meeting last week, one speaker made the audacious statement, “R is not a full programming language.” Says who?! As I mentioned earlier, I’m a longtime Python fan, but these days I even do all my non-stat coding in R, for apps that I would have used Python for in the past. For example, over the years I had developed a number of Python scripts to automate the administration of the classes I teach. But last year, when I wanted to make some modifications to them, I decided to rewrite them in R from scratch, so as to make future modifications easier for me.

Every language has its stellar points. I’m told, for example, that for those who do a lot of text processing, Python’s regular expression facilities are more extensive than R’s. The use of one of the R-Python bridge packages may be useful here, and indeed interlanguage connections may be come more common as time goes on. But in my view, it’s very unlikely that Python or Julia will become more popular than R among data scientists.

So, take THAT, Python and Julia! 🙂

	Anonymous on Just How Good Is ChatGPT in Da…
	Quantile Regression… on Quantile Regression with Rando…
	Anonymous on Quantile Regression with Rando…
	Sina Özdemir on qeML Example: Nonparametric Qu…
	Anonymous on qeML Example: Nonparametric Qu…

Mad (Data) Scientist

Monthly Archives: May 2014

R beats Python! R beats Julia! Anyone else wanna challenge R?

Musings, useful code etc. on R and data science