R beats Python! R beats Julia! Anyone else wanna challenge R?

Before I left for China a few weeks ago, I said my next post would be on our Rth parallel R package. It’s not quite ready yet, so today I’ll post one of the topics I spoke on last night at the Berkeley R Language Beginners Study Group. Thanks to the group for inviting me, and thanks to Allan Miller for suggesting I address this topic.

A couple of years ago, the Julia language was announced, and by releasing some rather unfair timing comparisons, the Julia group offended some in the R community. Later, some in the Python world decided that the right tool for data science ought to be Python (supplemented by NumPy etc.). Claims started appearing on the Web that R’s king-of-the-hill status in data science would soon evaporate, with R being replaced by one of these other languages, if not something else.

I chose the lighthearted title of this post as a hint that I am not doctrinaire on this topic. R is great, but if something else comes along that’s better, I’ll welcome it. But the fact is that I don’t see that happening, as I will explain in this post.

Actually, I’m a big fan of Python. I’ve been using it (and teaching with it) for years. It’s exceptionally clean and elegant, so much nicer than Perl. And for those who feel that object-orientation is a necessity (I’m not such a person), Python’s OOP structures are again clean and elegant. I’ve got a series of tutorials on the language, if you are seeking a quick, painless introduction.

I know less about Julia, but what I’ve seen looks impressive, and the fact that prominent statistician and R expert Doug Bates has embraced it should carry significant weight with anyone.

Nevertheless, I don’t believe that Python or Julia will become “the new R” anytime soon, or ever. Here’s why:

First, R is written by statisticians, for statisticians.

It matters. An Argentinian chef, say, who wants to make Japanese sushi may get all the ingredients right, but likely it just won’t work out quite the same. Similarly, a Pythonista could certainly cook up some code for some statistical procedure by reading a statistics book, but it wouldn’t be quite same. It would likely be missing some things of interest to the practicing statistician. And R is Statistically Correct.

For the same reason, I don’t see Python or Julia building up a huge code repository comparable to CRAN. Not only does R have a gigantic head start, but also there is the point that statistics simply is not Python’s or Julia’s central mission; the incentives to get that big in data science just aren’t there, I believe.

(This is not to say that CRAN doesn’t need improvement. It needs much better indexing, and maybe even a Yelp-style consumer review facility.)

Now, what about the speed issue? As mentioned, the speed comparisons with R (and with other languages) offered by the Julia people were widely regarded as unfair, as they did not take advantage of R’s speedy vectorization features. Let’s take a look at another example that has been presented in the R-vs.-Julia debate.

Last year I attended a talk in our Bay Area R Users Group, given by a highly insightful husband/wife team. Their main example was simulation of random walk.

In their trial run, Julia was much faster than R. But I objected, because random walk is a sum. Thus one can generate the entire process in R as vector calls, one to generate the steps and then a call to cumsum(), e.g.

> rw <- function(nsteps) {
+    steps <- sample(c(-1,1),nsteps,
+    replace=TRUE)
+    cumsum(steps)
+ }
> rw(100)
  [1]  1  2  3  2  3  2  1  0  1  0 -1 -2 -1  0  1  0 -1  0 -1 -2 -3 -2 -1  0  1
 [26]  0  1  2  1  2  3  2  1  2  3  2  1  2  1  0  1  0  1  0  1  2  3  4  5  4
 [51]  3  2  1  0 -1 -2 -1  0 -1  0  1  0  1  0  1  0 -1  0  1  0 -1 -2 -3 -4 -3
 [76] -4 -3 -4 -3 -2 -3 -2 -3 -2 -3 -4 -3 -4 -3 -2 -1  0 -1  0 -1 -2 -1 -2 -1 -2

So for example, in the simulation, at the 76th step we were at position -4.

This vectorized R code turned out to be much faster than the Julia code–more than 1000 times faster, in fact, in the case of simulation 1000000 steps. For 100000000 steps, Julia actually is much faster than R, but the point is that the claims made about Julia’s speed advantage are really overblown.

For most people, I believe the biggest speed issue is for large data manipulation rather than computation. But recent R packages such as data.table and dplyr take care of that quite efficiently. And for serial computation, Rcpp and its related packages ease C/C++ integration.

Note my qualifier “serial” in that last sentence. For real speed, parallel computation is essential. And I would argue that here R dominates Python and Julia, at least at present.

Python supports threading, the basis of multicore computation. But its type of threading is not actually parallel; only one thread/core can be active at a time. This has been the subject of huge controversy over the years, so Guido Van Rossum, inventor of the language, added a multiprocessing module. But it’s rather clunky to use, and my experience with it has not been good. My impression of Julia’s parallel computation facilities so far, admittedly limited, is similar.

R, by contrast, features a rich variety of packages available for parallel computing. (Again, I’ll discuss Rth in my next post.) True, there is also some of that for Python, e.g. interfaces of Python to MPI. But I believe it is fair to say that for parallel computing, R beats Python and Julia.

Finally, in our Bay Area R meeting last week, one speaker made the audacious statement, “R is not a full programming language.” Says who?! As I mentioned earlier, I’m a longtime Python fan, but these days I even do all my non-stat coding in R, for apps that I would have used Python for in the past. For example, over the years I had developed a number of Python scripts to automate the administration of the classes I teach. But last year, when I wanted to make some modifications to them, I decided to rewrite them in R from scratch, so as to make future modifications easier for me.

Every language has its stellar points. I’m told, for example, that for those who do a lot of text processing, Python’s regular expression facilities are more extensive than R’s. The use of one of the R-Python bridge packages may be useful here, and indeed interlanguage connections may be come more common as time goes on. But in my view, it’s very unlikely that Python or Julia will become more popular than R among data scientists.

So, take THAT, Python and Julia! 🙂

124 thoughts on “R beats Python! R beats Julia! Anyone else wanna challenge R?”

  1. Echo that. I am a Julia user myself and I tend to use Julia most where R fails e.g. when mongolite failed for my use case and I switched to Golang + Julia. Or if I do regex manipulation then I go to Julia. I would say the best single computer in-memory data manipulation tool is R with data.table + dplyr, I mainly care about speed so data.table > pandas, so panda is a close 2nd. As I have discussed in my presentation (https://prezi.com/view/zeQ5p3EYlUodiJ65pElW/) I think it’s possible for Julia to one day (2019+) be the best single-computer in-memory data manipulation tool. But I think the real promise of Julia is in JuliaDB.jl which can take on the juggernaut that is SAS for disk-based data manipulation. Plus it can take on dask for distributed tasks. Hey, one day it may be able to take on Spark as well but I do note that Spark is very different to JuliaDB.jl.

  2. It is amazing that there are still new comments to this four year old post, but it does show the number of people who are interested in “picking the right language” for their analysis. I was one of them but however decided to develop SoS Notebook (https://vatlab.github.io/sos-docs/) to solve this problem. Basically, with feather (https://github.com/wesm/feather) that supports R, Python, and Julia, people can switch freely between languages and there is no need to stick to a single “best” language, and SoS Notebook is designed for that.

Leave a reply to mumon Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.