R beats Python! R beats Julia! Anyone else wanna challenge R?

Before I left for China a few weeks ago, I said my next post would be on our Rth parallel R package. It’s not quite ready yet, so today I’ll post one of the topics I spoke on last night at the Berkeley R Language Beginners Study Group. Thanks to the group for inviting me, and thanks to Allan Miller for suggesting I address this topic.

A couple of years ago, the Julia language was announced, and by releasing some rather unfair timing comparisons, the Julia group offended some in the R community. Later, some in the Python world decided that the right tool for data science ought to be Python (supplemented by NumPy etc.). Claims started appearing on the Web that R’s king-of-the-hill status in data science would soon evaporate, with R being replaced by one of these other languages, if not something else.

I chose the lighthearted title of this post as a hint that I am not doctrinaire on this topic. R is great, but if something else comes along that’s better, I’ll welcome it. But the fact is that I don’t see that happening, as I will explain in this post.

Actually, I’m a big fan of Python. I’ve been using it (and teaching with it) for years. It’s exceptionally clean and elegant, so much nicer than Perl. And for those who feel that object-orientation is a necessity (I’m not such a person), Python’s OOP structures are again clean and elegant. I’ve got a series of tutorials on the language, if you are seeking a quick, painless introduction.

I know less about Julia, but what I’ve seen looks impressive, and the fact that prominent statistician and R expert Doug Bates has embraced it should carry significant weight with anyone.

Nevertheless, I don’t believe that Python or Julia will become “the new R” anytime soon, or ever. Here’s why:

First, R is written by statisticians, for statisticians.

It matters. An Argentinian chef, say, who wants to make Japanese sushi may get all the ingredients right, but likely it just won’t work out quite the same. Similarly, a Pythonista could certainly cook up some code for some statistical procedure by reading a statistics book, but it wouldn’t be quite the same. It would likely be missing some things of interest to the practicing statistician. And R is Statistically Correct.

For the same reason, I don’t see Python or Julia building up a huge code repository comparable to CRAN. Not only does R have a gigantic head start, but also there is the point that statistics simply is not Python’s or Julia’s central mission; the incentives to get that big in data science just aren’t there, I believe.

(This is not to say that CRAN doesn’t need improvement. It needs much better indexing, and maybe even a Yelp-style consumer review facility.)

Now, what about the speed issue? As mentioned, the speed comparisons with R (and with other languages) offered by the Julia people were widely regarded as unfair, as they did not take advantage of R’s speedy vectorization features. Let’s take a look at another example that has been presented in the R-vs.-Julia debate.

Last year I attended a talk in our Bay Area R Users Group, given by a highly insightful husband/wife team. Their main example was simulation of a random walk.

In their trial run, Julia was much faster than R. But I objected, because a random walk is just a cumulative sum. Thus one can generate the entire process in R with two vectorized calls: one to generate the steps, and then a call to cumsum(), e.g.

> rw <- function(nsteps) {
+    steps <- sample(c(-1,1),nsteps,
+    replace=TRUE)
+    cumsum(steps)
+ }
> rw(100)
  [1]  1  2  3  2  3  2  1  0  1  0 -1 -2 -1  0  1  0 -1  0 -1 -2 -3 -2 -1  0  1
 [26]  0  1  2  1  2  3  2  1  2  3  2  1  2  1  0  1  0  1  0  1  2  3  4  5  4
 [51]  3  2  1  0 -1 -2 -1  0 -1  0  1  0  1  0  1  0 -1  0  1  0 -1 -2 -3 -4 -3
 [76] -4 -3 -4 -3 -2 -3 -2 -3 -2 -3 -4 -3 -4 -3 -2 -1  0 -1  0 -1 -2 -1 -2 -1 -2

So for example, in the simulation, at the 76th step we were at position -4.

This vectorized R code turned out to be much faster than the Julia code, more than 1000 times faster, in fact, in the case of simulating 1,000,000 steps. For 100,000,000 steps, Julia actually is much faster than R, but the point is that the claims made about Julia’s speed advantage are really overblown.
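To make the comparison concrete, here is a sketch of my own (not the benchmark from the talk) contrasting a loop-based walk with the vectorized one; exact timings vary by machine, but the gap is dramatic:

```r
# Loop-based random walk: one sample() call per step.
rw_loop <- function(nsteps) {
  pos <- numeric(nsteps)
  cur <- 0
  for (i in 1:nsteps) {
    cur <- cur + sample(c(-1, 1), 1)
    pos[i] <- cur
  }
  pos
}

# Vectorized version, as in the post: one sample() call, one cumsum().
rw_vec <- function(nsteps) {
  cumsum(sample(c(-1, 1), nsteps, replace = TRUE))
}

t_loop <- system.time(rw_loop(100000))["elapsed"]
t_vec  <- system.time(rw_vec(100000))["elapsed"]
```

At this problem size the vectorized version typically wins by orders of magnitude, which is the whole point of the example.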

For most people, I believe the biggest speed issue is for large data manipulation rather than computation. But recent R packages such as data.table and dplyr take care of that quite efficiently. And for serial computation, Rcpp and its related packages ease C/C++ integration.

Note my qualifier “serial” in that last sentence. For real speed, parallel computation is essential. And I would argue that here R dominates Python and Julia, at least at present.

Python supports threading, the basis of multicore computation. But its threads are not actually parallel: the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so only one core can be active. This has been the subject of huge controversy over the years, so Guido van Rossum, inventor of the language, added a multiprocessing module. But it’s rather clunky to use, and my experience with it has not been good. My impression of Julia’s parallel computation facilities so far, admittedly limited, is similar.

R, by contrast, features a rich variety of packages available for parallel computing. (Again, I’ll discuss Rth in my next post.) True, there is also some of that for Python, e.g. interfaces of Python to MPI. But I believe it is fair to say that for parallel computing, R beats Python and Julia.
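For the flavor of it, here is a minimal sketch using the base parallel package (bundled with R since version 2.14); the simulation function and the worker count are placeholders of my own, not from any particular benchmark:

```r
# Embarrassingly parallel simulation with the base 'parallel' package.
# A PSOCK cluster works on all platforms; on Unix, mclapply() is a
# lighter-weight, fork-based alternative.
library(parallel)

# One independent replication: the maximum of a 10,000-step random walk.
sim_one <- function(i) {
  max(cumsum(sample(c(-1, 1), 10000, replace = TRUE)))
}

cl <- makeCluster(2)                # two worker processes
res <- parLapply(cl, 1:8, sim_one)  # 8 independent simulations
stopCluster(cl)
length(res)   # 8
```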

Finally, in our Bay Area R meeting last week, one speaker made the audacious statement, “R is not a full programming language.” Says who?! As I mentioned earlier, I’m a longtime Python fan, but these days I even do all my non-stat coding in R, for apps that I would have used Python for in the past. For example, over the years I had developed a number of Python scripts to automate the administration of the classes I teach. But last year, when I wanted to make some modifications to them, I decided to rewrite them in R from scratch, so as to make future modifications easier for me.

Every language has its stellar points. I’m told, for example, that for those who do a lot of text processing, Python’s regular expression facilities are more extensive than R’s. The use of one of the R-Python bridge packages may be useful here, and indeed interlanguage connections may become more common as time goes on. But in my view, it’s very unlikely that Python or Julia will become more popular than R among data scientists.
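That said, base R’s regex functions do accept perl = TRUE, which switches to the PCRE engine, so Perl-style constructs such as lookaheads work. A small sketch (the file names are invented for illustration):

```r
# perl = TRUE enables PCRE features such as lookaheads in base R.
x <- c("file1.csv", "notes.txt", "file2.csv")
grepl("\\.csv$", x)   # which entries are CSVs?

# Prefix only the CSV base names, using a lookahead; non-matches pass
# through unchanged.
sub("(\\w+)(?=\\.csv)", "data_\\1", x, perl = TRUE)
```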

So, take THAT, Python and Julia! 🙂


120 thoughts on “R beats Python! R beats Julia! Anyone else wanna challenge R?”

  1. In the sentence “there is the point that statistics simply is not Python’s or R’s central mission” did you mean to write Julia instead of R?

  2. R is a fantastic prototyping and statistical tool, but it has serious deficiencies as a full programming language. It is extremely slow when it comes to data manipulation (dplyr etc. notwithstanding, but these now require me to learn a new package), and while vectorizing and parallelization are possible for the “pure” algorithmic side of the coding, as soon as we reach the realm of “actual” programming tasks, inevitably interfacing to the real world (read “other people”), R becomes unwieldy. You could never write a proper web framework in R, for example (just look at the hoops Shiny has to jump through to get anything except toy examples up and running – and usually having to resort to Javascript anyway), because it does not have robust code-architecture tools, not least of which is multi-threading. Python multi-threading may be useless for code speedup, but it is extremely useful for any code that is IO bound, while R just sits there and waits. R’s data attributes are also very useful in many contexts, but throw up horrendously obtuse bugs when these attributes don’t play nice with some package or other, requiring all sorts of ugly casting funny business.

    I have been using R for many years now for very advanced and very rewarding analyses, but as soon as one reaches the “success” point, in many cases wanting to share some wonderful project with the wider world, suddenly you’re needing to re-code the whole thing in a different language (in my case, Python), which is a pity.

    1. First, let me quote my former department chair: “If you have objections, then you’re right!” 🙂

      I completely agree that threading is good for I/O, and that Python handles threading well; see my Python tutorial. But my R package Rdsm also enables threaded programming for R; I urge you to try it. By the way, I have a threaded I/O example using Rdsm in my R programming book (using the old interface, replaced last year by a cleaner one).

      Regarding data manipulation: The core issue is whether writing to a data object requires reallocating memory for the object. This problem stems from R’s functional programming philosophy, but it is definitely being addressed. R in general is getting better at avoiding the reallocation (esp. R 3.1). Concerning data frames in particular, it is my understanding that data.frame avoids the reallocation while dplyr does not. You may wish to give data.frame a try, and even better, bigmemory. Indeed, I was told by one of the authors of bigmemory that it is this aspect that users of bigmemory find most attractive.
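      The reallocation cost is easy to see; here is a small sketch of my own (not from the package discussion above) contrasting growing a vector element by element with preallocating it:

```r
# Each c() call below reallocates the vector and copies every existing
# element, so building a length-n vector this way does O(n^2) work.
grow <- function(n) {
  v <- c()
  for (i in 1:n) v <- c(v, i)   # reallocates on every iteration
  v
}

# Preallocating once and writing in place does O(n) work.
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) v[i] <- i
  v
}

system.time(grow(20000))      # noticeably slower
system.time(prealloc(20000))  # roughly linear, much faster
```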

      1. you mean data.table, a main idea of the package is to provide data.frame like objects that are fully call-by-reference instead of call-by-value (which all of R’s standard data types are to the outside)

      2. Norman, thank you for these pieces of pertinent advice. I am not running 3.1 yet and I will try it. Moreover, I have, interestingly enough, found that Windows 8.1 is much better than 7 when it comes to the memory performance of R. I don’t yet know why that is. I will investigate Rdsm and bigmemory. If these packages help me to “stay inside R” that will be massively worthwhile since, let’s be fair, R beats everything else when it comes to actually getting analysis done. What a pleasure if the world of analysis and the world of deployment were able to converge.

    2. Author of Shiny here. It would be very easy to write a proper web framework in R, we just didn’t believe that’s what the R community wants–at least not more than what we built. And I’d love to hear what kind of nontrivial web apps you can write in Python without slinging HTML and JavaScript.

      I believe that R as a language is far more expressive than Python–its metaprogramming facilities are hugely underappreciated even by R fans. It does however have more warts, I don’t think that’s controversial.

        1. We all owe you a great debt, Joe, for Shiny. Indeed, I plan to deploy parts of my code with your services once they have bedded down. What I find problematic, and unfortunately a large part of “real world” data manipulation has this aspect, is the poor performance of R’s memory allocation framework. As Norman alludes to above, growing arrays is painfully slow and inefficient in R, whereas it is simple, and much faster, in Python. This is something that any piece of serious online software is going to need if it is dealing with new incoming data in semi-real time, be that financial data (in my case), social data, or any form of new real time data which the new world of “the internet of things” presages. R is just not ready for this new world. A simple example: just try to do a simple, fast interface to mongodb using R. It’s painfully (unusably) slow, mainly because of the need to reallocate memory. Yes I could jump through hoops (and I do) to make it faster, but Python still beats it hands down, and usually by an order of magnitude, while making it *much* simpler to do. Part of the reason for that, of course, is that packages that do this sort of thing (in this case the Pymongo library) are written in a much more serious way than packages that do the same in R. Another example: tailable cursors (crucial for event-driven style programming when new data comes into the database). Works in Python, doesn’t in R. Of course this is a community issue, it could in theory be done. But it isn’t, and part of the reason for that is that there is no audience in R for this kind of thing, and the reason there is no audience in R is that its design makes it difficult and so the audience is never attracted.

        R is amazing for static data sets and I am not abandoning it any time soon. But to me its underlying “philosophy” is superb statistical analysis of static data sets. It’s made the design choices that make it (or anything designed on it) difficult to scale into a big data, real time data world, especially for production.

        My criticism of R, by the way, is based on frustration, because I have tried *very hard* to make my real time financial data analysis software, which must juggle very large arrays around and update both data, and complex analyses, in semi-real time, work completely in R. It’s just too many “gotchas” that I keep running into. I wish I were wrong.

      2. btw when I say “bedded down” below I mean when *my* software has bedded down. I am using Shiny already internally for prototyping, and plan to use your cloud services for deployment externally.

    3. — it has serious deficiencies as a full programming language.

      That’s because it isn’t. Beginning from S, the language has tried to be both stat command pack syntax and some sort of programmable engine. As the saying goes, “No one can serve two masters. Either you will hate the one and love the other, or you will be devoted to the one and despise the other. You cannot serve both God and money.” I’ll leave it to the reader to analogize which of R’s purposes is God and which is money.

      For 99.44% of R users, command execution is sufficient. Those that want to use a “real” programming language should just write C, and an R wrapper to their C code. It’s not clear to me that the programmer tails should be wagging the command dog.

      1. Interesting take on R and S as ‘languages’ because that’s the biggest objections users of SAS, SPSS and Minitab had to R before there were GUIs like R Commander – “I have to learn a programming language to use R! No way!” This is despite the fact that SAS, SPSS and Minitab evolved from BIOMED, a FORTRAN library that had to be called from FORTRAN programs.


        1. The GUIs on these statistical packages were only created when they were ported to Windows from mainframes and minicomputers.
        2. They do have scripting languages. They look a lot like a macro assembler or a pre-Unix-shell JCL, not a modern language like Python or Ruby.

        R was originally a dialect of the S *programming language*. Said language was fashioned from what the creators thought was the best parts of FORTRAN, Lisp and APL.

        Yes, R has deficiencies as a programming language. Its ancestry is imperative and functional and its original object system is based on standard statistical processes done interactively at a REPL, not object-oriented programming as in Smalltalk, Python or Ruby. But it *is* a full programming language, and is equally adaptable to the imperative and functional programming paradigms.
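        A tiny sketch of that dual nature, computing the same sum of squares in both the imperative and the functional style:

```r
# Imperative style: accumulate in a loop.
total <- 0
for (x in 1:10) total <- total + x^2

# Functional style: the same computation with Map/Reduce
# (or, more idiomatically, simply sum((1:10)^2)).
total_f <- Reduce(`+`, Map(function(x) x^2, 1:10))

total == total_f   # TRUE
```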

  3. Nice post Matloff! I have similar thoughts on this!

    Now, just to spice things up, what do you think of this problem?


    It was the first time I used Julia, and the speed really impressed me! Nevertheless, I ended up not using Julia because, even though the Bayesian simulation is faster, everything else takes much longer (coding, debugging etc). My final solution was to parallelize on a 4-core / 8-thread machine, so that the R code takes a manageable time for various different simulations.

    But, do you think there would be a way to make the R code faster, like in your random walk example? Or is this Julia performance really hard to beat?

    1. A while loop is difficult to vectorize, and for that matter, to parallelize. You could try splitting the loop into chunks, and vectorize/parallelize each one, but it would be unwieldy and not as good as C/C++. I’d recommend C/C++, together with Rcpp. The latter is really worth learning anyway.

      Someone mentioned that R now offers byte compiling. Good point, but it doesn’t always result in a speedup. But it’s certainly quick and easy to do, so give it a try.

      1. Well, it turns out that Rcpp is easier than I thought! With some (bad) coding I was able to reduce the time to 40 seconds without parallelization.

    2. Well, for openers: anything involving lots of random number generation should be refactored so you can do the random number generation with full vectorization / parallelization. The recipes are in the O’Reilly book “Parallel R”.

  4. “Python’s regular expression facilities are more extensive than R’s”

    Not really – as far back as I can remember, R has supported the Perl-compatible regular expressions (PCRE) library.

    One other pro-R note: for loops which don’t lend themselves to vectorization or parallelization, R now has byte code compilation.
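    A minimal sketch with the base compiler package (note that later R versions came to byte-compile functions automatically on first use, so the explicit call matters most on older installations):

```r
library(compiler)  # base package, shipped with R

# A deliberately loop-heavy function.
f <- function(n) {
  s <- 0
  for (i in 1:n) s <- s + i
  s
}

fc <- cmpfun(f)  # byte-compiled version of f
fc(1000)         # 500500: same answer, often (though not always) faster
```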

  5. I’m very sick of seeing pythonistas always going with perl bashing (I mean it even happens in the book Programming Python, not a good look). They’re equally capable languages with similar philosophies. Python’s less flexible syntax makes it more suited for some tasks, and may ameliorate the lone-coder problem that perl can create with respect to uncommunicative egomaniac programmers. However in terms of language features they’re basically equal and have very similar fundamental designs. Python’s strengths are greater in some places and perl’s in others. So this is just a shout out to the python community to please give the perl bashing a rest.

    R’s major problem is it’s designed by statisticians. When I come across data mangling code in R I generally shudder.

  6. I make extensive use of both Python and R so I feel qualified to make comment. I think you’re right in that Python won’t replace R in the field of statistics, as it’s like you say, “R is written by statisticians, for statisticians.” My guess is you’re a statistician? Traditionally the numpy/Python community is closer to the userbase of Matlab. It has wide use among engineers, financial modelers, physicists, geoscientists and climate modelers and this is reflected in maturity of the Python toolkits, such as scipy, scikit-learn and pandas.

    Similarly, the flavour and power of the languages reflects their background. R is a statistics package that has general purpose applications language functionality bolted on after the fact, while Python is the reverse.

    Python walks all over R in app development, particularly for one-stop shop software, such as a combined analysis engine and web server with a database backend. Also, Python is *much* better at text manipulation. I think your statement around Pythonistas who write stats libraries not being statistically literate is a little misinformed and somewhat prejudiced. The Python stats libraries (such as scikit-statsmodels) are generally written by practicing analysts/scientists who are very literate in the numerical methods. The implemented stats methods in Python certainly reflect the community that built them (many physicists, financial modelers and computer scientists). I’ve read the source code of a number of academic R stats packages and it can be pretty damn horrible. The Python source code is generally much tidier.

    I agree that multiprocessing in R is easier, the functional design lends itself to that. However, it’s less flexible and very geared around lists and the apply statement (e.g. mclapply). Python multiprocessing is the one part of the language that feels like a hack (to get around the global interpreter lock) and has caveats if you’re using OO code. However, after the learning curve, it is much finer grained in multiprocessing control and is great for developing *applications*.

    With reference to speedups, while R has Rcpp, Python has great packages such as Cython which integrate C directly into the code in a beautiful, clean, Pythonic way. This reduces development time and makes code maintenance easier. Python also has far better profiling and debugging tools, which is a reflection of the large computer scientist community around the language.

    My typical use cases for each language are:
    R: Stats, most bioinformatics, quick exploration/hack of some data.
    Python: text processing, app for other users, functionality to be reused where OO programming makes sense, machine learning.

    Most of the time I use R as generally I have some untidy data that I wish to quickly analyse and out-of-the-box, it’s faster to write R code. Numpy is great for matrix-style data, but it’s not often I handle this – more data.frame, untidy, mixed type data. However, Python is closing the gap with pandas, which is like an R data.frame on steroids.

    1. You ask if I’m a statistician. Yes, but I’m actually a computer science professor, and am a former developer in Silicon Valley (long ago). And yes, I’m aware of Cython, pypy etc., and have done my share of interfacing C/C++ code to Python. I’m actually more acquainted with the innards of Python than R. As I said, I’m a big Python fan.

      I stand by what I said about Pythonistas not having the background to write good statistical code. I have enormous respect for the intellectual capacity of physicists, but I’ve met very few who understand statistics–though I’ve met many who think they do. 🙂 Ditto for computer scientists [except for the intellectual part 🙂 ].

      1. Fair enough, but I would argue that people who don’t fully understand a method generally don’t distribute software for that method. The flavour of the mature Python libraries suggests that Pythonistas are far more comfortable with Bayesian stats and machine learning than frequentist stats. Let us hope that statistically not-so-literate Pythonistas implementing frequentist methods write good test cases, which is fortunately a common Python habit.

      2. I’m an avowed frequentist 🙂 so maybe I’m not the best to comment on how R does on Bayesian methods. However, I know many sophisticated, hard-core Bayesians who happily use R for that purpose. Have you seen the CRAN Task View on Bayesian Methods, or for that matter, the books on Bayesian analysis in R?

        I have to apologize in advance, because you’re not going to like what I say about machine learning. 😦 I regard ML as a generally unprincipled revival of the old statistical topic of nonparametric curve estimation. It’s not that I don’t like that old topic–quite the contrary–but my choice of the word unprincipled here was deliberate (albeit harsh). There is SO much misinformation floating out there on this material, and an appalling lack of understanding of what these methods are really trying to accomplish. Again, I’m sorry to say so, but my earlier wisecrack on computer scientists was not entirely in jest. Of course, no implications meant for anyone in this thread, just a general observation, but as some readers here may be aware, there is a feeling among many in the statistics community, including me, that CS has somehow usurped the classification methodology area from statistics, and done it poorly at that.

      1. I’ve only looked somewhat at pandas and the related statsmodels, and all I can do is say that they look very thin to me.

        Consider linear regression, for openers. The output of R’s lm() is richer, and there are so many alternatives, in which one can do cross-classification, robust regression, quantile regression, the LASSO and so on.

        Ironically, R seems to beat those Python packages even at what they ought to excel at–class structure. The R class structure, even S3, allows the pieces to be put together in a seamless, uniform manner. I can call plot() and it will do something reasonable for the context, e.g. draw a scatter diagram if the arguments are two vectors, draw a series of residual plots if the argument is an lm() object, and so on. These then feed well into the graphics packages.
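        As a sketch of the dispatch mechanism being described (the “walk” class here is invented for illustration):

```r
# An object tagged with an S3 class; generics such as summary(), plot()
# and print() dispatch on this class attribute.
w <- structure(list(pos = cumsum(sample(c(-1, 1), 100, replace = TRUE))),
               class = "walk")

# Define a method for the summary() generic.
summary.walk <- function(object, ...) {
  c(steps = length(object$pos), final = object$pos[length(object$pos)])
}

summary(w)   # dispatches to summary.walk()
```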

        I may well be selling those Python data packages short, and admit to ignorance. But as a statistician, I want more than what they seem to offer.

      2. Pandas is great for data munging, but the indexing syntax is confusing at first as it differs substantially from R data.frames. The pandas data structures aren’t as neatly wrapped up as R data.frames either; they’re more like an R matrix() with lots of data.frame attributes – this is because, underneath, pandas DataFrames are actually numpy arrays. However, for *really* big data this numpy design is great; pass-by-reference is the default behaviour and numeric columns can be typed as singles, doubles, quads, etc. to save on RAM.

    2. +1. Couldn’t agree more. Hack around in R, do my exploratory data analysis in R, find all the cutting edge analysis in R, indeed develop my understanding of the problem at hand in R. But when I need to actually get it out there, nothing can hold a candle to Python.

  7. And what about graphics? R certainly has great graphics utilities, and I believe that was a selling point of R from the beginning. What are the graphics utilities like in julia and python?

    1. The graphics in R leave everything else (save Matlab) in the dust. In fairness, anybody who has a true understanding of publishing-quality graphics owes a massive debt, yes to Hadley, but mainly to the terrific quality (and speed) of base graphics in R. Just look at the default label spacing between axis annotation and the axis itself in Matplotlib: wrong. Too close. Untidy. Then do a simple plot in base graphics. Feel the quality. See the huge customizability, but crucially, you don’t actually need to customize because the defaults are good! This is where I agree with all the “R is designed by statisticians” posters. You can just see that communication of data is a very important part of the field, and that is made obvious in R.

  8. As a statistician I’ve been using R for years and when I started, it was a great improvement over SPSS and SAS. However, over the years I have had a sort of a love-hate relationship with it. The main problem with R is that it has been developed for short scripting. When you have tens of complex data sources, models and simulations and you need to glue them all together, you really start missing some higher level languages such as Java. When there are thousands or tens of thousands of lines of R code, things just get really messy. In this sense, I don’t see other scripting languages such as Python as much of an improvement. Also, when you’re doing something more complicated than simple regression, Python quickly runs out of libraries. (Ok, you can call R or its C libraries from Python or from other languages.)

    Now that there are many great libraries and features such as data.table, plyr, parallel, reference objects etc. (which were not available back in the days) to make things more efficient, some of the core features in R still suck. For example, why can’t R always show the line number when there is an error or warning, and why can’t you log a backtrace properly (or at least I haven’t found a way)? Other modern interpreted languages don’t struggle with this.

    There are also other annoyances, such as that R might convert the type of your object without asking you, and then you find this out when your program fails in some rare case during some long-run simulation. Or there might be a variable in the global environment that you forgot about and which should not be there, but you still use it within your function, and after a fresh start your program fails. Debugging someone else’s package can also be a big pain.

    1. Yep. While I love R, I also hate it a little for its drawbacks, (compared to the other languages I use). I have similar gripes with R: gotchas from implicit (silent) type conversion, poor exception handling/obtuse error messages and bad scope design.

  9. There are a lot of wrong assumptions here. The comparison is supposedly between R, Python and Julia, but what is actually compared here are libraries from R with libraries from Python!!! So, one can say that library X from R is better than library Y from Python, but nothing more.

    If you are a statistician you will not like this. From page 2 of the book “Optimal Estimation of Parameters” by Jorma Rissanen (see http://www.amazon.com/Optimal-Estimation-Parameters-Jorma-Rissanen/dp/1107004748 ) comes the following quote:

    “Very few statisticians have been studying information theory, the result of which, I think, is the disarray of the present discipline of statistics.”

    This state of disarray in statistics is reflected also in R language.

  10. I’m surprised that this article attempts to list parallel computations as one of R’s strong points. In reality, R’s parallel computation support is rotten.

    In fact, R does not support threads – at all. Instead, all it offers is brittle support for process forks. True, this is hidden behind a nice interface but it has an atrocious runtime overhead (`fork` essentially copies the whole working set, which is simply not practical for large data^1, and has a much bigger performance overhead than threads or other parallelisation mechanisms). It’s also, as I’ve said, brittle: I’ve had to de-parallelise several pieces of code because they would randomly segfault. This is **not** a rare occurrence at all – it happens frequently enough to make use of parallel processing in R unusable in practice. Needless to say my parallel computation was straightforward, pure R code, nothing in there could conceivably have caused a segfault on its own.

    Once your code segfaults it doesn’t really matter any more how nice the API is.

    ^1 I’m aware that Unix forks implement copy-on-write (like R) to avoid redundantly copying. However, all of this is moot once you start writing the results to a data frame or vector. – It’s very easy to run into memory issues, and I suspect that they (partially) explain the problems described above.

    1. You seem to be alluding to the section of R’s parallel package that comes from the older multicore package. I’ve never had those things occur on me (and your footnote 1 should have been in your text, as it is a major point), but I’ve not used it enough to say much.

      Overhead is indeed always an issue to consider. As I mentioned in response to someone else, you may wish to consider my Rdsm package, which does give a threads (-like) environment to the R programmer, with true shared memory, most importantly including the ability to WRITE to shared memory. There is overhead involved, but only as a one-time expense at startup.

      For non-embarrassingly-parallel problems, I believe that none of the languages we’ve been discussing–R, Python and Julia–can be made to work truly well. Interface to C/C++ is essential.

  11. Great post. I use both R and python, and like many others said, each has its own strengths. That said, I found writing a big S4 package really… icky. It felt so redundant, verbose, and picky. Overall, not fun. I haven’t tried the R5 classes yet. Maybe they’re better?

    For ease of installation and selection, you can’t beat CRAN and bioconductor! I’ve found installation of python packages can be pretty tricky, if you end up in dependency hell. Doesn’t happen in R land.

    1. I fully agree. I don’t like S4, and don’t write S4 code (though I use that written by others). I view “S5” (reference classes) as a better S4, but see my previous sentence regarding S4. Note that in my original posting I wrote that (a) Python’s class structure is much cleaner than R’s, and (b) I am NOT one of those who consider OOP a panacea/necessity.

    1. You may not be fully up to date on R memory issues. Recent versions of R have been transitioning to huge address-space capability.

      Moreover, you can use the CRAN package I’ve cited a couple of times in this thread, bigmemory. It even allows referencing objects that are stored in disk files instead of in memory. To my knowledge (which may be limited), there is nothing like that available for Python.

      So memory issues, which had long been the Achilles’ heel of R, have largely been resolved. I say “largely” because one can encounter memory problems in any language. I’ve had memory issues with large-scale simulations in Python, for instance, and even in C.

      1. To manipulate data on disk, using Python:

        >>> fileno = os.open(path, os.O_RDWR)  # or os.O_RDONLY
        >>> mm = mmap.mmap(fileno, length_bytes, access=mmap.ACCESS_WRITE)  # or ACCESS_READ
        >>> arr = numpy.frombuffer(mm, dtype, count, offset_bytes)
        >>> …  # do something with arr[0:count]
        >>> del arr, mm
        >>> os.close(fileno)

      2. R is written in C. Python is written in C. They both have access to OS calls. So, taken literally, anything one can do, the other can do too.

        But the issue is whether someone already HAS done it, i.e. has set up a turnkey package, ready to use. This then goes back to the libraries issue that I discussed in my original post.

  12. Far and away, for me, the worst aspect of R as a programming language is not speed. After my program works, it is great.

    But for new students, it is brutal. It is hard to figure out where something in a long function has not worked. By default, the error messages are both too uninformative and too late. It’s also pretty poor on orthogonality and “you get what you expect.”

    can I really inflict this on students taking only one course?

    1. Error messages are awful for most languages, including R, in my opinion, so we agree there.

      Concerning “new students writing long functions,” I’d like to hear more. If they are new to programming, why are they writing long functions? And in any case, why aren’t they taught to break up long functions? Do they use a debugging tool, even debug()?
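For what it’s worth, R’s debug() has a rough counterpart in Python’s standard pdb module; a minimal sketch (the average() function is just a made-up illustration):

```python
import pdb

def average(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)  # fails on an empty list

# Uncommenting the next line steps through the call line by line,
# much as R's debug() flags a function for single-stepping:
# pdb.run("average([1, 2, 3])")
print(average([1, 2, 3]))  # → 2.0
```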

  13. You say that R can be faster than Julia. It’s true: you can optimize in every language, restructure the code, or interface with Fortran/C/C++. But my feeling is that if you write a naive, quick-and-dirty implementation in Julia, it will be faster than Python or R in most cases.

      1. very nicely put 🙂

        I read a report comparing data mining algorithms; several people implemented the same 2 or 3 algorithms, I think all in C. The point was to see which algorithm was faster — it turned out that the *programmer* had much more influence on speed than the algorithm did.

    1. Sounds right. That said, Julia seems to be mainly targeted at the Matlab crowd, and I have a horrible feeling that its application to the world of statistics will suffer as a result. Moreover, the startup time is a lot slower than Python’s, which is a bit of an issue if you’re popping in and out of vi all the time (okay, yes, you can reload the code, but still…), and, to be fair, the REPL seems sluggish. My concern with Julia is that it’s just an evolution of some combo of Python/R/Matlab; it doesn’t really bring anything new to the table other than speed, which we all have ways of managing in the other three already. It doesn’t bring a killer feature to the table. Go’s concurrency model, for example, really gets me excited about that language, despite all its simplicity and non-existent libraries, while one of the purer functional languages (Clojure/Haskell) could well “stretch” my mind a lot more than Julia will. And to invest time and effort into a new language, there has to be something exciting about it, more than just “this will run faster.”

      1. When was the last time you started up Julia and evaluated the “sluggishness” of the REPL? That has been improved dramatically over the last 6 months. It’s also a little unfair to compare to Python’s startup time, since Python by itself is not that useful for doing any serious math – how many modules do you need to import and how long does that take before you’ve loaded everything you need?

        Your “ways of managing [speed] in” Python/R/Matlab mean writing C code. This puts up pretty significant barriers to entry for users who only know, or prefer to work in, a high-level language, making it hard for them to understand or contribute to the development of your performance-critical library code. Isn’t constantly having to switch languages a pain? Wouldn’t you rather be able to get performance without having to reorient your brain into this odd place of working in two separate languages, all while keeping in mind the strange peculiarities that arise at the interface between them?

        Aside from solving the “2 language problem” when it comes to performance, Julia brings together a lot of neat features that don’t work all that well in Python/R/Matlab – or at least they’re not common or easy to use in a first-class idiomatic way. Metaprogramming, multiple dispatch, and parallelism are immediately accessible, no hurdles to jump through. User-created types are first class, simple to create, and perform just as well as the built-in types. Defining new mathematical objects and their behaviors is natural, concise, easy, and high performance thanks to multiple dispatch and the clever conversion and promotion system. Wrapping C and Fortran, when there’s a library you’d rather not rewrite, requires zero boilerplate (Python and R already do pretty well here, Julia manages to do even better). Scripting and managing external processes is as clean as in Bash. The package manager is distinctly 21st-century, built around Git, and works great on all platforms. Better tools for managing binary dependencies than I’ve seen anywhere else (not all the packages are using them yet, but we’ll fix that). Trust me, the syntactic resemblance to Matlab is superficial, the similarities end there.

        If that’s not compelling to you, you’re more conservative and reluctant to change than I am. That’s fine, but most people who’ve tried working in Julia have found it to be a very simple switch from whichever decades-old high level language they previously preferred. R is still a singularly good platform for doing statistics in, but not much else. Given a few more years for Julia’s library ecosystem to mature and catch up, what can R do that Julia can’t, that would cause “its application to the world of statistics [to] suffer?”

        In my biased opinion, the killer app where Julia already outclasses its competitors with truly unique functionality and performance is in the area of operations research. Constrained optimization and linear/nonlinear/integer programming work beautifully thanks to the packages being developed in the JuliaOpt organization, particularly JuMP. That community has several dedicated single-purpose tools, like AMPL, that are even worse than R as general-purpose languages, so my prediction is that field will be the first in which Julia transitions to being the dominant single best choice. Statistics may take longer for those of you who are happy to be patient with R’s runtime performance, or with constantly switching to and from C.

  14. I’m not sure the comparison here quite captures the difference between Python/R and Julia. Julia was designed to be a LLVM-compilable dynamic language. Python and R were not: they were designed to be interpreted, with attempts at acceleration added on afterwards. So regardless of how Python and R perform now (mostly by deferring the bulk of the computation to C/C++), the question will be: supposing you have some new data-intensive analysis code to write. Will you write it in C/C++ so you can call it from Python or R, or will you write it in Julia and get adequate performance natively? (Or would you be better off writing it in a strongly-typed higher-level language where compilation is more effective, like Haskell or Scala?)

    If I can write even fiddly algorithmic code that runs fast in language Y, and Y also makes it reasonably easy to do fast prototyping and other things, then I’d be more inclined to write new routines in Y instead of a language where I have to write in C/C++. Eventually one would expect Y to catch up even with Python’s numerics and R’s statistics, if Y had a sufficiently compelling profile. Whether Julia is that Y-language remains to be seen, but I think it is a mistake to compare R/C++ and Python/C++ to Julia without fully recognizing the C++ part of the previous two.

  15. Awesome post! Thank you so much for sharing.

    For those who want to learn R programming, here is a great new course on YouTube for beginners and data science aspirants. The content is great and the videos are short and crisp. New ones are being added, so I suggest subscribing.

  16. For this kind of comparison, it would be helpful to include the Julia code you’re comparing to. Who knows if it was well written or not? Also note that it’s not like you *can’t* write vectorized code in Julia – you just don’t have to. In order to have some code to compare and get some hard numbers, I wrote the simplest vectorized and iterative versions of this binary random walk code in Julia that I could and compared them to R on my system:

    rw_vec(n) = cumsum(2randbool(n) .- 1)

    function rw_itr(n)
        a = Array(Int, n)
        s = 0
        for i = 1:n
            s += ifelse(randbool(), -1, 1)
            a[i] = s
        end
        return a
    end

    In R 2.14.2, rw(1000000) takes a minimum of 0.02 seconds – sometimes it’s 2x or 3x that, probably because GC kicks in. The vectorized Julia version, rw_vec(1000000) takes 0.01 seconds minimum – twice as fast as R – but often it’s 2.5x slower than that, also because of GC. The iterative Julia version, rw_itr(1000000) takes 0.005 seconds. That’s twice as fast as the vectorized Julia version and 4x faster than R. It also allocates much less memory than either one – just the output array. For 100000000 steps, R takes 2.717 seconds, the vectorized Julia version takes 1.52 seconds, and the iterative Julia version takes 0.6 seconds.
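For readers following along in Python, a vectorized version of the same ±1 random walk (my own sketch, not part of the benchmark above) looks much like the R and Julia one-liners:

```python
import numpy as np

def rw_vec(n, seed=None):
    # Draw n random booleans, map them to +1/-1 steps, then take the
    # running (cumulative) sum, as in the vectorized versions above.
    rng = np.random.default_rng(seed)
    steps = 2 * rng.integers(0, 2, size=n) - 1
    return np.cumsum(steps)

walk = rw_vec(1_000_000)
```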

    1. Interesting numbers, though as you say, they’re all over the map.

      I don’t have the Julia version for the test I reported, as I simply gave my R version to the speaker whom I mentioned, and she ran comparisons between it and her Julia code. It would be interesting to see how your numbers change with a much newer version of R.

      R itself may be a moving target in terms of execution speed, due to projects aimed at refactoring it, e.g. the R JVM project.

      1. I just installed R 3.1.0 and timed your rw code again: rw(1000000) takes 0.05 seconds on my system – that’s a 2.5x slowdown relative to R 2.14.2. This leaves the most recent R version 5x slower than vectorized Julia and 10x slower than a straightforward iterative Julia version. R’s performance does seem to be a moving target, but in this particular instance at least, it’s moving in the wrong direction.

        It will be interesting to see what comes of projects like R JVM. Jan Vitek (one of the authors of the paper you linked to) graciously invited Jeff Bezanson and me to speak about Julia at SPLASH last year, and Jeff attended the associated DALI workshop “on dynamic languages for scalable data analytics” [1]. There was a lot of interest in various efforts to try to make R faster. However, it seems to me that Python has already been down the path that R is starting on, and despite high expectations, most of the hopes for broad performance improvements in Python have not panned out. I don’t, unfortunately, think that this is an accident – these attempts have all faced, whether they knew it or not, what I’ll call “the PyPy problem”. The PyPy problem is a catch-22 that faces projects like PyPy in language ecosystems that include a lot of libraries implemented in C. To make the language drastically faster, you have to change the internals significantly, but since the internals are what many libraries interface with, they are effectively part of the language and cannot be changed very much without breaking those libraries. In other words, because all the performance-critical code in Python has to be written in C, you can’t change the C API without breaking all the performance-critical libraries. But without changing the C API, you can’t change the internals very much, so you can’t make the language go much faster. This is why attempts to improve the speed of Python tend to fall into two categories:

        1. Projects like PyPy that can get impressive, broad performance gains, but are of limited practical utility because they force you to jettison libraries you need – not all libraries, of course, but specifically the high-performance ones that had to be written in C. If you needed Python to be faster, it’s problematic to lose all the high-performance libraries.

        2. Projects that allow very narrowly targeted performance improvements to specific regions of code without affecting the way other code works. Examples include Cython, Theano, NumExpr, and Numba. These systems require special annotations and usually only support a few native types. They often change both the syntax and semantics of your code, not always in well-defined ways.

        What you don’t see are dramatic, across-the-board performance improvements that transform the ecosystem like what happened to JavaScript when Google introduced the V8 engine in Chrome. What made JavaScript different from Python and R was a complete lack of native library lock-in: it wasn’t *possible* to write libraries or extensions in anything besides JavaScript, so the V8 team was free to do any crazy thing they wanted with the language implementation, so long as it still behaved like JavaScript. They certainly did some crazy, brilliant things, and JavaScript has never been the same since.

        R and Python are not like this: writing libraries and extensions in C is not only possible, it’s standard – especially when performance matters. PyPy manages to make Python itself go faster, but it remains incompatible with NumPy, let alone the rest of SciPy, despite years of effort [2]. What progress has been made towards compatibility seems to be entirely piecemeal, accomplished by porting individual NumPy functions to PyPy one at a time [3]. Clearly, that approach won’t scale to the rest of the Python ecosystem. In terms of performance enhancements, R currently seems to be where Python was in this process about ten years ago, but possibly with even more native library lock-in. Last year, recognizing “the PyPy problem”, Alex Rubinsteyn started a project exploring how much faster you can make Python without breaking C API compatibility [4]. The disappointing conclusion he came to is that you “might get up to 3X faster [but] more often, the gains hover around a meager 20%” [5]. Since R’s interpreter is not typically as fast as Python’s, it may be possible to gain more than that, but my suspicion is that neither R nor Python is ever going to get close to C unless they shed all their native libraries. Of course, if you’re willing to shed all those libraries for better performance, one has to wonder: wouldn’t it be better to gain more than just speed and start afresh with a new language?

        [1] http://splashcon.org/2013/program/workshops/1105-nsf-dali-workshop-on-dynamic-languages-for-scalable-data-analytics

        [2] http://morepypy.blogspot.com/2014/04/numpy-on-pypy-status-update.html

        [3] http://buildbot.pypy.org/numpy-status/latest.html

        [4] https://github.com/rjpower/falcon

        [5] http://www.phi-node.com/2013/06/how-fast-can-we-make-interpreted-python.html

      2. Thanks for the interesting comments, which make sense. What seems NOT to make sense is a 5X slowdown for R-3.1 from R 2.14. This sounds like something worth tracking down!

        By the way, have you tried byte-compiling? I would guess it wouldn’t matter, since the code is vectorized, but just for fun it would be interesting to try.

        In case my original posting wasn’t clear, I certainly would stipulate that Julia is faster on some apps. I merely wanted to point out that the implied claim of universal speed dominance is false. I noted that (a) speed for data manipulation in R is now good with data.table and dplyr, and (b) for real speed one needs parallel computation, which at present R does better than the other two.

        I also noted the availability of libraries, which for most users is more important than speed.

      3. @Stefan

        So instead of inventing a new C API for Python (which would be implementable by everyone), you’re inventing a new language in which you have to reimplement everything anyway?

      4. @Foo the C API is far from the only problem with Python. There are certain language design issues that limit performance of any Python implementation, no matter what the C API ends up being. (See https://groups.google.com/forum/#!topic/julia-dev/B8AkQVMW6B0 for some recent discussion on this.) Given that changing the C API means the entire package ecosystem would need to be updated, you may as well fix the language design issues that get in the way of performance while you’re at it. At that point what you have isn’t really Python anymore.

        And if you like your existing language, you can keep it. Python and R aren’t going anywhere yet, and Julia’s library support and user base are still young and nowhere near as comprehensive. Members of the Julia community are well aware of this, and have created packages specifically to simplify the interoperability of Julia with Python and R, to leverage the huge existing codebases.

        There’s no need to be closed-minded as if language choice is an either-or proposition. Sometimes the best tool is a combination of multiple pieces working together in new ways (all of these languages and numerical libraries are implemented in a combination of C and Fortran, after all). One thing I’ve found wonderful about Julia is its pace of development and how low the barrier is between using the language and contributing to it. How often do you find yourself identifying *and* writing the patch that fixes a bug or adds a new feature in R or Python/NumPy/SciPy? Or when some R or Python package doesn’t build properly on Mac or Windows or whatever platform you’re using, is it easy to fix the problem, or do you decide to make do without that package and spend a bunch of time rolling your own solution?

  17. A clarification related to the authors of statistics in Python:

    Statsmodels was initially written by a statistician (Jonathan Taylor at Stanford). Until recently most large contributors and the maintainers came from an economics/econometrics background. An academic statistician recently contributed Generalized Estimating Equations and Linear Mixed Effects models. Other contributors do come from various science areas.

    I didn’t understand GEE until I figured out the connection to generalized method of moments and found a statistics analog in Quadratic Inference Functions. It took me a while to understand multiple testing problems, but I managed. https://twitter.com/ciamk/status/459091889393008641

    However, in my opinion one of the main points is that we use standard software development practices and have almost all our results verified against R or Stata.

    I usually stay out of language debates. If you need R, use R.
    statsmodels is not written **for** statisticians (at least not mainly). We are still missing a very large number of statistical models and tools. But between scikit-learn and statsmodels we are getting to a stage where many or most users can find most or all they need.
    (statsmodels is only 5 years old, and compared to when I started, I’m actually pretty happy that there is even the discussion of Python versus R. I thought it would still take a few more years to get into the “fight”.)

    I’m coming from a Matlab and Gauss background and still don’t know R very well. The main things I was missing when I had to go back to programming in Matlab, after working for a while with Python, were namespaces, large class hierarchies, and the fact that everything in Python is a reference.

    Statsmodels has about 100,000 lines of code (including unit tests and sandbox) and an elaborate class hierarchy. I wouldn’t want to do this without a full OO programming language.

    (I’m one of the two maintainers of statsmodels and a code reviewer for scipy.stats.)

    1. Well, someone had to do it, Carl. 🙂

      Yes, it’s my understanding (?) that some of the approaches in pqR are being considered for folding into R.

  18. My only point is that the title is misleading. R may be better than Python for statistics. But Python is much, much more than statistics, and data science is much more than statistics. So why should R be better than Python for data science?!

  19. My background weighs more toward statistics and engineering than programming. I find R very simple to use and learn, but have a tough time with Python; I just cannot digest Python’s way of programming. However, I encounter a lot of bugs in R, and less help online, when dealing with APIs, parsing, JSON, XML, web scraping, and other non-statistical tasks.

    Will R ever match-up with Python in these areas?

  20. Matloff wrote: “The actual question posed was whether Python or Julia would replace R in terms of popularity in data science. The answer I asserted is no.”

    I beg to differ for the following reasons. And, I’m just comparing Python and R here, from a ‘systems’ perspective.

    (1) Firstly, we are comparing library functions provided by R and Python here, and not the actual languages themselves. (In an actual language comparison Python is clearly the winner, as it is a ‘proper’ language. But we digress.) R does have a large number of library support functions provided by users for statistical inference. Python is trying to catch up. While R does have thousands of those support functions, realistically many of them are just variants of each other, or different implementations of the same thing provided by different people. The ‘core set’ of functions that are of immediate use in predictive modeling is much smaller for all practical purposes. Once Python has that minimal support, R shall become less attractive, IMHO.

    (2) ‘Data science’ is no longer a statistician working on a small set of data in an isolated environment. Talking to databases, message queues, storage systems, web servers, application servers, distributed code execution, etc., are part of the overall platform of a modern commercial entity. Python has much better support for these tasks compared to R.

    (3) I can’t overemphasize the importance of object oriented design in a large software environment and I tend to believe that Python beats R here also.

    (4) As mentioned above, judged purely on the merits of a language (and not library functions) Python is better. Once that GIL issue is resolved, Python would be a pretty solid language as far as data structures, OO support, multi-threading, compilation, dynamic binding, etc. are concerned.

    (5) Weak exception handling and the lack of proper debugger support also make R less attractive than Python. R doesn’t even have a proper IDE; RStudio is not impressive.

    (6) I’m not fully sure to what extent one can compile R code and hand that off to somebody, as opposed to source code. However, that is not a problem with Python. And, for protecting commercial interests and IP, that is useful.

    The way I see it is that R is a competitor to SAS, but not to Python. I don’t see any reason for not adopting Python as a sole language of choice for data science work if steady progress continues to be made in building Python’s statistical library support.

    1. In my original posting, I highlighted the R libraries, which I believe can’t be ignored. Yes, there is some semi-duplication (though not as much as you seem to imply), but as a “consumer” I revel in the myriad of choices CRAN gives me. When I needed a nearest-neighbor routine, for instance, I had several to choose from, and selected one that really fit what I wanted to do. For many people, C++ = STL, and for many R users, R = CRAN.

      Concerning the languages themselves, I stated from the outset that Python is the cleaner, more elegant language. However, I could write my own Annoyances book on Python, e.g. the way variables in modules are handled.

      And unfortunately, I disagree with you regarding OO. I’ve never bought into that OO craze. I use it myself, but IMHO it often does more harm than good, making code harder to write, and harder to maintain. What’s left? I do like Python’s OO structure, as I said, and I use it to some degree myself when I write Python code, but not obsessively.

      Sorry, but I’m not an IDE fan either. For you, though, if you find RStudio to lack the sophistication you need, I urge you to contact the RStudio people. They are top-notch, creative programmers, and a really nice bunch of people who are open to suggestions. You can also try the Emacs package ESS, with an excellent debugger, or, if you are an Eclipse fan, StatET is a very nice package.

      You can compile R routines to byte code, like Python, and thus not worry (much) about the IP issue.

      1. Thanks for reply. I guess we shall have to agree to disagree on OO design and programming.

        An important aspect that I did not mention before is that many R programmers follow the ‘batch’ model of coding – i.e., run a program against some data that is just sitting there. However, these days commercial systems that process billions of events per day are also designed with ‘online’ processing in mind – i.e., processing data as it comes in (much like Apache Storm, for example). I’m not sure how one can use R properly and most effectively in such a setting. Online processing is a very important aspect of data analysis, together with batch processing.

        If one intends to do isolated statistical analyses then R is great. However, when it comes to making a modern, online, automatic data science and analytics platform, I just don’t see R being able to do the job compared to Python. It is not equipped to talk to other components that define a full system in a way Python does (i.e., databases, message queues, storage modules, web servers, application servers, distributed execution modules, etc.) Some companies, such as Revolution Analytics, are trying to bridge this gap though with the integration of R and Hadoop. But, IMHO it still doesn’t cut it.

        Bottom line: until Python’s statistical libraries take over from R :-), use mostly Python, with some R as a stopgap measure.

  21. Again, I think we’re straying from the main topic I brought up. But it seems that you may not have fully looked into R. There are lots of ways to communicate with other programs, for example. You can set up pipes, for instance, and there are various higher-level mechanisms. There is even a nice R interface to MQ.

    And rmr is not the only way to use Hadoop from R (though on the other hand, it’s not clear yet whether Hadoop itself has staying power).
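To make the pipes point concrete: since (as noted earlier in this thread) R and Python sit on the same OS services, the Python side looks much the same; a minimal standard-library sketch (any external command would do in place of tr):

```python
import subprocess

# Pipe text through an external program and capture its output,
# analogous to R's pipe()/system2() mechanisms.
result = subprocess.run(
    ["tr", "a-z", "A-Z"],        # upper-case the input
    input="hello from a pipe\n",
    capture_output=True,
    text=True,
)
print(result.stdout)  # → HELLO FROM A PIPE
```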

  22. Prof. Matloff, we are straying from the main topic only if you narrowly define ‘data science’ to be isolated batch processing of some static data. However, outside of an academic setting, this narrow focus doesn’t get the job done. For building a modern ‘data science’ platform, one has to look at and integrate all required components. Furthermore, I have looked sufficiently closely into R. In fact, I do use it together with Python for data analytics. However, I am patiently looking forward to the day when I can do my work in Python alone and not need to drop down to R.

    1. Furthermore, yes you are right that there are some tools available for R to talk to some of the peripheral components that I mentioned. However, many are experimental and more like proof of concept in nature. For commercial, high volume work, one has to resort to industrial strength tools, and that is where IMHO Python has more strength.

      1. Hard for me to say, as I haven’t really done any sustained high-volume communication among components. As I mentioned before, R and Python are using the same OS services, so in principle there should be no difference, say with pipes, but it’s certainly possible that there may be memory issues when a system is strained. I’ve mentioned earlier in this thread that I’ve experienced such issues with Python too, in the context of very large-scale simulations, so it’s fragile too. Is R even more fragile? Probably no one knows.

  23. This should be pointed out: Julia’s parallel programming support is excellent and far beyond Python’s in every way.

    There is no comparing this support between Python and Julia. Python had it bolted on (poorly), while Julia was built for it from the ground up as a first-class feature.

    Somehow, you’ve gotten a wildly off-base impression of concurrency in Julia, and I encourage you to check it out in detail.


    1. According to your link (which I had also looked at for an update last week before making my original posting), Julia parallelism is through coroutines. Coroutines are concurrent but not parallel–only one coroutine can run at a time. So, Julia has the same problem as with Python’s GIL. By the way, if one actually wants to have coroutines, Python does that quite well, through generators.

      1. You must’ve been reading selectively, and missed “Starting with `julia -p n` provides n worker processes on the local machine.”

        Only one coroutine can run at a time *in a single process.* However starting multiple Julia processes, adding more at runtime with addprocs(), communicating and sharing work between them is straightforward.

      2. I believe you missed the ‘one process per core’ bit from that page, which takes the coroutine model and adds true concurrency across multiple processes and even multiple machines for free.
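For comparison on the Python side, the rough analogue of Julia’s per-core worker processes is the standard multiprocessing module, which sidesteps the GIL by running separate interpreter processes (a minimal sketch, not from the discussion above):

```python
from multiprocessing import Pool

def square(x):
    # Each call may run in a different worker process, so no single
    # interpreter's GIL serializes the whole computation.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four worker processes
        squares = pool.map(square, range(8))
    print(squares)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```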

  24. Just wanted to chime in on numpy/pandas being good for financial modelers…

    As someone who has a new blog dedicated *to* financial data analytics/trading/backtesting/etc. (*shameless plug* =P): well, Wes McKinney is no longer at AQR, and while I can’t speak to Java/C#/C++ and their free, open-source financial libraries, R’s financial libraries are created by people working in quantitative finance at the highest level (whom I also have the pleasure of knowing personally – they’re awesome people). And if you’re doing large-scale financial data analysis, there are some very dedicated software systems (not free) from proprietary vendors created for heavy lifting, e.g. Deltix.

    From personal experience with Python, the thing that always stopped me cold was that I simply didn’t know how to make Python automagically find and install libraries like I could with R–that is, import numpy didn’t automagically find it on the internet and install it. And I’m not sure what the go-to Python IDE is, either. From what I’ve seen with R, RStudio is just plain amazing, in that it does everything I need it to, at once, and it’s used in industrial-strength applications in at least one trading firm.

    So yeah.

    1. If you like RStudio, then in the Python world you’d probably like Spyder (https://pythonhosted.org/spyder/) or Pydev (http://pydev.org/). Spyder is more geared to scientific computing/interactive data analysis and is influenced heavily by MATLAB. Pydev is more for apps/Django coders, but it’s certainly slick as it’s built on Eclipse. Eclipse can run circles around the IDE powers of RStudio. Over the years the Eclipse universe of development products has had $800M of development thrown at it (http://asmarterplanet.com/blog/2011/11/ibm_and_eclipse_10_years.html). RStudio is still very new, but it’s impressive and continues to improve rapidly. For years I have used StatET (R IDE for Eclipse; http://www.walware.de/goto/statet) but with the progress in RStudio I am starting to consider whether I should move my development over. At the moment, I mostly miss aspects of StatET in RStudio and increasingly I find myself missing aspects of RStudio in StatET.

      For Python automagic library installation, try PyPI, the CRAN of Python (https://pypi.python.org/pypi). Installing a library is as easy as `pip install <package-name>`.

  25. The idea is that Julia runs one process per core, which communicate through a nice message passing and remote call interface that looks very easy to work with.

    This isn’t quite as polished as say Erlang’s model (which I work with extensively and love) but it’s miles ahead of Python, and not just because you don’t need to manually manipulate queues to pass messages around.

  26. Thanks to all for their interesting comments! As I mentioned in my original posting, I chose the title in order to provoke comments, and was gratified to see so many of them.

    Some readers are still posting comments, but I probably won’t reply, having said all I have to say on the subject for now. However, I’d like to summarize the discussion, as I see it (note the qualifier).

    As I expected, there are lots of rabid Python fans out there, and a few of the Julia persuasion (though fewer of the latter than I had expected). Again, from my point of view, there seems to be rather little evidence counter to the main theme of my original posting, which was that R won’t be replaced in its current data science role anytime soon.

    I very much like how one poster put it, which was that Julia seems headed to become another Matlab. As long as Julia has, say, the solution of PDEs as its model app, I just don’t see it becoming really strong as a language for data science.

    And yes, I do indeed consider data science to consist of more than statistics. But I consider statistics to be more than just “statistics” too, and when I said that R was developed by statisticians for statisticians, I did mean to include data manipulation. Most statisticians spend as much or more time on the latter than on formal statistics.

    I definitely am including Big Data in that. R is far from perfect for that realm (coincidentally, I’ve encountered some long-vector errors in the last few days), but no other language handles it very well either, and R is certainly headed in the right direction there.

    I stand corrected on the issue of parallelism in Julia. I have not tried it yet, but I must say that personally I’ve never liked the message-passing paradigm (I prefer the shared-memory model, even in the case of DSM), and in any case, I had assumed that at the very least someone would develop a Julia interface to MPI, as R and Python people have done. Given a choice of message-passing in Julia and DSM with Python’s “multiprocessing” module, I’d much prefer the latter, but would prefer the R tools even more.
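
    Since the thread contrasts Julia-style message passing with Python’s shared-memory facilities, here is a minimal sketch of the latter, using the standard `multiprocessing` module. The parallel-sum task and all names here are invented purely for illustration:

    ```python
    from multiprocessing import Process, Value, Lock

    def add_chunk(total, lock, chunk):
        # Each worker computes a partial sum, then adds it into
        # shared memory under the lock.
        s = sum(chunk)
        with lock:
            total.value += s

    if __name__ == "__main__":
        total = Value("d", 0.0)   # a shared double, visible to all workers
        lock = Lock()
        data = list(range(100))
        # Split the data into 4 interleaved chunks, one per worker.
        procs = [Process(target=add_chunk, args=(total, lock, data[i::4]))
                 for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(total.value)        # 4950.0, i.e. sum(0..99)
    ```

    The workers communicate only through the shared `Value`, not by exchanging messages, which is the shared-memory style referred to above.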

    Thanks again for a great discussion.

    1. A little short-sighted to say that Julia has only a single “model app.” It doesn’t. If anything, data manipulation, as you say, is the common task that all scientists, engineers, statisticians, quants, etc. spend much of their programming time dealing with. R is a messy, inconsistent, often slow language that fewer and fewer people are going to bother learning now that better alternatives are around; and yes, I mean better at data manipulation, not even getting into other non-statistical calculations or tasks. Python’s nicer and still growing, but will never fully overcome its performance problems or the two-language issue, and it desperately needs a real type system.

      Someone has developed a Julia interface to MPI (https://github.com/lcw/MPI.jl). It doesn’t look particularly mature or widely used yet, but it’s absolutely possible.

      Thanks for provoking the discussion! I look forward to seeing the retrospective on where things stand in a few years’ time.

  27. “Ditto for computer scientists [except for the intellectual part 🙂 ].”

    Funnier is the following logic:

    “R is written in C. Python is written in C. They both have access to OS calls. So, taken literally, anything one can do, the other can do too.”

    At least, http://en.wikipedia.org/wiki/Argument_from_fallacy can save your claim, but not your logic 🙂 .

    Knowing the right order of problems to reduce to is not that hard and wouldn’t require too much intellect [except out of respect for computer science work :-)]


  28. I have also been a long-time R and Python user, and I have to say that although I like R for small and interactive data analysis jobs, writing larger, robust programs in R is very painful. Among all the problems (text processing etc.) that have been discussed, what I hate most is the default type conversions. For example, apply(…, table) would normally return a list of the return values of table, but can magically decide to return another type if those return values are trivial. read.table by default changes the header of the input file, and data.frame produces different column names for different input types. All these little things make it extremely difficult to write robust programs, because you have to constantly modify them for corner cases. In comparison, Python is a much cleaner and more predictable language to work with, and can be used to produce higher-quality applications.

    1. I strongly agree with you regarding the type annoyances. However, I disagree that that makes R less robust.

      In R, if one has a type mismatch, this generally will produce an execution error. Not nice, but it’s a lot better than Python, where misplaced indentation can easily lead to UNCAUGHT errors. Better caught than uncaught, don’t you agree?
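
      A contrived example of the kind of slip meant here (the function and data are invented for illustration): one misplaced level of indentation silently changes the answer, and Python raises no error at all:

      ```python
      def count_positives(xs):
          n = 0
          for x in xs:
              if x > 0:
                  n += 1
          return n          # correct: return after the loop finishes

      def count_positives_misindented(xs):
          n = 0
          for x in xs:
              if x > 0:
                  n += 1
              return n      # over-indented: returns after the FIRST element

      print(count_positives([1, -2, 3]))              # 2
      print(count_positives_misindented([1, -2, 3]))  # 1 -- wrong, yet no error
      ```

      Both versions are syntactically valid, so the bug surfaces only as a wrong result, never as a caught exception.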

      In terms of text processing, I do it in R all the time with no problem or annoyance. I’d be curious to hear what specifically you object to.

      1. Python-vs.-R arguments regarding robustness are kind of like pots and kettles arguing. The big gulf is between languages typed at compile time and those typed at runtime. Various compile-time-typed languages are nonetheless high-level (Scala, Haskell, Rust), and they all provide much greater assurances than Python or R that you understand the structure of your data.

        Both Python and R can carry along inappropriate data quite far through processing, and both will fail with errors at runtime when things get untenably bad. The details differ somewhat, but mostly that just means that you need to be familiar with the failure modes of each language to use it comfortably.

      2. Your last sentence hits the nail right on the head. I certainly agree.

        However, I generally react rather negatively when I see people (not you) make it sound like strong typing is a necessary and/or sufficient condition for code robustness.

      3. The examples I gave were not only about type annoyances, but also about the over-cleverness of R. For example,

        apply(matrix(c(rep(1, 10), rep(2,10)), ncol=4), 1, table)[,1]

        works, but

        apply(matrix(c(rep(1, 10), rep(1,10)), ncol=4), 1, table)[,1]

        does not. I would expect a data frame with columns a and b from data.frame(a=X, b=Y), but the column name can be, for example, b.d if Y is another data frame (in that case I would expect Y to be converted to an unnamed array). Such things are everywhere in R, to the point that I simply do not trust that my programs can tolerate abusive use from other users (unless I try really hard to test them).

        As for text processing: Python, as a general-purpose programming language, is in my opinion more natural and expressive for general text and string processing. For example, it is as trivial as “if a in b” to test whether a is a substring of b in Python, but I have to use something like grep in R. I believe that everything can be done in R, but I have found it tedious enough to process complex text and file formats in R that I usually turn to Python for those tasks and leave only the data analysis parts to R.
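
        For concreteness, the Python idiom in question, with the rough R counterpart noted in a comment:

        ```python
        a = "data"
        b = "data science"

        # Python: substring membership is a built-in operator.
        print(a in b)        # True
        print("julia" in b)  # False

        # The rough R counterpart is grepl("data", "data science", fixed = TRUE),
        # or a small wrapper around grepl() for operator-like syntax.
        ```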

      4. As I said, I have my own pet annoyances with R.

        Since I started using Python before R (I did use the old S language, precursor to R, but not very much), I have probably spent just as much time coding in Python as in R. Whenever I’m coding in one, I find myself wishing it had features of the other.

        Having said that, though, for my purposes (note the qualifier), R makes me much, much more productive than Python does. This is so much the case that I rarely use Python these days, even for tasks that are not R’s forte.

        Python is widely assumed to be better than R for text processing, and it is indeed nice to write if a in b. But if things like the latter are that important to you, you can write a set of wrappers. And guess what! — I often find text-processing tasks in which R is superior to Python.

        Well, let’s face it. Discussions like this tend to devolve into “religious debates,” and often a topic amounts to the old “Is it a bug or a feature?” question.

      5. It would not be a religious war, because we both use Python and R and know their pros and cons. All I was saying was that Python is an excellent programming language that was not designed for data analysis and is not as good at it as R. R is powerful and very useful for data analysis but is too sloppy to be used for large and complex applications. One can choose to master one of them for one’s work, but I have found it easier to use both.

      6. I was talking about “religious wars” in general. Clearly, you and I mostly agree.

        One point I’ve mentioned before, but not in the current thread, is that the more work one has to do to code a given task, the higher the probability of errors. For large and complex applications, this gap in error rates, i.e. gap between the “safer” tool and the functionally more powerful but “less safe” one, likely grows nonlinearly. Potentially that effect could swamp other effects such as strong typing.

        This point is not taught in software engineering courses, as far as I know, and an amazing number of people (present company excepted) haven’t come to this realization on their own.

      7. R is sloppy and inconsistent. Consider this:

        m <- matrix(rnorm(24), nrow=6)
        dimnames(m) <- list(row_id=1:6, col_id=c('#(*)%$&#*', 9, 'colA', 'col*'))
        df1 <- as.data.frame(m)
        df2 <- data.frame(m)

        names(df1) # as.data.frame() doesn't silently change the names
        [1] "#(*)%$&#*" "9" "colA" "col*"

        names(df2) # data.frame() changes names as it internally calls make.names(), the generic column name cleaning function.
        [1] "X........." "X9" "colA" "col."

  29. Being strongly statically typed neither is necessary nor sufficient for robustness, but it helps. (It also helps with speed in some cases where the static typing is essential for correctness, though most of that can be recovered by doing global optimization if you didn’t actually use the dynamic features.)

    If you don’t need the help, then strong static typing occasionally gets in your way without providing much benefit. Personally, the smaller the project, the less I find I need the help. With huge projects, nothing seems to provide enough help. There’s an interesting area in the middle, though, where I at least find my productivity is much higher with statically typed languages, and where I don’t really find a great difference between R, Python, Matlab, etc. in terms of how my productivity drops off from the small-project case.

  30. Your thoughts on CRAN needing improvement are accurate. That’s why my colleagues and I set out to improve it last year by creating MRAN. It offers much improved indexing and search capabilities and is snapshot-enabled, so you can always look back at the state of packages on a given day, and download and install them from a snapshot date if you want to (using the `checkpoint` package, which was created to work with MRAN snapshots): https://mran.revolutionanalytics.com/packages/

    1. What I’ve said about CRAN is that it needs much better organization and indexing. Some of the Task Views are pretty good, while others are not. And I’ve suggested having a Yelp-style review feature.

    1. It’s not on GitHub, and I seldom update it. But what do you think needs updating? Python 3? Personally I don’t see the need for that, but I’d welcome people trying to convince me otherwise.

  31. Interesting article, although its implications about Python and Julia don’t necessarily seem justified (benchmarks?).

    You are correct in saying that the CRAN library is vastly superior to anything the others have, and the fact that most of R’s libraries are written in C and Fortran does give it an advantage. However, both other languages also have that capability!

    I would also like to point out that Julia is relatively new (both Python and R are far older!). Although still buggy, it simplifies a lot of the writing, and more importantly makes accessing other languages (and their libraries) trivial. This means that depending on what you want your code to do, you can always apply another language (be it Perl for string parsing, R for the CRAN packages, Fortran for speed and parallelisation, or Python for web interfacing).

    It is not trying to replace R for data processing, although once evolved, it may have certain cases where it would be far more beneficial to do so. This goes back to the original Python-vs.-Fortran debate many physicists went through, where, as it turns out, in the right hands and under the right circumstances, each is just as efficient as the other on an overall timescale.

    Either way if it is just pure runtime efficiency that we are challenging, then I’d vote Fortran hands down every time.

    1. Thanks for the interesting and sensible comments.

      As I said in my original posting, the fact that Doug Bates has become an advocate for Julia is quite enough for me to respect that language. Indeed, if I ever get some time, I’ll finally learn the language and probably use it to some degree.

      The main purpose of my blog posting was to object to all the hype about Julia (just like I object to the hype about Hadoop and Spark, about deep learning and so on). Julia may well have better potential for writing fast code, but the fact is that, for most people most of the time, speed is NOT an issue; the issue is convenience, i.e. fast WRITING of code rather than writing fast code.

      Yes, Julia is new and there isn’t a lot of library code available for it yet. But I don’t see how it could ever catch up to CRAN, both because of the huge head start CRAN has and because the Julia community doesn’t have many statisticians. I am NOT impressed by the Python libraries developed by the machine learning crowd, and I would guess that Julia will probably suffer the same fate.

      1. On that, for one, I agree. However, when it comes to machine-learning code I do tend to just use rpy2 in Python and use the CRAN libraries regardless (often with the addition of pandas for data frames). And I can see myself doing the same with Julia, where producing working code is simplified and I have the ability to pluck out whatever libraries I need without having to code in that specific language.

        One thing I would add, however, is that I think many other languages would struggle to produce a plotting library as comprehensive and neat as those in R.

      2. Regarding Python, I’m a longtime fan, due to its elegant structure, so I have nothing against it. As I said in my post, these days I tend to write even true scripting applications in R (it serves me much better in the latter than I would have guessed), and find the R-Python bridges clunky.

  32. With R you need to use specialized packages such as data.table or dplyr, with different syntax and issues. The advantage of Julia is that you get a very fast language out-of-the-box. On the other hand Julia is still beta and it seems that it will improve even more.

    1. I continue to believe that the vast majority of users need functionality much more than they need speed. Julia, as far as I know, is designed by the applied math types, with solution of differential equations as their model, NOT data wrangling etc. I have no doubt that Julia will continue to improve in terms of speed and availability of mathematical operations, but I have serious doubts about its improving much for us data wranglers.
