I’ve written before about the Julia language. As someone who is very active in the R community, I am of course biased, and have been (and remain) a skeptic about Julia. But I would like to report on a wonderful talk I attended today at Stanford. To my surprise and delight, the speaker, Viral Shah of Julia Computing Inc., focused on the “computer science-y” details, i.e. the internals and the philosophy, which I found quite interesting and certainly very impressive.
I had not previously known, for instance, how integral the notion of typing is in Julia, e.g. integer vs. float, and the very extensive thought processes in the Julia group that led to this emphasis. And it was fun to see the various cool Julia features that appeal to a systems guy like me, e.g. easy viewing of the assembly language implementation of a Julia function.
I was particularly interested in one crucial aspect that separates R from other languages that are popular in data science applications — NA values. I asked the speaker about that during the talk, only to find that he had anticipated this question and had devoted space in his slides to it. After covering that topic, he added that this had caused considerable debate within the Julia team as to how to handle it, which turned out to be something of a compromise.
Well, then, given this latest report on Julia (new releases coming soon), what is MY latest? How do I view it now?
As I’ve said here before, the fact that such an eminent researcher and R developer, Doug Bates of the University of Wisconsin, has shifted his efforts from R to Julia is enough for me to hold Julia in high regard, sight unseen. I had browsed through some Julia material in the past, and had seen enough to confirm that this is a language to be reckoned with. Today’s talk definitely raised my opinion of the language even further. But…
I am both a computer scientist and a statistician. Though only my early career was in a Department of Statistics (I was one of the founders of the UC Davis Stat. Dept.), I have done statistics throughout my career. And my hybrid status plays a key role in how I view Julia.
As a computer scientist, especially one who likes to view things at the systems level, Julia is fabulous. But as a statistician, speed is only one of many crucial aspects of the software that I write and use. The role of NA values in R is indispensable, I say, not something to be compromised. And even more importantly, what I call the “helper” infrastructure of R is something I would be loath to part with: things like the naming of vector elements and matrix rows, for instance. Such things have led to elegant solutions to many problems in software that I write.
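To make the “helper” point concrete for non-R readers, here is a rough analogue in Python’s pandas (illustrative names only, not the author’s R code) of what named vector elements and matrix row/column names give you:

```python
import pandas as pd

# In R one might write:  x <- c(a=3.1, b=2.7);  x["b"]
# A rough pandas analogue: a Series whose elements carry names.
x = pd.Series([3.1, 2.7], index=["a", "b"])
print(x["b"])  # look up by name rather than by position

# Similarly for row and column names on a matrix-like object:
m = pd.DataFrame([[1, 2], [3, 4]],
                 index=["row1", "row2"],
                 columns=["col1", "col2"])
print(m.loc["row2", "col1"])  # select by row/column name
```

The payoff is that code selects data by meaningful labels, which survive subsetting and reordering, rather than by fragile numeric positions.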
And though undoubtedly (and hopefully) more top statisticians like Doug Bates will become active Julia contributors, the salient fact about R, as I always say, is that R is written for statisticians by statisticians. It matters. I believe that R will remain the language of choice in statistics for a long time to come.
And so, though my hat is off to Viral Shah, I don’t think Julia is about to “go viral” in the stat world in the foreseeable future. 🙂
12 thoughts on “Latest on the Julia Language (vs. R)”
Handling NA values is a pain point in the Python numpy / pandas world as well. People seem to like using bit masked arrays for this.
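To make the masked-array idea concrete, a minimal numpy sketch (the values are illustrative):

```python
import numpy as np

# A masked array pairs the data with a boolean mask marking "missing" entries.
data = np.ma.array([1.0, 2.0, 3.0, 4.0],
                   mask=[False, True, False, False])  # second value is missing

# Reductions skip masked entries, much as R's mean(x, na.rm=TRUE) does:
print(data.mean())  # (1 + 3 + 4) / 3
```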
Thanks, Clark, I hadn’t known this. My exposure to numpy/pandas has been quite limited.
I should have clarified that R is not the only language that handles NAs. Scilab, for instance, not only does so but actually features several different types of missing value codes, if I recall correctly.
I’ve been doing scientific computing a long time – since the options were FORTRAN and assembler, in fact. Until Julia showed up, R was the closest language to how I think that was freely available. And it had just the right combination of FORTRAN and Lisp concepts, even when it first arose as a way to run all the S code in libraries and books.
But now I want to learn Julia, and what I’ve seen leads me to believe it will replace R for personal projects. After all, the “heavy lifting” these days is mostly being done “in the cloud”, mostly in languages that run on the Java Virtual Machine. I want the types, I want the macros, I want the speed and a few other things that Julia offers. Once the API client libraries are there in Julia I don’t see an advantage to R over Julia.
Sure, I’m terribly spoiled by RStudio, especially with new goodies like bookdown, sparklyr and flexdashboards. But for exploration I don’t need those.
There may be an increasing number of people like you, who need only computation. But again, I believe that most people who use R want more than just that. (And I don’t think the cloud is relevant to the point at hand.)
In terms of your being spoiled by RStudio (which by the way I do not use myself), there apparently is quite a nice IDE for Julia being developed, along with a nice debugger.
You may also want to check out Weave.jl as a replacement for something like R Markdown, and Escher.jl for interactive Web applications. In this vein, Genie.jl looks like it’s going to be great, and it would be topped off if these Web applications were able to interface (with interactivity) with Plots.jl.
There’s also the Jupyter notebook – unless that’s what you meant :). You can access Julia in that, much like Python and R.
No, I meant the Atom product. See the other reader comment.
I think it’s better to think of Julia as a language with a strong base for developing fast packages. Its design around type dispatch, with the language largely written in Julia itself, means everything is “first-class”: there is no performance advantage for the “Base” functions over anything you can make in a package.
This means that you can make anything: you can make your own numbers (we’re doing that with ArbFloats.jl, and others are doing it with things like DoubleDouble.jl), your own array types, your own linear algebra functions, etc. all within Julia.
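As a loose illustration of dispatch-on-type for readers outside Julia, here is a Python sketch using `functools.singledispatch` (Python dispatches on only the first argument’s type, whereas Julia dispatches on all of them; the function names here are made up):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # fallback method for types with no registered specialization
    return "something else"

@describe.register
def _(x: int):
    return "an integer"

@describe.register
def _(x: float):
    return "a float"

print(describe(3))    # dispatches to the int method
print(describe(3.0))  # dispatches to the float method
```

In Julia, user-defined types participate in this mechanism on exactly the same footing as built-in ones, which is the “first-class” point above.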
The question then becomes, why would NAs be part of Base? Base is about facilitating this package ecosystem and giving it the right tools to easily make fast/efficient code, but NAs are inherently type unstable. You want them in statistical code, but everywhere else you would normally avoid allowing them, since that degree of type instability would slow down the associated code. So NA support is best left to a package, and JuliaStats has accordingly embraced DataFrames.jl and NullableArrays.jl, which incorporate NAs (and nulls) in the stats routines, because of how useful we have found them in R.
I think that is the way to go forward. By the basic types not allowing NAs, you get good performance on everything that doesn’t require NAs (most non-stats applications). However, the stats routines can also dispatch onto these special types which allow for NAs, and there are data readers (for reading from databases, csv files, etc.) which will output these kinds of types, which means you get seamless use of NAs in stats routines. In time some of these types (like DataFrames) become quasi-canonical.
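The type-instability cost can be seen in a numpy analogy (illustrative only; Julia’s Nullable/NA machinery differs in detail): a float array can encode missingness as NaN and stay a fast homogeneous type, while mixing in a Python `None` forces a slow heterogeneous object array.

```python
import numpy as np

fast = np.array([1.0, np.nan, 3.0])  # homogeneous float64; NaN encodes "missing"
slow = np.array([1.0, None, 3.0])    # heterogeneous -> dtype=object, much slower

print(fast.dtype)       # float64 -- operations stay vectorized
print(slow.dtype)       # object  -- every element is a boxed Python value
print(np.nansum(fast))  # a reduction that skips the NaN entries
```

The Julia argument is the same in spirit: keeping missing values out of the core numeric types keeps the fast path fast, while opt-in container types carry the NAs where statistics needs them.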
Julia is so powerful that its package ecosystem is about making… anything. A package could be a new type for other packages to build around, or a package can provide macros which are actually a full language parser that automatically parallelizes your code (ParallelAccelerator.jl). Because of this, your thinking can change from “I wish ______ was in Base” to “I wish someone made a ______ package, and some functions which did _______ on a _______”.
Very nice explanation, and in fact very similar to what Viral said in his talk.
Was the talk recorded? I just see a description/abstract linked on the page as a short PDF.
I’m pretty sure the talk was not recorded. We were told that he had given the same talk at DSC a few days earlier, but I doubt it was recorded there, given the very private nature of that conference. However, I’m sure that Viral would be happy to send you his slides if you contact him.