Bad Coder, Bad Coder!

My title here is in the sense of “Bad dog, bad dog!”, a scolding I sometimes see dog owners use to tame their pets, and is also an allusion to Bad Reporter, a sometimes hilarious and always irreverent political comic strip in the San Francisco Chronicle. And my title is meant to convey my view that “good programming practice” rules are sometimes taken overly seriously.

I’ll comment on two aspects here:

  • Variable names, spacing etc.
  • “Side effects.”

The second is the more unsettling of the two, but the first is interesting because Hadley Wickham is involved, which always gets people’s attention. 🙂

What prompted this is that there is currently a discussion in the ASA Statistical Learning and Data Mining Section e-mail list on Hadley’s coding style guidelines.  Some of you may know that Google has also had its own R style guidelines for some years now.

Much as I admire Hadley and Google, I really think this is much ado about nothing. I do have my own personal style (if you are curious, just look at my code), but seriously, folks, I don’t think it really matters. As long as a programmer is consistent and has given careful thought as to how to be clear — not only clear to others who read the code, but also clear to the programmer herself when she reads the code cold two months from now — it’s fine.

I would like to see better style in programmers’ English, though. Please STOP using the terms “curly braces” and “square brackets”! Braces ARE curly, by definition, and brackets ARE square, right? Or, at least be consistent in your redundancy, speaking of “round parentheses,” “dotty periods,” “long dashes,” “big capitals,” “elevated overbars” and so on. 🙂

And now, for the hard stuff, side effects. R is a functional language, meaning among other things that no variable is changed upon executing a function, other than assignment of the return value. In other words,

y <- f(x)

will change y, but will change neither x nor any other variables, such as globals.
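For instance, here is a minimal sketch (this particular f and the vector v are made up purely for illustration):

f <- function(x) {
   x[1] <- 999   # modifies only the local copy of x
   x
}
v <- c(1, 2, 3)
y <- f(v)
y   # 999 2 3
v   # still 1 2 3; the caller's v is untouched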

For many in the R community, this restriction is followed with religious-like fervor. They consider it safe, clear and easy to debug.

Yes, yes, I understand the arguments, and the No Side Effecters have a point. But it is much too dogmatic, in my opinion, and often causes one to incur BIG performance penalties.

I am not here to praise Caesar, but I’m not here to bury him either. 🙂 If you want to subscribe to that anti-side effects philosophy, fine. My only point is that, like it or not, R is slowly moving away from the banning of side effects. I would cite a few examples:

  • Reference classes. R has always had some ways to create side effects for the antisocial 🙂 , one notable example being environments. Reference classes carry that loophole one step further.
  • The bigmemory package. Yes, it was developed as a workaround to R’s 32-bit memory constraints (somewhat but not completely resolved in more recent versions of R), but I’m told that one of the reasons for the package’s popularity is its ability to change data in-place in memory. This is basically a side effects issue, as traditionally the assignment to y above has created a new, separate copy of the data in memory, often a big performance killer. (Again, this too has been ameliorated somewhat in recent versions of R.)
  • The data.table package. Same comments here as for the “insidious” use of bigmemory above, but also consider the operation setkey(mtc1, cyl). Ever notice that no assignment is made? (See the sketch below.) If you rave about data.table, you should keep in mind what factors underlie that lightning speed.
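Here is a minimal sketch of both kinds of side effects; the Counter class is invented purely for illustration, and the data.table lines of course assume that package is installed:

# Reference classes: a method can modify the object in place.
Counter <- setRefClass("Counter",
   fields = list(n = "numeric"),
   methods = list(bump = function() n <<- n + 1))
cnt <- Counter$new(n = 0)
cnt$bump()
cnt$n   # now 1; the object itself was changed, a side effect

# data.table: setkey() sorts and keys the table in place; note no assignment.
library(data.table)
mtc1 <- as.data.table(mtcars)
setkey(mtc1, cyl)   # mtc1 itself is modified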

I won’t get into the global variables issue here, other than to say again that I think the arguments against them often tend to be more ideological than logical.

In summary, I believe that careful thought, rather than an obsession with rules, is your ticket to good — clear and efficient — code.


Latest on the Julia Language (vs. R)

I’ve written before about the Julia language. As someone who is very active in the R community, I am biased of course, and have been (and remain) a skeptic about Julia. But I would like to report on a wonderful talk I attended today at Stanford. To my surprise and delight, the speaker, Viral Shah of Julia Computing Inc., focused on the “computer science-y” details, i.e. the internals and the philosophy, which I found quite interesting and certainly very impressive.

I had not previously known, for instance, how integral the notion of typing is in Julia, e.g. integer vs. float, and the very extensive thought processes in the Julia group that led to this emphasis. And it was fun to see the various cool Julia features that appeal to a systems guy like me, e.g. easy viewing of the assembly language implementation of a Julia function.

I was particularly interested in one crucial aspect that separates R from other languages that are popular in data science applications — NA values. I asked the speaker about that during the talk, only to find that he had anticipated this question and had devoted space in his slides to it. After covering that topic, he added that it had caused considerable debate within the Julia team as to how to handle missing values, and that the eventual approach was something of a compromise.

Well, then, given this latest report on Julia (new releases coming soon), what is MY latest? How do I view it now?

As I’ve said here before, the fact that such an eminent researcher and R developer, Doug Bates of the University of Wisconsin, has shifted his efforts from R to Julia is enough for me to hold Julia in high regard, sight unseen. I had browsed through some Julia material in the past, and had seen enough to confirm that this is a language to be reckoned with. Today’s talk definitely raised my opinion of the language even further. But…

I am both a computer scientist and a statistician. Though only my early career was in a Department of Statistics (I was one of the founders of the UC Davis Stat. Dept.), I have done statistics throughout my career. And my hybrid status plays a key role in how I view Julia.

As a computer scientist, especially one who likes to view things at the systems level, Julia is fabulous. But as a statistician, speed is only one of many crucial aspects of the software that I write and use. The role of NA values in R is indispensable, I say, not something to be compromised. And even more importantly, what I call the “helper” infrastructure of R is something I would be highly loath to part with, things like naming of vector elements and matrix rows, for instance (see the small example below). Such things have led to elegant solutions to many problems in software that I write.
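To illustrate the kind of thing I mean, here is a quick sketch with made-up numbers:

# Named vector elements: access by name rather than position,
# with NA handled explicitly rather than silently.
medians <- c(arrival = 12.5, service = 3.2, wait = NA)
medians["service"]            # look up by name
mean(medians, na.rm = TRUE)   # explicit decision about the NA

# Matrix row and column names work the same way.
m <- matrix(1:6, nrow = 2,
   dimnames = list(c("ctrl", "trt"), c("a", "b", "c")))
m["trt", "b"]                 # access a cell by row and column name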

And though undoubtedly (and hopefully) more top statisticians like Doug Bates will become active Julia contributors, the salient fact about R, as I always say, is that R is written for statisticians by statisticians. It matters. I believe that R will remain the language of choice in statistics for a long time to come.

And so, though my hat is off to Viral Shah, I don’t think Julia is about to “go viral” in the stat world in the foreseeable future. 🙂

Student-Run Conference in Data Science

I’d like to urge all of you in Northern California to attend iidata, a student-run conference in data science, to be held on the UC Davis campus on May 21. According to the Web page,

iidata is a one-day, collegiate-level Data Science convention aimed at educating students in the new, thrilling field of data science. We welcome all students, regardless of background knowledge, so long as you have a mindset to never stop learning. The convention will consist of guest speakers, workshops, and competitions.

There is sure to be a lot of R there, including a workshop on R. I would particularly recommend that you attend the talks by the guest speakers from industry.

Stat professor Duncan Temple Lang and I will be judges in the data analysis competition. (See the above link for an ancient picture of me. 🙂 ) Of course, you need not participate in the competition in order to attend.

Talk on regtools and P-Values

I’m deeply grateful to Hui Lin and the inimitable Yihui Xie for arranging for me to give a “virtual seminar talk” to the Central Iowa R Users Group. You can view my talk, including an interesting Q&A session, online. (The actual start is at 0:34.) There are two separate topics, my regtools package (related to my forthcoming book, From Linear Algebra to Machine Learning: Regression and Classification, with Examples in R), and the recent ASA report on p-values.

GTC 2016

I will be an invited speaker at GTC 2016, a large conference on GPU computation. The main topic will be the use of GPUs in conjunction with R, and I will also speak on my Software Alchemy method, especially in relation to GPU computing.

GTC asked me to notify my “network” about the event, and this blog is the closest thing I have. 🙂  My talk is on April 7 at 3 pm, Session S6708. I hope to see some of you there.

Even Businessweek Is Talking about P-Values

The March 28 issue of Bloomberg Businessweek has a rather good summary of the problems of p-values, even recommending the use of confidence intervals and — wonder of wonders — “[looking] at the evidence as a whole.” What, statistics can’t make our decisions for us?  🙂

It does make some vague and sometimes puzzling statements, but for the p-values issue to actually find its way into such a nontechnical, mainstream publication as this one is pretty darn remarkable.  Thank you, ASA!

The article, “Lies, Damned Lies and More Statistics,” is on page 12. Unfortunately, I can’t find it online.

In my previous posts on the p-value issue, I took issue with the significance test orientation of the R language. I hope articles like this will push the R Core Team in the right direction.

P-values: the Continuing Saga

I highly recommend the blog post by Yoav Benjamini and Tal Galili in defense of (carefully used) p-values. I disagree with much of it, but the exposition is very clear, and there is a nice guide to relevant R tools, including for simultaneous inference, a field in which Yoav is one of the most prominent, indeed pre-eminent, researchers. I do have a few points to make.

First, regarding exactly what the ASA said, I would refer readers to my second post on the matter, which argues that the ASA statement was considerably stronger than Yoav and Tal took it to be.

Second, Yoav and Tal make the point that one can’t beat p-values for simplicity of assumptions. I’d add to that point the example of permutation tests (see the sketch below). Of course, my objections remain, but putting that aside, I would note that I too tend to be a minimalist in assumptions — I’ve never liked the likelihood idea, for instance — and I would cite my example in my second post of much-generalized Scheffé intervals as an example. Those who read my 50% draft book on regression and classification will see this as a recurring theme.
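For instance, a two-sample permutation test requires little more than exchangeability of the observations under the null hypothesis; here is a minimal sketch with simulated data:

# Two-sample permutation test for a difference in means.
set.seed(1)
x <- rnorm(30)               # group 1 (simulated)
y <- rnorm(30, mean = 0.5)   # group 2 (simulated)
obs <- mean(x) - mean(y)
pooled <- c(x, y)
permdiffs <- replicate(5000, {
   idx <- sample(length(pooled), length(x))
   mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(permdiffs) >= abs(obs))   # two-sided p-value estimate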

I of course agree strongly with Yoav and Tal’s point about problems with just checking whether a confidence interval contains 0, a point I had made too.

What I would like to see from them, though, is what I mentioned several times in the last couple of days — a good, convincing example in which p-values are useful. That really has to be the bottom line.