Some Comments on Donoho’s “50 Years of Data Science”

An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The paper should be required reading in all Stat and CS Departments. But as a CS person with deep roots in statistics, I believe there are a few points Donoho should have developed more, which I will discuss here, as well as other points on which his essay really shines.

Though no one seems to claim to know what data science is — not even on an “I know it when I see it” basis — everyone seems to agree that it is roughly a combination of statistics and computer science. Fine, but what does that mean? Let’s take the computer science aspect first.

By CS here, I mean facility with computers, and by that in turn I mean more than programming. By happenstance, I was in a conversation today with some CS colleagues as to whether material on computer networks should be required for CS majors. One of my colleagues said there were equally deserving topics, such as Hadoop. My reply was that Hadoop is SLOW (so much so that many are predicting its imminent demise), and maximizing its performance involves, inter alia, an understanding of…computer networks. Donoho doesn’t cover this point about computation (nor, it seems, do most data science curricula), limiting himself to programming languages and libraries.

But he does a fine job on the latter. I was pleased that his essay contains quite a bit of material on R, such as the work of Yihui Xie and Hadley Wickham. That a top theoretician devotes so much space in a major position paper to R is a fine tribute to the status R has attained in this profession.

(In that context, I feel compelled to note that in attending a talk at Rice in 2012 I was delighted to see Manny Parzen, 86 years old and one of the pioneers of modern statistical theory, interrupt his theoretical talk with a brief exposition on the NINE different definitions of quantile available in calls to R’s quantile() function. Bravo!)
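
For readers who have not run into this, here is a minimal sketch, in Python rather than R, of the six continuous definitions (types 4 through 9 in R's numbering; the three discontinuous ones are left out). The function name hf_quantile and the sample data are illustrative choices only, and the constants follow the Hyndman-Fan parameterization as I understand R to use it, so treat this as a sketch rather than a reference implementation.

```python
import math

# Plotting-position constants (a, b) for the continuous quantile types 4..9.
# Assumption: these follow the Hyndman-Fan parameterization used by R.
HF_CONSTANTS = {
    4: (0.0, 1.0),
    5: (0.5, 0.5),
    6: (0.0, 0.0),
    7: (1.0, 1.0),      # R's default type
    8: (1 / 3, 1 / 3),
    9: (3 / 8, 3 / 8),
}


def hf_quantile(data, p, qtype=7):
    """Sample quantile at probability p under one of the continuous types."""
    xs = sorted(data)
    n = len(xs)
    a, b = HF_CONSTANTS[qtype]
    pos = a + p * (n + 1 - a - b)   # fractional 1-based position in xs
    j = math.floor(pos)
    h = pos - j                     # interpolation weight in [0, 1)
    # Clamp both neighbours to the valid 1-based range 1..n.
    lo = xs[min(max(j, 1), n) - 1]
    hi = xs[min(max(j + 1, 1), n) - 1]
    return (1 - h) * lo + h * hi


sample = [1, 2, 3, 4, 10]
for qtype in sorted(HF_CONSTANTS):
    print(qtype, hf_quantile(sample, 0.25, qtype))
```

On this five-point sample the lower quartile is 2.0 under type 7 but 1.5 under type 6, which is exactly the kind of discrepancy that makes the choice of definition worth a digression.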

Donoho notes, however, that the Berkeley data science curriculum uses Python instead of R. He surmises that this is due to Python handling Big Data better than R, but I suspect it has more to do with the CS people at UCB being the main ones designing the curriculum, acting on a general preference in CS for the “more elegant” language Python.

But is Python a better tool than R for Big Data? Many would say so, I think, but a good case can be made for R. For instance, to my knowledge there is nothing in Python like CRAN’s bigmemory package, giving a direct R interface to shared memory at the C++ level. (I also have a parallel computation package, Rdsm, that runs on top of bigmemory.)

Regrettably, the Donoho essay contains only the briefest passing reference to parallel computation. But again, he is not alone. Shouldn’t a degree in data science, ostensibly aimed in part at Big Data, include at least some knowledge of parallel computation? I haven’t seen any that do. Note, though, that coverage of such material would again require some knowledge of computer system infrastructure, and thus would be at odds with the “a little of this, a little of that, but nothing in depth” philosophy taken so far in data science curricula.

One topic I was surprised to see the essay omit was the fact that so much data today is not in the nice “rectangular” — observations in rows, variables in equal numbers of columns — form that most methodology assumes. Ironically, Donoho highlights Hadley Wickham’s plyr package, as rectangular as can be. Arguably, data science students ought to be exposed more to sophisticated use of R’s tapply(), for instance.
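
To make the non-rectangular point concrete, here is a small sketch in Python of the grouped-apply idea behind tapply(): a vector of values, a parallel vector of group labels, and a function applied to each ragged group. The name tapply_like and the toy readings are hypothetical, and this is only a loose analogue; R's tapply() handles multiple factors and returns an array, which this does not attempt.

```python
from collections import defaultdict


def tapply_like(values, index, fun):
    """Apply fun to the values falling in each group named by index."""
    groups = defaultdict(list)
    for value, label in zip(values, index):
        groups[label].append(value)
    # Groups may have different sizes, so there is no rectangular table here.
    return {label: fun(vals) for label, vals in sorted(groups.items())}


# Hypothetical data: subjects measured an unequal number of times.
readings = [120, 135, 128, 110, 142, 139, 150]
subject = ["a", "a", "a", "b", "c", "c", "c"]

print(tapply_like(readings, subject, len))
print(tapply_like(readings, subject, max))
```

Subject "b" contributes one reading and the others three each, so forcing the data into one row per subject would require either padding or discarding information.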

Now turning to the stat aspect of Data Science, a key theme in the essay is, to borrow from Marie Davidian, aren’t WE (statistics people) Data Science? Donoho does an excellent job here of saying the answer is Yes (or if not completely Yes, close enough so that the answer could be Yes with a little work). I particularly liked this gem:

It is striking how, when I review a presentation on today’s data science, in which statistics is superficially given pretty short shrift, I can’t avoid noticing that the underlying tools, examples, and ideas which are being taught as data science were all literally invented by someone trained in Ph.D. statistics, and in many cases the actual software being used was developed by someone with an MA or Ph.D. in statistics. The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can’t be hidden in the teaching, research, and exercise of Data Science.

Yes! Not only does it succinctly show that there is indeed value to theory, but also it illustrates the point that many statisticians are not computer wimps after all. Who needs data science? 🙂

I believe that Donoho, citing Leo Breiman, is too quick to concede the prediction field to Machine Learning. As we all know, prediction has been part of Statistics since its inception, literally for centuries. Granted, modern math stat has an exquisitely developed theory of estimation, but I have seen too many Machine Learning people, ones I otherwise respect highly, make the absurd statement, “ML is different from statistics, because we do prediction.”

Indeed, one of Donoho’s most salient points is that having MORE methods available for prediction is not the same as doing BETTER prediction. He shows the results of some experiments he conducted with Jiashun Jin on some standard real data sets, in which a very simple predictor is compared to various “fancy” ones:

Boosting, Random Forests and so on are dramatically more complex and have correspondingly higher charisma in the Machine Learning community. But against a series of pre-existing benchmarks developed in the Machine Learning community, the charismatic methods do not outperform the homeliest of procedures…

This would certainly be a shock to most students in ML courses — and to some of their instructors.

Maybe the “R people” (i.e. Stat Departments) have as much to contribute to data science as the “Python people” (CS) after all.


23 thoughts on “Some Comments on Donoho’s “50 Years of Data Science””

  1. Fantastic commentary on Donoho’s excellent essay. To extend on the “ML is different from statistics, because we do prediction” mindset, it is often associated with “Classical statistics is only concerned with hypothesis testing”, which, for us in statistics, is utterly untrue. I feel this is a way to diminish someone’s contribution so that we can elevate some other flavour. It may be shocking to young ML professionals and students to learn that statistics has focused on predictions for a hundred years or more, and that may be hard to accept. But it’s not their fault if they were never taught in the first place.

    1. Thanks for the nice comments. Good point on the further reduction of statistics in a chain Prediction -> Estimation -> Hypothesis Testing. The current skeptical attention being paid to p-values, in part motivated by Ioannidis’ commentaries, might help a little in that regard, but the entrenched momentum is almost impossible to remedy, especially because, as you point out, there is a lot of bad teaching out there.

      1. Regarding your sentence “The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can’t be hidden in the teaching, research, and exercise of Data Science.”: you should share this 1984 paper in the Journal of the American Statistical Association on using cross-validation in predictive models…
        https://www.jstor.org/stable/2288403?seq=1#page_scan_tab_contents

        I always find it irritating when authors compare a poorly specified “classical” model such as logistic reg to a more modern model such as RF or SVM and declare the victory of the latter, thus feeding the myth of dated models. Upon closer inspection, the LR model resembles more something a student wrote for a stat101 homework. I echo James, Witten, Hastie, Tibshirani (2013) when they write: “However, for historical reasons, the use of non-linear kernels is much more widespread in the context of SVMs than in the context of logistic regression or other methods”.

        There’s also a large body of work by Peter Austin who compares classical (well specified) methods to more modern ones: http://works.bepress.com/peter_austin/

        or even this one, illustrating how many modern techniques are data hungry:
        http://www.citeulike.org/user/harrelfe/article/13467382

        It’s important then that the “Data Science” curricula do not view “classical methods” as dated and unfairly penalize them in favor of the new (and admittedly catchier-sounding) methods.

      2. Thanks for the useful links. By the way, Hastie et al also talk of the “hype” regarding neural networks. I was pleasantly surprised to see this.

    2. YES, you are absolutely correct. Statistics and its enormous branches have been in use forever, and its Johnny-come-lately offspring disciplines cannot take all the credit. AI bit the dust, as did ES; they are now embedded as integral parts of systems for decision making. So let us not say that somebody has found the Central Limit Theorem again, or the Cramér-Rao inequality, or the Box-Jenkins ARIMA forecasting models of the late Prof. G. E. P. Box.

      1. Ha. I’ve been toying with the idea of writing a piece on my blog about what the term means to me, but have been discouraged because it could open a can of worms. Your comment might be the kick I need to finally write it.

  2. I’ve led a discussion about this recently, as I’m trying to learn ML/data science, starting as a statistician. The thing which struck me about the ML literature vs the statistical literature is the emphasis on prediction. I’ve seen example after example of model building/hypothesis testing in the statistical literature, with no prediction whatsoever. Note that I’m using ‘prediction’ to mean ‘out of sample’ prediction, whether in a hold-out sample, cross-validation or a future sample collected after the modeling was done.

    1. The out-of-sample prediction is the root of the biggest contradiction in the ML literature, in my view. On the one hand, the ML people insist on not basing their analyses on the assumption that the data are a sample from a population (real or conceptual); to them, our data is “the whole thing.” But on the other hand, they want to predict “new” data, which is ABSURD unless the new data come from the same population as the original data.

  3. OUT OF SAMPLE PREDICTION: An interesting example of out of sample prediction comes from “The First Census Optical Character Recognition Systems Conference” in 1990.
    http://www.nist.gov/customcf/get_pdf.cfm?pub_id=900652

    “The U.S. Bureau of the Census (Census) and the National Institute of Standards and Technology (NIST) sponsored this Conference as part of ongoing research into recognition of hand-print.”
    page 1

    “One result of the Conference was that those recognition systems trained solely on the SD3 database generally displayed inferior TD1 recognition to those trained on a superset of this data, i.e. one including SD3 as a subset, or other proprietary datasets. The notion that SD3 as “clean” or “constrained” relative to the TD1 dataset was suggested by the writer profiles; SD3 was obtained from motivated permanent Census field personnel whereas TD1 was obtained from variously motivated more diverse and cosmopolitan high school students. An example is that the European crossed seven is far more abundant in TD1 than SD3.”
    page 20

    This suggests a number of points:
    1. The population was undefined
    2. Splitting one sample into “Test” and “Training” is NOT equivalent to two separately collected samples (how far “out of sample” do you want to test?)
    3. Is it possible to define a “population” for an OCR system? I could hand print a sample right now that could (in theory) be run against the hardware and software of 25 years ago; this is different from official statistics that might wish to determine for example the prevalence of disease in a single year in a single location. One is open-ended and the other is closed. An open-ended population could introduce new variants (such as the crossed seven) rarely or not seen at all in the original sample AND population. If a type is not seen in the original sample or population it will not show up in the sample no matter how split or re-sampled.

  4. — But on the other hand, they want to predict “new” data, which is ABSURD unless the new data come from the same population as the original data.

    A Ph.D. math statistician I once worked for said it thus: “thou shalt not predict beyond the range of the data”, which is stricter. Time series analysts do so all the time, of course.

  5. In World War II code breaking and in birdwatching, certain types are known to exist even if they are not observed in the samples obtained to date; they are assigned a small, but importantly non-zero, probability. I believe Turing’s assistant I. J. Good wrote about this. Thus, if one had an OCR sample without a slashed seven, one might need to “spike” (as in “to lace (a drink) with liquor”) the training and test samples with an artificially created example until the example is found in the wild (and test with and without the “spike”).
    https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation

  6. By “spiking” the training and test sample with a labeled value as yet unobserved, one could even train a learner to search for new items such as a “Higgs Boson” or “Planet 9”!

  7. Out of sample inference has been called analytic inference by Deming (as opposed to explanatory analysis, which he called enumerative studies). This was presented in: Deming, W.E. (1953) On the distinction between enumerative and analytic studies. Journal of the American Statistical Association, 48, pp. 244–255.
    To have discussants refer to analytic studies as “ABSURD” is evidence of one of the biggest self-inflicted barriers statisticians face. The reason data science is picking up where statistics should have expanded is this absurd self-inflicted barrier.

  8. To strengthen the point that out of sample inference should concern statisticians, here is a quote by Deming: “Tests of variables that affect a process are useful only if they predict what will happen if this or that variable is increased or decreased. It is only with material produced in statistical control that one may talk about an experiment as a conceptual sample that could be extended to infinite size. Unfortunately, this supposition, basic to the analysis of variance and to many other statistical techniques, is as unrealizable as it is vital in much experimentation carried out in industry, agriculture, and medicine. Statistical theory as taught in the books is valid and leads to operationally verifiable tests and criteria for an enumerative study. Not so with an analytic problem, as the conditions of the experiment will not be duplicated in the next trial. Unfortunately, most problems in industry are analytic.”
    From the dedicated preface to the Economic Control of Quality of Manufactured Product by W. Shewhart, 1931.
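
A note on the Good-Turing idea raised in comment 5: its simplest result is that the total probability of all types not yet observed can be estimated as the number of types seen exactly once divided by the total number of observations. The sketch below, in Python, shows just that one estimate. The function name unseen_mass and the toy sample of hand-printed characters are hypothetical, and the full method (smoothed, adjusted counts for the types that were seen) is not attempted.

```python
from collections import Counter


def unseen_mass(observations):
    """Good-Turing estimate of the total probability of unseen types.

    Computed as N1 / N: types observed exactly once, over all observations.
    """
    counts = Counter(observations)
    n_total = sum(counts.values())
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return n_singletons / n_total


# Hypothetical sample of recognized characters; no crossed seven appears.
chars = ["0", "1", "1", "7", "7", "7", "2"]
print(unseen_mass(chars))
```

Here two of the four observed types occur once among seven observations, so about 2/7 of the probability is reserved for characters the sample never showed, which is the non-zero allowance the comment describes.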
