Some Comments on Donoho’s “50 Years of Data Science”

January 23, 2016 matloff 30 Comments

An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The paper should be required reading in all Stat and CS Departments. But as a CS person with deep roots in statistics, I believe there are a few points Donoho should have developed more, which I will discuss here, as well as other points on which his essay really shines.

Though no one seems to claim to know what data science is — not even on an “I know it when I see it” basis — everyone seems to agree that it is roughly a combination of statistics and computer science. Fine, but what does that mean? Let’s take the computer science aspect first.

By CS here, I mean facility with computers, and by that in turn I mean more than programming. By happenstance, I was in a conversation today with some CS colleagues as to whether material on computer networks should be required for CS majors. One of my colleagues said there were equally deserving topics, such as Hadoop. My reply was that Hadoop is SLOW (so much so that many are predicting its imminent demise), and maximizing its performance involves, inter alia, an understanding of…computer networks. Donoho doesn’t cover this point about computation (nor, it seems, do most data science curricula), limiting himself to programming languages and libraries.

But he does a fine job on the latter. I was pleased that his essay contains quite a bit of material on R, such as the work of Yihui Xie and Hadley Wickham. That a top theoretician devotes so much space in a major position paper to R is a fine tribute to the status R has attained in this profession.

(In that context, I feel compelled to note that in attending a talk at Rice in 2012 I was delighted to see Manny Parzen, 86 years old and one of the pioneers of modern statistical theory, interrupt his theoretical talk with a brief exposition on the NINE different definitions of quantile available in calls to R’s quantile() function. Bravo!)
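To make concrete why the choice of definition matters, here is a minimal Python sketch (not R's own implementation) of three of the nine definitions. It assumes the Hyndman-Fan numbering that R's quantile() documentation uses for its type argument: type 1 (inverse of the empirical CDF), type 6, and type 7 (R's default). The function name sample_quantile is hypothetical and exists only for this illustration.

```python
import math


def sample_quantile(data, p, qtype=7):
    """Sample quantile of `data` at probability p, with 0 <= p <= 1.

    qtype follows the Hyndman-Fan numbering (an assumption of this sketch):
      1 = inverse of the empirical CDF, no interpolation
      6 = linear interpolation at position (n + 1) * p
      7 = linear interpolation at position (n - 1) * p + 1
    """
    x = sorted(data)
    n = len(x)
    if n == 0:
        raise ValueError("data must be non-empty")
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")

    if qtype == 1:
        # Smallest order statistic whose empirical CDF value is >= p.
        k = max(1, math.ceil(n * p))
        return x[k - 1]

    if qtype == 6:
        h = (n + 1) * p
    elif qtype == 7:
        h = (n - 1) * p + 1
    else:
        raise ValueError("this sketch only covers types 1, 6 and 7")

    # Clamp the (1-based) position to the range of the order statistics,
    # then interpolate linearly between the two neighbouring values.
    h = min(max(h, 1.0), float(n))
    lo = math.floor(h)
    hi = min(lo + 1, n)
    return x[lo - 1] + (h - lo) * (x[hi - 1] - x[lo - 1])
```

For the small sample [1, 2, 3, 4, 10] at p = 0.75, the three definitions return 4, 7.0 and 4.0 respectively, so the "upper quartile" of the same five numbers depends on which definition is in force.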

Donoho notes, however, that the Berkeley data science curriculum uses Python instead of R. He surmises that this is due to Python handling Big Data better than R, but I suspect it has more to do with the CS people at UCB being the main ones designing the curriculum, acting on a general preference in CS for the “more elegant” language Python.

But is Python a better tool than R for Big Data? Many would say so, I think, but a good case can be made for R. For instance, to my knowledge there is nothing in Python like CRAN’s bigmemory package, giving a direct R interface to shared memory at the C++ level. (I also have a parallel computation package, Rdsm, that runs on top of bigmemory.)

Regrettably, the Donoho essay contains only the briefest passing reference to parallel computation. But again, he is not alone. Shouldn’t a degree in data science, ostensibly aimed in part at Big Data, include at least some knowledge of parallel computation? I haven’t seen any that do. Note, though, that coverage of such material would again require some knowledge of computer system infrastructure, and thus would be at odds with the “a little of this, a little of that, but nothing in depth” philosophy taken so far in data science curricula.

One topic I was surprised to see the essay omit was the fact that so much data today is not in the nice “rectangular” — observations in rows, variables in equal numbers of columns — form that most methodology assumes. Ironically, Donoho highlights Hadley Wickham’s plyr package, as rectangular as can be. Arguably, data science students ought to be exposed more to sophisticated use of R’s tapply(), for instance.
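As a rough illustration of the kind of non-rectangular operation meant here, the following Python sketch mimics the basic behavior of R's tapply(x, f, FUN): split a vector of values by a parallel vector of group labels, then apply a function to each (possibly differently sized) group. It is only an analogue under stated assumptions. The Python function named tapply below, and the patient and reading data, are hypothetical, and the sketch handles a single grouping factor only.

```python
from collections import defaultdict
from statistics import mean


def tapply(values, groups, func):
    """Apply func to the values falling in each group.

    A rough, single-factor analogue of R's tapply(x, f, FUN); the result
    is a dict keyed by group label, in sorted label order.
    """
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    cells = defaultdict(list)
    for value, group in zip(values, groups):
        cells[group].append(value)
    return {group: func(vals) for group, vals in sorted(cells.items())}


# Ragged data: each patient has a different number of readings, so the
# records do not form a neat rows-by-columns table.
patient = ["a", "b", "a", "c", "a", "b"]
reading = [120, 135, 118, 150, 122, 131]

per_patient_mean = tapply(reading, patient, mean)
per_patient_count = tapply(reading, patient, len)
per_patient_raw = tapply(reading, patient, list)  # the ragged groups themselves
```

Because func receives each group as a plain list, it may return a scalar, a vector of any length, or a fitted model, which is what makes this style of operation suited to ragged data.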

Now turning to the stat aspect of Data Science, a key theme in the essay is, to borrow from Marie Davidian, aren’t WE (statistics people) Data Science? Donoho does an excellent job here of saying the answer is Yes (or if not completely Yes, close enough so that the answer could be Yes with a little work). I particularly liked this gem:

It is striking how, when I review a presentation on today’s data science, in which statistics is superficially given pretty short shrift, I can’t avoid noticing that the underlying tools, examples, and ideas which are being taught as data science were all literally invented by someone trained in Ph.D. statistics, and in many cases the actual software being used was developed by someone with an MA or Ph.D. in statistics. The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can’t be hidden in the teaching, research, and exercise of Data Science.

Yes! Not only does it succinctly show that there is indeed value to theory, but it also illustrates the point that many statisticians are not computer wimps after all. Who needs data science? 🙂

I believe that Donoho, citing Leo Breiman, is too quick to concede the prediction field to Machine Learning. As we all know, prediction has been part of Statistics since its inception, literally for centuries. Granted, modern math stat has an exquisitely developed theory of estimation, but I have seen too many Machine Learning people, ones I otherwise respect highly, make the absurd statement, “ML is different from statistics, because we do prediction.”

Indeed, one of Donoho’s most salient points is that having MORE methods available for prediction is not the same as doing BETTER prediction. He shows the results of some experiments he conducted with Jiashun Jin on some standard real data sets, in which a very simple predictor is compared to various “fancy” ones:

Boosting, Random Forests and so on are dramatically more complex and have correspondingly higher charisma in the Machine Learning community. But against a series of pre-existing benchmarks developed in the Machine Learning community, the charismatic methods do not outperform the homeliest of procedures…

This would certainly be a shock to most students in ML courses — and to some of their instructors.

Maybe the “R people” (i.e. Stat Departments) have as much to contribute to data science as the “Python people” (CS) after all.

30 thoughts on “Some Comments on Donoho’s “50 Years of Data Science””

Pingback: Some Comments on Donaho’s “50 Years of Data Science”-IT大道
Thomas Speidel says:

January 23, 2016 at 10:23 am

Fantastic commentary on Donoho’s excellent essay. To extend on the “ML is different from statistics, because we do prediction” mindset, it is often associated with “Classical statistics is only concerned with hypothesis testing”, which, for us in statistics, is utterly untrue. I feel this is a way to diminish someone’s contribution so that we can elevate some other flavour. It may be shocking to young ML professionals and students to learn that statistics has focused on predictions for a hundred years or more, and that may be hard to accept. But it’s not their fault if they were never taught in the first place.

1. matloff says:
  
  January 23, 2016 at 10:41 am
  
  Thanks for the nice comments. Good point on the further reduction of statistics in a chain Prediction -> Estimation -> Hypothesis Testing. The current skeptical attention being paid to p-values, in part motivated by Ioannidis’ commentaries, might help a little in that regard, but the entrenched momentum is almost impossible to remedy, especially because, as you point out, there is a lot of bad teaching out there.
  
  1. Thomas says:
    
    January 25, 2016 at 7:34 am
    
    Regarding the sentence “The accumulated efforts of statisticians over centuries are just too overwhelming to be papered over completely, and can’t be hidden in the teaching, research, and exercise of Data Science”: you should share this 1984 paper in the Journal of the American Statistical Association on using cross-validation in predictive models…
    https://www.jstor.org/stable/2288403?seq=1#page_scan_tab_contents
    
    I always find it irritating when authors compare a poorly specified “classical” model such as logistic reg to a more modern model such as RF or SVM and declare the victory of the latter, thus feeding the myth of dated models. Upon closer inspection, the LR model resembles more something a student wrote for a stat101 homework. I echo James, Witten, Hastie, Tibshirani (2013) when they write: “However, for historical reasons, the use of non-linear kernels is much more widespread in the context of SVMs than in the context of logistic regression or other methods”.
    
    There’s also a large body of work by Peter Austin who compares classical (well specified) methods to more modern ones: http://works.bepress.com/peter_austin/
    
    or even this one, illustrating how many modern techniques are data hungry:
    http://www.citeulike.org/user/harrelfe/article/13467382
    
    It’s important, then, that the “Data Science” curricula do not view “classical methods” as dated and unfairly penalize them in favor of the new, and admittedly catchier-sounding, methods.
    
    1. matloff says:
      
      January 25, 2016 at 11:50 pm
      
      Thanks for the useful links. By the way, Hastie et al. also talk of the “hype” regarding neural networks. I was pleasantly surprised to see this.
      
2. CHANDRASEKHARA S. "C.S." GANTI says:
  
  January 31, 2016 at 7:25 am
  
  YES, you are absolutely correct. Statistics and its enormous branches have been in use forever, and its Johnny-come-lately offspring disciplines cannot take all the credit. AI bit the dust, as did ES; they are now embedded as integral parts of systems decision making. So let us not say that somebody has found the Central Limit Theorem again, or the Cramér-Rao inequality, or the Box-Jenkins ARIMA forecasting models of the late Prof. G.E.P. Box.
  
xi'an says:

January 23, 2016 at 1:52 pm

Do you know you misspelled David Donoho’s name as “Donaho” and “Donahue” throughout the post?

1. matloff says:
  
  January 23, 2016 at 2:27 pm
  
  Oops! Thanks for pointing that out. I’ve fixed it now, and done a couple of other edits.
  
Ari Lamstein says:

January 23, 2016 at 4:11 pm

Great post Norm. I’m glad that I’m not the only one who doesn’t know what data science is!

1. matloff says:
  
  January 23, 2016 at 4:46 pm
  
  Not only doesn’t anyone know what it is, I have yet to find anyone who even claims to know what it is. 🙂
  
  1. Ari Lamstein says:
    
    January 23, 2016 at 6:20 pm
    
    Ha. I’ve been toying with the idea of writing a piece on my blog about what the term means to me, but have been discouraged because it could open a can of worms. Your comment might be the kick I need to finally write it.
    
    1. matloff says:
      
      January 24, 2016 at 12:17 am
      
      I look forward to seeing your take on it!
      
Wayne Gray - RPI says:

January 23, 2016 at 4:35 pm

Great essay, thanks. I just discovered “Donoho’s” essay recently (and was also corrected on a misspelling). Yours is a good add-on. Thanks.

Pingback: Distilled News | Data Analytics & R
Barry says:

January 25, 2016 at 12:43 pm

I’ve led a discussion about this recently, as I’m trying to learn ML/data science, starting as a statistician. The thing which struck me about the ML literature vs the statistical literature is the emphasis on prediction. I’ve seen example after example of model building/hypothesis testing in the statistical literature, with no prediction whatsoever. Note that I’m using ‘prediction’ to mean ‘out of sample’ prediction, whether in a hold-out sample, cross-validation or a future sample collected after the modeling was done.

1. matloff says:
  
  January 25, 2016 at 11:48 pm
  
  Out-of-sample prediction is the root of the biggest contradiction in the ML literature, in my view. On the one hand, the ML people insist on not basing their analyses on the assumption that the data are a sample from a population (real or conceptual); to them, our data is “the whole thing.” But on the other hand, they want to predict “new” data, which is ABSURD unless the new data come from the same population as the original data.
  
Jim Callahan says:

January 26, 2016 at 10:01 am

OUT OF SAMPLE PREDICTION: An interesting example of out of sample prediction comes from “The First Census Optical Character Recognition Systems Conference” in 1990.
http://www.nist.gov/customcf/get_pdf.cfm?pub_id=900652

“The U.S. Bureau of the Census (Census) and the National Institute of Standards and Technology (NIST) sponsored this Conference as part of ongoing research into recognition of hand-print.”
page 1

“One result of the Conference was that those recognition systems trained solely on the SD3 database generally displayed inferior TD1 recognition to those trained on a superset of this data, i.e. one including SD3 as a subset, or other proprietary datasets. The notion that SD3 as “clean” or “constrained” relative to the TD1 dataset was suggested by the writer profiles; SD3 was obtained from motivated permanent Census field personnel whereas TD1 was obtained from variously motivated more diverse and cosmopolitan high school students. An example is that the European crossed seven is far more abundant in TD1 than SD3.”
page 20

This suggests a number of points:
1. The population was undefined
2. Splitting one sample into “Test” and “Training” is NOT equivalent to two separately collected samples (how far “out of sample” do you want to test?)
3. Is it possible to define a “population” for an OCR system? I could hand print a sample right now that could (in theory) be run against the hardware and software of 25 years ago; this is different from official statistics that might wish to determine, for example, the prevalence of disease in a single year in a single location. One is open-ended and the other is closed. An open-ended population could introduce new variants (such as the crossed seven) rarely or not seen at all in the original sample AND population. If a type is not seen in the original sample or population, it will not show up in the sample no matter how it is split or re-sampled.

Robert Young says:

January 26, 2016 at 10:03 am

— But on the other hand, they want to predict “new” data, which is ABSURD unless the new data come from the same population as the original data.

A Ph.D. math statistician I once worked for said it thus: “thou shalt not predict beyond the range of the data”, which is more strict. Time series analysts do so all the time, of course.

Jim Callahan says:

February 9, 2016 at 6:47 am

In World War II code breaking and birdwatching, certain types are known to exist even if they are not observed in the samples obtained to date; they are assigned a small, but importantly non-zero probability. I believe Turing’s assistant I. J. Good wrote about this. Thus, if one had an OCR sample without a slashed seven one might need to “spike” (as in “to lace (a drink) with liquor”) the training and test samples with an artificially created example until the example is found in the wild (and test with and without the “spike”).
https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation

Jim Callahan says:

February 9, 2016 at 7:00 am

By “spiking” the training and test sample with a labeled value as yet unobserved, one could even train a learner to search for new items such as a “Higgs Boson” or “Planet 9”!

rkenett says:

July 30, 2016 at 11:19 pm

Out-of-sample inference has been called analytic inference by Deming (as opposed to explanatory analysis, which he called enumerative studies). This was presented in: Deming, W.E. (1953) On the distinction between enumerative and analytic studies. Journal of the American Statistical Association, 48, pp. 244–255.
To have discussants refer to analytic studies as “ABSURD” is evidence of one of statisticians’ most self-inflicted barriers. The reason data science is picking up where statistics should have expanded is because of this absurd self-inflicted barrier.

rkenett says:

July 31, 2016 at 10:54 am

To strengthen the point that out-of-sample inference should concern statisticians, here is a quote by Deming: “Tests of variables that affect a process are useful only if they predict what will happen if this or that variable is increased or decreased. It is only with material produced in statistical control that one may talk about an experiment as a conceptual sample that could be extended to infinite size. Unfortunately, this supposition, basic to the analysis of variance and to many other statistical techniques, is as unrealizable as it is vital in much experimentation carried out in industry, agriculture, and medicine. Statistical theory as taught in the books is valid and leads to operationally verifiable tests and criteria for an enumerative study. Not so with an analytic problem, as the conditions of the experiment will not be duplicated in the next trial. Unfortunately, most problems in industry are analytic.”
From the dedicated preface to Economic Control of Quality of Manufactured Product by W. Shewhart, 1931.

1. matloff says:
  
  July 31, 2016 at 3:54 pm
  
  I have no problem with conceptual populations.
  
Sufiyan Sheikh says:

September 23, 2019 at 9:37 pm

Data science is really a good field, and this blog has amazing articles about it.

Khan Faisal says:

September 23, 2019 at 9:43 pm

An amazing and motivational article for those entering the field of data science.

kajal says:

March 28, 2021 at 12:33 am

Amazing analysis.


Mad (Data) Scientist


Musings, useful code etc. on R and data science
