The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

- The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
- Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidson write a plaintive editorial titled, “Aren’t *We* Data Science?”

Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R enthusiast.

# CS vs. Statistics

Let’s consider the CS issue first. Recently a number of new terms have arisen, such as *data science*, *Big Data*, and *analytics*, and the popularity of the term *machine learning* has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Dept. at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. (A friend who read a draft of this post joked it should be titled “J’accuse” but of course this is not my intention.) However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?

Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term *data science*, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements that are made like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of **systemic reasons** for this, structural problems with the CS research “business model”:

- **CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”**–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals. This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work. The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
- Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is **heavy pressure to bring in lots of research funding, and produce lots of PhD students**. Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand. This is made even worse by the rapid change in the fashionable research topic *du jour*, making it difficult to go into a topic in any real depth. Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
- **There is rampant “reinventing the wheel.”** The above-mentioned lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature. This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of. For instance, consider a paper on the use of mixed labeled and unlabeled training data in classification. (I’ll omit names.) One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
- Again for historical reasons, CS research is largely empirical/experimental in nature. This causes what in my view is **one of the most serious problems plaguing CS research in Stat–lack of rigor**. Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine. But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; the paper really stressed this point, yet actually, one can add quadratic terms and so on to model this.
- **This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.** Most empirical work in CS doesn’t have any models to worry about. That’s entirely appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work. A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments in the world. She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution. She couldn’t answer–she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- **Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword.** On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a crowd. But this mentality leads to an oversimplified view of things, with everything being viewed as a paradigm shift. Neural networks epitomize this problem. Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.) Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem–not because fundamentally new technology has been invented.
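
To make the logistic regression point above concrete, here is a minimal sketch in Python (standard library only) of how adding a quadratic term lets the logistic model capture a non-monotonic relation. Everything in it is an assumption made for illustration: the simulated inverted-U data, the coefficient values, the helper names (`sigmoid`, `solve`, `fit_logistic`, `accuracy`, `demo`) and the choice of Newton's method for the fitting. It is not the setup of the paper mentioned above.

```python
import math
import random


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, ys, iters=25):
    """Fit logistic regression by Newton's method; each row is a list of features."""
    k = len(rows[0])
    w = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for z, y in zip(rows, ys):
            p = sigmoid(sum(wj * zj for wj, zj in zip(w, z)))
            for i in range(k):
                grad[i] += (y - p) * z[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * z[i] * z[j]
        step = solve(hess, grad)
        w = [wi + si for wi, si in zip(w, step)]
    return w


def accuracy(w, rows, ys):
    """Share of cases where the fitted probability is on the correct side of 0.5."""
    hits = 0
    for z, y in zip(rows, ys):
        p = sigmoid(sum(wj * zj for wj, zj in zip(w, z)))
        hits += int((p >= 0.5) == (y == 1))
    return hits / len(ys)


def demo(n=1000, seed=0):
    """Simulate an inverted-U relation and fit models with and without x**2."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    # Assumed true model: P(y = 1) is highest near x = 0 and falls off on both sides.
    ys = [1 if rng.random() < sigmoid(2.0 - 2.0 * x * x) else 0 for x in xs]
    linear_rows = [[1.0, x] for x in xs]
    quad_rows = [[1.0, x, x * x] for x in xs]
    w_lin = fit_logistic(linear_rows, ys)
    w_quad = fit_logistic(quad_rows, ys)
    return accuracy(w_lin, linear_rows, ys), accuracy(w_quad, quad_rows, ys), w_quad


acc_lin, acc_quad, w_quad = demo()
print(f"accuracy, linear terms only:   {acc_lin:.3f}")
print(f"accuracy, with quadratic term: {acc_quad:.3f}")
print("quadratic-model coefficients:", [round(c, 2) for c in w_quad])
```

On simulated data of this kind, the model with the squared term should classify noticeably better than the one with linear terms only, and its fitted quadratic coefficient should come out negative. In R the same comparison amounts to adding a squared term to the model formula, e.g. `glm(y ~ x + I(x^2), family = binomial)`.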

All this matters–a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

# Making Statistics Attractive to Students

This of course is an age-old problem in Stat. Let’s face it–the very word *statistics* sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse–the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example is that a student complained to me that even though he had attended a top-quality high school in the heart of Silicon Valley, his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s². But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
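
For what it is worth, the student's question has a short answer: for independent draws with variance σ², the expected value of the sum of squared deviations from the sample mean is (n-1)σ², not nσ², so dividing by n-1 gives an unbiased estimate. The simulation below is a minimal sketch of that fact in Python. The sample size, σ, the number of replications and the function name `average_variance_estimates` are all arbitrary choices made for illustration.

```python
import random


def average_variance_estimates(n=5, sigma=2.0, reps=20000, seed=1):
    """Average the n-divisor and (n-1)-divisor variance estimates over many samples."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sum of squared deviations from the sample mean (not the true mean).
        ss = sum((x - mean) ** 2 for x in sample)
        total_n += ss / n
        total_n_minus_1 += ss / (n - 1)
    return total_n / reps, total_n_minus_1 / reps


biased, unbiased = average_variance_estimates()
print(f"true variance:             {2.0 ** 2:.3f}")
print(f"average with divisor n:    {biased:.3f}")
print(f"average with divisor n-1:  {unbiased:.3f}")
```

With n = 5, the n-divisor average should land near (n-1)/n = 80% of the true variance, while the (n-1)-divisor average should land near the true variance itself.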

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators don’t have the capabilities for dazzling graphics and analysis of nontrivial data sets that R provides–exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and multiplatform, with outstanding graphical capabilities. There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s *Statistics: An Introduction Using R*, and Peter Dalgaard’s *Introductory Statistics with R*. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.

This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do *something*, right? Switching to R would be doable–and should be done.

Dear Norm,

This is a fascinating post, thank you for writing it!

I am interested in promoting statistical thinking here in Israel, and your attitude to AP is giving me a lot to think about.

Yours,

Tal

Thanks for the comment, Tal. Is the AP Stat course taught in Israel?

As far as I know it is not. I had thought for a while that I might want to try to push teaching stats in high schools here. But after reading your post, and thinking about it further, I decided it might do more harm than good.

I am now trying to think of what might be the best way to get high-school students involved in a stats project, with the guidance of someone who loves stats. I started discussing it with several ISA members here (the Israeli equivalent of the ASA), and will probably bring this up in one of our next “council meetings”.

So thanks again 🙂

Best,

Tal

Good luck, Tal! By the way, I just made another post on the topic.

Great article. I completely agree that the statistics field is due for a makeover. Many of my graduate statistics courses were taught by past-their-prime professors who had little interest in using tools like R to relate the learning to real-world problems. To your point, CS is more attractive because it focuses on real-world problems and the best ways to solve them. Much of the underlying theory is statistics; however, most statistics courses focus only on the theory and never get to the “cool” statistics that is all around us today.

Dear Professor Matloff,

Thanks for your thoughtful post. I’m wondering if you can point me to where I can find papers on using labeled and unlabeled data:

“yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.”

I googled them for some time and can only find one paper from the stat literature:

http://arxiv.org/pdf/0710.4618.pdf

I’m extremely interested in reading this literature.

Thanks a lot,

Kry

Actually, it goes back even earlier than 1981. I would suggest starting with “Inference based on regression estimators in double sampling,” Biometrika (1978), 65, 419-427, and then looking at papers that cite it (I’ve cited it myself).

Thank you very much for the great post and for starting a great discussion. I’d just shared a similar set of thoughts with a graduate student at NC State in the last week.

IMHO, statistics is best served by embracing what others have contributed and then moving the discipline further forward. We shouldn’t feel threatened or get defensive when there are advances. Rather than consider it an encroachment, let’s consider it an enrichment of our discipline and our toolbox.

Speaking as someone in the applied world that also shares “dual loyalties” to statistics and computer science, my job each day is to provide the best solution I can to real problems. And what motivates me more than my loyalty to any discipline is my passion to learn and to see the results of my work generating value in the real world.

It would be very remiss of me not to mention a great paper by Leo Breiman that discusses some of these topics: http://projecteuclid.org/euclid.ss/1009213726

While I don’t agree with everything Breiman said, I applauded his bold approach to open the discussion.

Thank you for likewise stimulating the conversations. This is the open door to move things forward.

Yes, good point about the Breiman essay. Actually, Leo was ahead of his time, working on interesting “big data” problems decades ago. His work is one of the few exceptions to my “rule” that CS people are generally not familiar with the Stat literature.

I hope my remarks were not misconstrued to mean that CS has nothing useful to offer to Stat people. I certainly don’t hold such a view.

Breiman also blamed some/ a lot? of the divorce on Bayesian focus on “uncertainty” and the concentration on machinery for getting posteriors rather than the problem itself. I can link. (I’m an outsider here. I could even use the party analogy Norm used earlier today on my blog, never mind.)

Well said.

As a side note regarding open-source statistics text, I recommend that you check out “OpenIntro Statistics” here: http://www.openintro.org/stat/textbook.php

酬謝, 余(?)先生! By the way, your site’s name, Statistics City, reminds me of another site, Capital of Statistics.

Oops, mouse error, meant to write 謝謝. 😦

Dear Professor Matloff,

Greatly appreciate your help 🙂

Kry

Dear Professor Matloff,

Thanks for writing this post. Statistics and probability have had such an illustrious history of academic and practical accomplishments, and that was done by combining rigor with pragmatism, something that does seem to be missing amid the short attention spans of the fad-loving engineering departments.

Regarding your last point, I think the open-intro statistics textbooks fill in admirably well as an open source statistics text with R:

http://www.openintro.org/stat/textbook.php

Turning them into a wiki would be even better!

Instead of AP stats taught by untrained turn-offs, we should institute a fixed Coursera curriculum for high school students. Have the students watch the videos, complete the assignments, and then do one interesting data analysis project of their choice at the end of the year. There have been plenty of introductory courses on Statistics on online platforms, including one with R taught by a dynamic COPSS medalist! (https://www.coursera.org/course/introstats). Surely there are enough credible statistics courses/resources available for free online to ensure a good foundation for high school students.

Glad to see you support open-source textbooks! I have one of my own, somewhat higher level.

However, I’m pretty negative regarding online education, and have written an op-ed on that topic.

Since coders get to re-invent any wheel which strikes their fancy, nearly always ignorant of the wheel’s history, CS will always have a FOTM style/paradigm/language/framework/foo that is perceived as “new”. Stat, on the other hand, looks to the newbie as hoary. That guy named Student? Feller? Snedecor? All dead and gone. Can’t match The Zuck. The theory bits are largely very old and too mathy, while the applied bits are generally done in Excel (it’s good enough for the London Whale, after all) with a few hours of training.

The Big Data bit is largely applied descriptive stats.

To the extent that theoretical stats is finding new knowledge, this new knowledge needs attachment to the real world to be attractive to the ADHD crowd. CS is often a synonym for software engineering, which is to say, not theory.

Thanks for the humorous though absolutely on-point comments.

I think you’d find that even among Stat grad students, Zuck would have far more name recognition than Feller. 😦

Hi Norm: Didn’t know you had a blog, I will link to it. This issue, the data science/stat divide, has come up a lot. I first heard about it I think on Normal Deviate’s blog, and then increasingly frequently. The leaders in machine learning also think they are leaders in philosophy and claim to have an easy solution to the problem of induction: use iid samples and only make inferences to restricted kinds of observables. When it doesn’t work it just shows it was an ill-formed problem. But perhaps it’s all you need for marketing, advertising, homeland security?

Thanks, Deborah. I’ve been linking to your blog as well.

Yes, marketing is a supreme consideration, due to the huge pressure to bring in research dollars.

There are tons of philosophical inconsistencies in machine learning circles. Is the data a sample from a population, or is it itself the entire population? Many ML people insist it is the latter, yet act as if it’s the former, by calculating p-values (!) and always talking about how the fitted model will perform on “new data” (huh?).

Well here are some links stemming from a conference with machine learners, stat people, and a few philosophers, but if you look at Cherkassky, you have to check my corrections of what he says at the bottom of the post.

http://errorstatistics.com/2012/06/26/deviates-sloths-and-exiles-philosophical-remarks-on-the-ockhams-razor-workshop/

http://errorstatistics.com/2012/07/06/vladimir-cherkassky-responds-on-foundations-of-simplicity/

Did you notice some connections between your links here and some points I made in today’s blog post? The point about mysticism particularly struck me, as it connects to my remark about “starry-eyed” CS people attributing superhuman powers to neural networks. (OK, a slight exaggeration on my part, but only slight.)

Also, from what I know of Vapnik and his followers (that latter word sounds mystical too!), they epitomize my point about denying the data is a sample from a population. I wonder whether he’d disagree.

One thing I’d point out is that Vapnik’s characterization of machine learning as producing “black boxes” is not quite correct. The popularity of the LASSO, for example, among ML people is a good counterexample, and of course models like the logit, mixtures of multivariate normals and so on show that parametric methods are common in ML too.

Very interesting, and while I can’t claim to understand everything you’ve written here, I am one of those “AP stats students” you refer to. And I agree with your points. I was “turned off” of stats (and math in general), even though my teacher was quite good, and got my undergrad degree in the arts. But I’ve recently rediscovered stats in graduate school and have imagined several times over what I would have liked my high school course to consist of.

The main thing I imagined in addition to what you have already said is one or more practical projects. The class can decide on something they would like to know about the student body and then make a plan for how to sample the student body and calculate the resulting statistics. Results could then be presented in the school newsletter or some other venue. Perhaps this draws on more areas than just statistics, but I am confident they will take more away from it.

The idea of a project is good, and was mentioned by a previous poster. But it MUST be accompanied by a solid understanding of the meaning of the methods that are applied in the project.

I’m totally confused by the concept of “best and brightest” graduate students, which you mentioned at least twice. Really, it doesn’t come down to enthusiasm/motivation/curiosity, and good instruction from the elders? It’s generic talent?

In my opinion, yes. (Some would disagree.)

Obviously mine is a minority view on AP statistics, and it’s almost entirely from a mother’s perspective. I encouraged my son to take it, but was prepared to have gripes about the text. But it turned out to be quite good. Not only was I surprised that it included a lot on testing model assumptions, it did a splendid job interpreting tests and confidence intervals—head and shoulders above what I read about every day in confused articles criticizing these methods. Maybe some of these people would have been much better off in AP stat than in whatever methods courses they picked up their statistical recipes. I wound up giving a guest lecture on philosophical foundations of statistics in my son’s course–I have no idea what the students got out of it, but the teacher gets credit for even inviting me. And, finally, I’m really thankful that I can talk statistics with my son–who wound up as a music major in college–thanks to AP statistics in H.S.

Glad to hear your son had such an insightful, dedicated teacher. Do you remember the title or author of the text?

My daughter took five or six AP courses (but not Stat). I was impressed by all the teachers, but I didn’t think any of the courses themselves were very good. I do believe that my daughter would have gotten much richer courses had she taken them in college.

It was Peck, Olsen and someone. It’s in my office at the university because I sometimes used it in a seminar I taught last spring.

From what you say and what I’ve generally heard, it means that the AP courses are better taken in college. The solution might be 3 years of high school, at least as an option for fast trackers. On the other hand, I know students who have been helped financially by having had so many AP credits in advance of college.

I touched upon nearly the same issues in my article in Liaison (the Statistical Society of Canada Newsletter), p.61

http://www.ssc.ca/webfm_send/1532

Needless to say, I agree with everything. If statisticians have a fault, it is that they have been uninterested in promoting the excitement of our work. The popularity of R is helping and I think it should be leveraged more in intro stat courses, which in turn can be made more fun and applied. Statistical literacy is another BIG problem, especially in organizations. Statistical certifications are largely unknown outside of research institutions. One largely unaffected field is that of medical research. Some say it’s because it has historically been a stronghold of statisticians. I actually think along the lines of Nate Silver when he wrote that bad models in this field kill people.

My turn to say “needless to say”: Needless to say, I think your linked article is outstanding, right on the mark.

Actually, the medical field has further harmed the image of Stat, with the recent papers on lack of reproducibility of studies. (They are not all medical, but most of the ones under scrutiny are, I believe.) So many statistical results just don’t replicate in new studies. Part of this is due to lack of awareness of things like lab-to-lab variability, and part due to lack of awareness of the multiple comparisons problem–both Stat issues.

I’m afraid I view Silver as worsening the problem too. Yes, yes, I agree that he has done a lot to (start to) make Stat look cool. But I have problems with his methods. First, he claims to be a Bayesian, meaning using subjective priors, even though he is not. And worse, he puts down people who object to Bayesian analysis. He does everything by intuition, which has worked well–in the U.S. He failed when he tried to predict UK elections.

Nice post.

Statistics may be like Operations Research and “AI”. The concepts get picked up for mainstream use, and are no longer credited to the original developers/field. (Voice recognition used to be considered AI, for example. So did vehicle routing.)

But the problem you discuss is more serious: poor understanding of the foundations, and lots of resulting errors, regardless of whose terminology.

Greetings!

I totally agree with your post. When I started taking courses in Machine Learning, I could see the focus on statistics and probability and I realized I did not have a good understanding of it.

I researched online and found the book Understanding Probability by Henk Tijms, which is a gem on this topic. After finishing this and brushing up on a few more stats concepts, I felt comfortable going through Elements of Statistical Learning. I have been working on text classification (sentiment analysis) in a startup, and now I thoroughly enjoy seeing these as problems in computational statistics and optimization rather than anything else.

(short bio: I am from India, graduated from IIT Bombay and excited about math, stats and algorithms!)

Great post. Concerning new teaching approaches to stats and the use of R I would recommend a look at “Start Teaching with R”, a preliminary edition written by Randall Pruim, Nicholas J. Horton and Daniel T. Kaplan; here is a link: http://cran.r-project.org/web/packages/mosaic/vignettes/V2StartTeaching.pdf

Your style is really unique compared to other folks I’ve read stuff from. Thanks for posting when you have the opportunity. Guess I will just bookmark this web site.

I can relate to the boring statistics programs. In my case the math department gave the statistics classes to CS students. By the time I took it I had been working with Rayleigh and Mie distributions to render atmospheric scattering in the Computer Graphics class, and using logistic regression in the Fractals class.

I came with a very high bar for the statistics class, which was very mathy (expected) but dull at best, disconnected from reality at worst. It is sadly no wonder students choose CS, as it is traditionally more centered on tinkering with the models, even if I believe it is no excuse for CS to dismiss the mathematical rigor usually found in statistics.

Until researchers/professors in the statistics field understand that, the CS field will continue taking over the data side and statistics will continue to recede into mathematics.

Machine learning is not “old wine in new bottles”. Statistics is concerned with theoretical results, proofs, correctness, tests, distributions, etc. Machine learning is generally not concerned with any of those questions (despite a brief invasion by theoreticians, which is thankfully over); what matters in machine learning is what works in practice, something that can be determined empirically. In different words, statistics is to machine learning roughly what pure math is to physics or biology, with the same kind of carping by the mathematicians/statisticians about how one should be taken more seriously.

There are almost no practically relevant or useful theoretical results in machine learning today. That’s a shame, but it is a failure of statisticians to produce useful theoretical results, not of people in machine learning to use them.

Actually, very little of academic statistics involves proofs and the like.

I disagree with your analogy. Statisticians use the LASSO, for instance, and machine learning people use the LASSO.

I don’t agree either that machine learning is “old wine in new bottles”. But when someone writes that statistics is “concerned with theoretical results, proofs, correctness, tests, distributions, etc”, it shows a fundamental misunderstanding of what statistics is or does. It reads like a hasty conclusion based on a Stat 101 course (by the way, many recognize that’s a problem, as the discussion on this post highlights).

From sampling methods to survival analysis, from experimental design to LASSO, from geostatistics to biostatistics, statistics is just as concerned with practical relevance; it’s also concerned with understanding when something does not work in practice and what the costs of that failure are.

We can’t get away from uncertainty, so we better have something that speaks its language: http://thisisstatistics.org/

There is clearly conflict abounding, revolving around a few higher-level questions. I’ll offer some answers.

Q: Is statistics a theoretical study anymore?

A: Yes, but only narrowly. If we look at the body of stats since WWII, what new results can be codified as truly “theory”, a la quantum theory for the physicist, for instance? Not a lot, I’d say. New ways to calculate squared differences, and there’s been a boat load of ’em, don’t count.

Q: Then stats is an applied practice, much like civil engineering?

A: Yes, as well. Most of what passes for stats, and most of those doing this work, are using calculations that are (or nearly) a century old or more. Much of what has been created since WWII could reasonably be analogized to better bulldozers and stronger concretes, not quantum theory (disagree as one sees fit). Geometers, bored with Euclid, created alternatives from whole cloth. We went Bayesian. And, of course, our financial quant brethren crashed the world, leaving a rather bad taste with those considering the field. If civil engineers routinely made bridges that fall down go boom, they’d have some difficulty getting young-uns to sign on, ya think?

Q: What should be the pedagogy to entice the young-uns into stat rather than CS or EE?

A: Much tougher to answer. Like EE, theoretical/math stat really is mostly math, with proofs. I suspect EE faces the same dilemma. CS is mostly about writing Java, or maybe C++. CS offers up a lottery ticket (WhatsApp, and such) to riches, if only one can coerce a commercially tasty little program out of one’s brain. Said program need not be especially productive to the commonweal. Oddly, CS, at bottom, is still just new ways to skin the Von Neumann cat. A field can procreate prodigiously on hoary foundations. Financial engineering quants can make outsized $$$, but the commonweal risks its own life in doing so (The Great Recession, London Whale, and such). To truly comprehend what the stat calcs are, and when their use is appropriate, requires knowing a good deal of probability and measure theory and so on. Lots o’ algebra there.

Q: So far as AP stat courses are concerned, are students in AP math say, also unhappy?

A: Good question. I suspect so. The point of AP courses is to build the knowledge foundation needed in the subject, so as to allow the student to take more advanced courses when in college; skip the freshman routine. In other words, the hoary parts that college instructors have had to force-feed freshmen for decades; they’re really tired of having to do that. (I sure didn’t like doing it as a TA.) The courses aren’t, on the whole, based on expanding the student’s self-expression in the field of study. They don’t yet know enough to self-express (note the contrast with CS???). But leave that answer to the discussion.

Q: Is stats headed the way of COBOL?

A: Possibly. The COBOL coders (many of whom are IIT freshers in Fortune X00 companies; I was present for the invasion in one such company) keep old code functioning by adding more arms and legs to a corpse, since rewriting into C++/Java/PHP (on mainframes, nearly always) is viewed as a waste of money. There’s lots of useful work being done by COBOL code, just as there’s lots of useful work being done with long-known stat routines. CRAN has passed 6,000, IIRC. That’s a lot of wrenches in the toolbox. Maintenance is viewed as bottom-of-the-barrel in nearly any brain-work, so there’s going to be plenty of that in stats. COBOLers who deeply know a 30 or 40 year old application (and the idiom of COBOL used to create it) can make a very comfortable living. They just don’t get to “do it my way” all that much. To the extent that stats devolves to such a state, keeping the young-uns out does reduce the competition for wages. Perhaps we should embrace our obsolescence?

Hello There. I found your blog using msn. This is an extremely well written article. I’ll make sure to bookmark it and come back to read more of your useful information. Thanks for the post. I’ll definitely come back.

It’s not my first time to pay a visit to this website; I am visiting this site daily and get fastidious data from here all the time.

You can become a follower, and thus be notified of postings, which are sporadic. Or, better, sign up for the R-bloggers blog digest, an excellent service.

FYI, the Sept. 7 and 12 comments are spam, not real. They are totally generic; they could be posted on any blog. My wordpress.com filter let through a wave of this garbage; something happened to their filter.

Hello There. I found your blog using msn. This is a really well written article. I’ll be sure to bookmark it and return to read more of your useful info. Thanks for the post. I will certainly come back.

Thank you, I’ve recently been searching for info approximately this subject for a while and yours is the best I have come upon till now. However, what concerning the bottom line? Are you certain concerning the source?

Not sure what you mean by “source.”

I just like the helpful information you provide for your articles. I’ll bookmark your weblog and check once more right here frequently. I am relatively sure I will be told a lot of new stuff proper here! Good luck for the following!

This is brilliantly written.

I am a CS guy and just started using Pandas and Sci-kit by learning at work. Statistics is an interesting field and we sure have missed a lot since school.

Thanks for the comments, and welcome to Statistics. 🙂

Haha. This could be strange, I’m a prospective applicant to the graduate program at UCD, looking forward to learning and working with you 🙂

Great publish, very informative. I wonder why the other experts of this sector don’t understand this. You should continue your writing. I am sure you have a great readers’ base already!

Thanks for the nice words.

Reblogged this on Scottedwards2000’s Weblog.

Reblogged this on kannan dreams.

I believe that is among the most important info for me. And I’m happy studying your article. But wanna commentary on some normal issues: the site style is wonderful, the articles is in point of fact excellent :D. Just right activity, cheers