Hi Norm.
I’ve just had a look at this post, and as a statistician I identify strongly with some of the comments you make.
For example, I agree a lot with what you say about the statistical correctness of R and its learning curve. Definitely, it’s much easier to learn when most of the tools you need are in the base package.
Also, I would say that Python is more advanced in the ML community and the packages they offer. However, I haven’t seen Python packages related to design and analysis of complex survey sampling. This is definitely a win for R with the ‘sampling’ and ‘survey’ packages.
On “language unity,” I’d be interested to hear more about your views of the tidyverse, as I find myself using functions from dplyr frequently in my data scientist job. Are there alternatives in base R or other packages that you prefer?
I really don’t consider dplyr to be in the tidyverse, though some would disagree.
I agree with pretty much all of what you said. As a statistician, it does bother me that data science doesn’t seem to acknowledge the shoulders of the statistical giants they are standing on, and your comment about standardising to 0 mean and variance 1 makes me believe it is out of ignorance (and perhaps hubris) as much as anything else. I read a data science blog once that said that the normal distribution is another name for the binomial! That’s not misleading (actually, it is); it is just plain wrong. I also cannot agree with you more on this tidyverse stuff. It feels like they’ve forked (or a word that sounds similar…) R rather than integrated with it.
Thanks for your blog, it is nice to have the perspective of someone who is clearly across it rather than the usual cacophony.
I agree with both points, especially the Tidyverse. Why complicate things further (and take up more grey matter) when existing Base functions work very well?
The only really controversial answer to me is the one to “language unity”. I don’t think the dialects are more unintelligible to each other than pandas to Python. Data scientists “live” in pandas, whose method chaining looks a lot like dplyr. It is not very Pythonic, but it’s ok. One can switch back and forth.
Also, I think the tidyverse is *vastly* superior to ordinary R. It is so simple that it can be learned without knowing much R. It has consistent syntax across many base functions: manipulation, presentation, graphing, data handling, list comprehensions. And it is memorable. I still forget aspects of data.table. The tidyverse is easy to remember or to guess. Ultimately, it’s up to the users to load the packages. RStudio doesn’t enforce a behavior, and neither does Hadley Wickham. Many users vote with their library(.) expressions.
“I don’t think the dialects are more unintelligible to each other than pandas to Python…[the tidyverse] can be learned without knowing much R.” Isn’t there a contradiction in there? Doesn’t that second statement support my point about mutual unintelligibility?
data.table is not base R. I heartily agree with Prof. Matloff on this post, including the “language unity” component. I find the Tidyverse less memorable than base R, and its myriad functions to implement routine, if a little esoteric syntactically, data manipulation operations using quirky function names unwieldy and unnecessary. I also find the pipe operator far less readable than plain R’s compositional syntax; and that the Tidyverse–which came later–will not necessarily work well with base R only widens the divide. I do wonder whether a background in mathematics (and statistics) leads to more comfort with the standard compositional syntax (i.e., the way mathematical notation works), while users who are not as familiar with mathematical notation (and “reading” or “thinking” this way) find the “pipe” organization more intuitive, being more “natural language” or “cooking recipe” format. Researchers I work with from non-mathematical fields, such as epidemiology and biology, do sometimes find the pipe operator more accessible than the base R, old-style syntax. I am a long-time S/S+/R user (~30 years), so I also have to admit that some of my perspective may be skewed simply from years of familiarity with the base syntax, though I do adopt changes that I find more efficient (e.g., dplyr’s arrange and summarize functions).
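To make the two styles contrasted above concrete, here is a minimal sketch in Python rather than R. The `pipe` helper and the `drop_missing` and `mean` functions are hypothetical names invented for this illustration; they come from no package mentioned in the thread. The same computation is written once as nested composition, g(f(x)), and once as a left-to-right pipeline.

```python
from functools import reduce


def pipe(value, *funcs):
    """Feed `value` through `funcs` left to right (hypothetical helper)."""
    return reduce(lambda acc, fn: fn(acc), funcs, value)


def drop_missing(xs):
    """Return the items of `xs` that are not None."""
    return [x for x in xs if x is not None]


def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


data = [4.0, None, 6.0, 8.0]

# Compositional style: read from the inside out, as in mathematical notation.
nested = round(mean(drop_missing(data)), 1)

# Pipeline style: read left to right, like the steps of a recipe.
piped = pipe(data, drop_missing, mean, lambda m: round(m, 1))

print(nested, piped)
```

Both forms compute the same value; the difference is only the order in which a reader meets the steps, which is the point under dispute here.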
I think it’s highly insulting to R novices to suggest that they are incapable of learning and using g(f(x)), as RStudio seems to think. It’s not that deep!
I did not say or mean to imply “incapable” (though perhaps that’s RStudio’s operational position). Rather, I meant it from the perspective of learning curve and already-developed intuition. I meant that people of different backgrounds are now routinely using tools not traditionally used in their disciplines, and adopting one language interface may be more intuitive than adopting another simply due to experience.
That’s pretty abstract. I’ll say this, which I said in the Twitter discussion: I’ve taught several different subjects — math, stat, CS and even ESL. In the latter, I taught the lowest level, to adult students of very limited education. So my experience on teaching and learning is not just for CS students. I’ve given deep thought to the learning process over many, many years of teaching. So I think I do have at least some perspective here. By contrast, the people at RStudio have little or no teaching experience.
What are your thoughts about Julia? Do you think it’s likely that it will replace R in the medium term? Recently, Julia’s popularity experienced a steep increase, as far as I’m aware. My current view is that it all depends on the packages. And there seems to be no slowdown in published packages for R at the moment.
I’ve written about Julia in this blog before, at a time (3 years ago) when I thought Julia would really take off in Data Science. I no longer think that.
As a systems guy, I’m really impressed by Julia; there are so many ways to tweak performance. But for data science, no. It’s not written by data scientists, in contrast to R, which is my complaint about Python as well.
Hi Norm,
Thanks for this. It’s very interesting.
Personally, I think you under-cooked the statistical completeness section. One tangible benefit of R is the help pages. I guess because of its by-and-for-academic-statisticians legacy, these are very good at pointing you at the definitive algorithm and/or a definitive reference. Likewise, authors of definitive texts publish an R package alongside their work. As an applied statistician, I think you are *far* more likely to find a credible package that supports your work (generalised linear mixed models are my main interest) in R. However, if your data are in an SQL database and you want a statistical analysis, I think the answer may well be in CRAN.
However, I think Python wins if you are working with non-rectangular data – I think it’s much better at custom manipulations. Finally, and being really nebulous, I think Python lends itself to better coding practices. I get the impression unit testing is more established in Python, and whilst I initially hated the Fortran-like dependence on formatting, it makes it easier to apply style rules and use linters.
Paul
Very interesting comments.
Yes, the online help for R is much better than for Python, and the custom (though not a requirement) of having vignettes furthers that point.
I hear the claim “Python is better for nonrectangular data” a lot, and I can never see it. The case commonly cited is image data — which is rectangular! I did mention that R’s lack of pointers makes it hard to deal with things like binary trees; maybe you meant that kind of thing?
I don’t think so. Most data scientists will be productive knowing only a subset of Python’s and R’s syntax, and a very small subset of their libraries. The tidyverse is just such a subset of existing R, magnified. I think the goal of the Tidyverse was to provide a large set of capabilities by adopting a simple and intuitive syntax extension (piping) and a data organization (tidy data frames). A base R user can easily read tidyverse code. A tidyverse user *can* read base R code, and mix it with tidyverse code effortlessly. Personally, I haven’t used lapply in a while since the introduction of purrr, but the cognitive effort to switch is minimal; I still use the `[` operator all the time, and sometimes loops. This is not unlike choosing between loops, list comprehensions and group-like methods in Python, and certainly nothing close to the effort needed to switch between C++ OOP and template metaprogramming, or OOP vs functional programming in Scala (*those* are mutually unintelligible).
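As an illustration of the Python side of that comparison (not part of the original comment), the sketch below computes the same per-group totals three ways: an explicit loop, a dict comprehension, and a group-style operation. The `sales` data is made up for the example, and the standard library’s `itertools.groupby` stands in for the “group-like methods” mentioned above, which is an assumption about what the commenter had in mind.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Made-up (region, amount) records, purely for illustration.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]

# 1. Explicit loop: accumulate a running total per region.
totals_loop = defaultdict(int)
for region, amount in sales:
    totals_loop[region] += amount

# 2. Dict comprehension: one summed generator expression per distinct region.
regions = {region for region, _ in sales}
totals_comp = {
    r: sum(amount for region, amount in sales if region == r)
    for r in regions
}

# 3. Group-style: sort by region, then sum within each group.
#    groupby only merges adjacent equal keys, so the sort is required.
by_region = sorted(sales, key=itemgetter(0))
totals_group = {
    region: sum(amount for _, amount in rows)
    for region, rows in groupby(by_region, key=itemgetter(0))
}

print(dict(totals_loop), totals_comp, totals_group)
```

All three give the same totals. Whether moving between them costs more or less effort than moving between base R and the tidyverse is exactly what this thread disagrees about.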
A new user reading your post may think that the tidyverse is a negative for R. Instead, it’s a huge positive and a competitive advantage of its ecosystem. That’s the only reason I felt the need to comment on your post, and I hope that the new user will read my arguments and others’ as a counterpoint to yours. And maybe they will reconsider.
Talking about contradictory statements, I find your “I really don’t consider dplyr to be in the tidyverse, though some would disagree” odd. dplyr and purrr are the quintessential components of the tidyverse, more so than ggplot2 (which predates it, and it shows), and tibble/readr/tidyr (which can be easily used as non-tidyverse packages, i.e., in a base R context).
Summing up: a) I don’t think R is being splintered and complicated by the tidyverse, the way C++ was by its language extensions and programming styles; b) I am confident that, without the tidyverse, R would have fallen into irrelevancy and been abandoned in favor of Python; c) if imitation is the sincerest form of flattery, then the addition of pipes and dplyr verbs to pandas and the increased adoption of method chaining in Python speak volumes.
I cannot read tidy code. Chock full of calls to functions that I don’t know.
Your comment that without Tidyverse, R would have lost (further) ground to Python baffles me.
I would also add “useless verbosity: Loss for Python” to the list. I just debugged someone else’s Python code, and one line in particular struck me – using Python data structures and syntax, it took essentially 76 characters and six functions + methods to do a simple operation that could have been done in R in 8 characters, including spaces.
Congrats, that is a really helpful comparison. I always wondered if I should switch to Python, as a lot of people say it’s “much better”, so an unbiased comparison like this helps a lot!
Also good to hear a statement about the tidyverse.
Thank you, Dr. Matloff, for your reasoned comparison. I’ve felt for the past 2 years that the tidyverse vs base divide has not served R well, given the inroads Python has made into data science adoption.
Having learned Python before R (I switched full time 3 years ago), for applied statistical modeling & general data science for marketing analytics business use, I found the biggest difference to be the speed of “thought”.
R, especially base, always feels more natural & intuitive to the thought process of solving a business problem using data.
I found, in contrast, Python to be much more clunky in terms of forcing me away from the analytical thought process to become more of a programmer mindset.
This was especially true of the need to write far more code & custom functions than in R (base).
While I realize this point of view may be in the minority these days, I’m sticking to base R as it does the job, but without all of the other unnecessary cognitive overhead not desired in solving business challenges.