I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy.
As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once proficient, find that they are only equipped to work in a very narrow range of operations. As a result, we see a flurry of online questions from Tidy users asking “How do I do such-and-such,” when a base-R solution would be simple and straightforward.
I believe the examples here illustrate that base-R solutions tend to be simpler, and thus that base-R is a better vehicle for R learners. However, another use of this document would be as a tutorial for base-R users who want to learn Tidy, and vice versa.
19 thoughts on “Base-R and Tidyverse Code, Side-by-Side”
You don’t seem to know `dplyr` very well. And there’s a lot of typo
“A lot of typo” is itself a typo. 🙂 If you have improvements to the dplyr code, please let me know.
I am not a dplyr user, but keep in mind that good usage is in the eye of the beholder. If you have alternative code you think is better, please let me know.
This was really interesting, and eye-opening, in a way. I started out in R using base R for a couple years before switching over to the tidyverse. I felt like it made my life a lot easier to use the tidyverse instead. So this post surprised me. What do you think about showcasing an entire data wrangling workflow, of extensive data wrangling, from beginning to end, in base R and tidyverse? I think that’s why I prefer tidyverse, because of the pipe and the way my brain works step by step, where I can perform as many operations as I want in one big pipe-chain of commands, with very little redundant code. I also think there are a lot of cases where more complex case_when changes would be easier in tidyverse, which is a common thing to encounter in real world data.
That’s an excellent idea, showing an entire data wrangling session. I don’t have time for it now, but I believe that base-R would come out the winner by far. Re pipes, I actually oppose teaching them to beginners, whether in base-R or Tidy form. See my full Tidyverse Skeptic writeup, for the reasons. Re your point on case_when(), I totally agree with you, and in fact note at the start of the document that I don’t claim base-R is ALWAYS easier. That is why I recommend teaching a mix, right?
It would be interesting to do an entire end-to-end analysis, but I just don’t have the time for it. As for pipes, I write in my essay that they tend to be difficult to debug. Debugging is a huge issue, and I once wrote a book about it.
Also, for code clarity, instead of h | g | f(a) I think it’s much better to do
tmp <- f(a)
tmp <- g(tmp)
tmp <- h(tmp)
to give both the programmer and others who read the code a chance to "catch their breath" in writing/reading the code, and of course that solves the debugging issue.
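As a minimal sketch of the contrast described above, the following uses three hypothetical helper functions f, g and h (stand-ins for illustration, not code from the writeup) and the native |> pipe, which assumes R 4.1 or later. All three forms compute the same value, but only the step-by-step form leaves an intermediate result that can be inspected along the way.

```r
# Hypothetical stand-in functions, for illustration only
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) sqrt(x)

a <- 7

# Nested form: has to be read inside-out
nested <- h(g(f(a)))

# Piped form (native pipe, assumes R >= 4.1): reads left to right
piped <- a |> f() |> g() |> h()

# Step-by-step form: each intermediate value can be printed or
# examined in a debugger before moving on to the next step
tmp <- f(a)
tmp <- g(tmp)
tmp <- h(tmp)

identical(nested, piped)  # TRUE
identical(nested, tmp)    # TRUE
```

Reusing a single name such as tmp keeps the workspace small; distinct names for each step would work equally well if the intermediate values are needed later.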
I also see students struggling with the tidyverse, and much coding is based on wild copy-pasting rather than recalling and understanding the code. I am very thankful for your initiative!
A few remarks:
in base R you might use with() and within() to remove redundant mentions of the name of the data.frame, e.g.
gr <- with(mtcars, tapply(mpg, list(cyl,am), mean))
mtcars <- within(mtcars, hwratio <- hp/wt)
And to me there seems to be a much simpler solution to the last example (combines use of tidy and base):
mtcars$gear <- c("","","three","four","five")[mtcars$gear]
This will work with a named vector as well, if the indexing vector is a character vector. It's also possible to go via a factor, which might be easier to read (one may coerce the result into a character vector if needed):
mtcars$gear <- factor(mtcars$gear,
                      levels = c("3","4","5"),
                      labels = c("three","four","five"))
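The named-vector variant mentioned above might look like the following sketch. It works on a copy of the built-in mtcars data, so the original data frame stays untouched; the names cars and lookup are illustrative only and do not come from the writeup.

```r
# Work on a copy so the built-in mtcars is not modified
cars <- mtcars

# Named lookup vector: names are the old codes, values are the new labels
lookup <- c("3" = "three", "4" = "four", "5" = "five")

# Index by character so that matching is done by name, not by position;
# unname() drops the leftover names from the result
cars$gear <- unname(lookup[as.character(cars$gear)])

table(cars$gear)
```

Compared with the positional version, the lookup vector needs no empty placeholder entries, at the cost of the as.character() conversion.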
Yes, I considered using with() and within(), but decided that they qualify as “advanced,” and I had promised to avoid using “advanced” functions because the theme here is what is the best way to teach R beginners with no coding background. For the same reason, I use “simpler” to also mean “more straightforward,” which again would exclude the alternatives you give here IMO.
I should add: I’m up there in years. I took my first programming course in 1966! I’ve taught a lot of programming students at various levels, worked in the open source realm (not just R), have worked in industry managing programmers, have seen tons of programming languages, etc. That does not mean my views are worthier than those of others. My point, though, is that my views do come from very extensive experience, not from some offhand comments.
I agree the tidyverse is too opinionated and too verbose, with words and verbs everywhere. Also, the long tunnels of piped instructions it promotes are bad when you want to modify part of one or simply debug. Your document is interesting because it goes against the current idea that the tidyverse is necessarily better. In effect, when you refactor code to make it maintainable, the monolithic tunnels of pipes tend to disappear in favor of small functions, and you naturally return to more base R and fewer dependencies.
Hi Mr. Matloff, thank you for your insight. I am a beginner. I have a hard time with the apply family of functions, although I get that your base code is quite direct. I concur that those across calls in tidyverse are hard to follow. Even so, maybe because I find the tidyverse separates layers of thought explicitly (for example, grouping with group_by instead of an unnamed parameter), with the tidyverse I end up getting a faster idea of what the code is trying to accomplish and what shape the result will have.
By teaching a mix of Tidy and base-R, the instructor would equip students with a broadly-applicable set of tools. Each person can then use whatever tool he/she personally feels is best in any given situation. Reasonable? BTW, if you are referring to tapply(), its arguments are indeed named, as is the case for the other base-R functions.
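For reference, here is a small sketch of tapply() called first with positional arguments and then with its arguments named. X, INDEX and FUN are the documented argument names; mtcars is used only as a convenient built-in data set, and by_pos and by_name are illustrative names.

```r
# Mean mpg by number of cylinders, with positional arguments
by_pos <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The same call with the arguments named, which spells out
# what is summarized (X), what it is grouped by (INDEX), and how (FUN)
by_name <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)

identical(by_pos, by_name)  # TRUE
names(by_name)              # "4" "6" "8"
```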
Yes, I believe that the base of knowledge has to be base R. I know I have to get stronger with those functions… I was referring to tapply, vapply, lapply, sapply and also the map set of functions in tidyverse. What tidyverse simplifies for me in that regard is that it is explicit that it is grouping by a certain field, and it can’t be implicit, while you could code without making it explicit.
Once again, you should use whatever coding style you feel most comfortable with. To me, though, in my examples it’s the base-R code that is explicit while the Tidy code is not so much so. To add the vector x to the data frame d, writing d$x <- x is as explicit as one can get.
I wouldn’t mind teaching a mix in principle as you suggest but especially when teaching non-computer scientists, there’s not enough time as it is, and the “Tidyverse” simply eats up too much resource. Even ggplot2 is overkill, by comparison with base-R’s alternatives – if your goal is to teach the basics so that students can move on to bigger and better things on their own, and use their existing domain knowledge and interests to do what R was originally invented for, it seems, exploratory data analysis and efficient data visualization. Having said that, looking forward to using your new “The Art of Machine Learning” with R in class next spring.
I agree that even ggplot2 is overkill, but it is the most-cited reason for using the Tidyverse, even though it really isn’t part of the Tidyverse, just described that way for marketing. ML is in the final editing stages; looking forward to your comments.
Thank you for your comparison of base R and Tidy approaches for beginners. I used a mix of approaches when developing analytical code for an emergency service. I used dplyr a lot not so much because it was ‘better’, whatever better might be, but because other analysts in the team who were not skilled in R could to an extent understand the approach a little easier than they could when the base R apply functions were used. However, R itself proved in many cases simply too complex for non-programmers in my team, and regardless of use of the Tidyverse or not R in my analytical team came to be seen as a specialist backwater compared to using non-programming data visualisation tools which could be hooked up to relational databases and Excel spreadsheets relatively easily.
Very interesting comments, thanks. The notion of getting people to merely READ code, rather than WRITE it is an especially interesting angle. Of course, my first thought on that is to raise the question as to whether a simplistic read can lead to misunderstanding. Lots of food for thought here.