I’m nearing completion of writing my new book, The Art of Machine Learning: Algorithms+Data+R, to be published by the whimsically named No Starch Press. I’m making a rough, partial draft available, and welcome corrections, suggestions and comments.
I’ve been considering doing such a project for some time, intending to write a book that would on the one hand serve as “machine learning for the masses” while ardently avoiding being of a “cookbook” nature. In other words, the book has two goals:
- The math content is kept to a minimum. Readers need only be able to understand scatter plots and the like, and know the concept of the slope of a line. (For readers who wish to delve into the math, a Math Companion document will be available.)
- There is strong emphasis on building a solid intuitive understanding of the methods, empowering the reader to conduct effective, penetrating ML analysis.
As I write in the preface (“To the Reader”),
“Those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight.”
The language of instruction is R, using standard CRAN packages. But as I also write,
“…this is a book on ML, not a book on using R in ML. True, R plays a major supporting role and we use prominent R packages for ML throughout the book, with code on almost every page. But in order to be able to use ML well, the reader should focus on the structure and interpretation of the ML models themselves; R is just a tool toward that end.”
So, take a look, and let me know what you think!
Thanks for sharing the draft. I like what I see so far. I've been a big fan of your book “The Art of R Programming”. When is this book going to be published? Thanks, Eddie.
Thanks for the comments. Not sure about the timing. I expect to be done with the remaining chapters by the end of the month, after which the editing process begins. Maybe end of the year.
I’m eagerly awaiting your new book. I see that you have invested considerable time in expanding the regtools package for machine learning, so I have to be patient. Do you have a new ETA for the book that you can share, Dr. Matloff?
Yes, very sorry for the delay. Yes, I actually have redone all the examples using my new regtools functions. This will make things very easy for readers! I should be able to get the redone manuscript to the publisher within 4-5 weeks. Thanks so much for your interest and your patience!
I started reading this and on the first chapter I get the following, after I installed regtools:
> library(regtools)
Error: package ‘mvtnorm’ required by ‘regtools’ could not be found
> data(day1)
Warning message:
In data(day1) : data set ‘day1’ not found
I also installed mvtnorm but I still got the same message afterwards.
You need the latest regtools, from github.com/matloff.
I tried to download from GitHub but was not able to. If you want people to read the book, you will need to fix these bugs. If people are having issues following the code on page 4, they will stop bothering with it, like I will.
Please explain exactly how you tried with github.
Well, when I get to GitHub, I see a regtools link, click on that, and I see information about regtools but nothing about downloading the package. You have to understand, I have been using R for several years in school, and all packages I have ever used I have installed via install.packages(), so I don't know what to do unless I have clear instructions.
First install devtools, e.g. install.packages("devtools")
library(devtools)
install_github('matloff/regtools')
I would suggest putting this R code in the text. I also had trouble and downloaded it from CRAN, which had a different dataset name (day vs. day1) and some different values than the book
Only just starting it, but really simple and easy to follow so far. Would suggest labelling the temperature units in the Section 1.6 example (28 degrees is very different in C vs. F).
Thanks. I’ll add an appendix on package installation.
It needed package “glue” too, apparently…
PS: I needed double quotes!
This worked:
install_github("matloff/regtools")
Thanks and good luck with the book!
That’s strange. What OS are you using?
You’ll be missing out on some good material then.
Section 1.2.3 typo ‘marix’ instead of matrix.
This contribution looks like just what I need to better understand the reason for using various techniques in ML, and to help me decide the best approach to the problems I want to solve. Thank you.
Thanks!
Thanks for sharing this, I just began reading it but already looks good.
One small comment, in section 1.10.4 Scaling, ‘ […] we divide each predictor/feature by its standard deviation. This gives everything a standard deviation of 1. We also subtract the mean[…] ‘.
Even though the sentence doesn’t explicitly claim an order, mean subtraction is the first step, so maybe reordering the sentence would make it clearer?
Good point.
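To make the order concrete, here is a minimal base-R sketch using made-up numbers rather than the book's data. Centering first and then dividing by the standard deviation gives each column a mean of 0 and a standard deviation of 1, which is also what scale() does by default.

```r
# Toy predictor matrix (made-up values, not from the book's data)
x <- cbind(temp = c(10, 15, 20, 25, 30),
           wind = c(0.1, 0.3, 0.2, 0.5, 0.4))

# Step 1: subtract each column's mean
centered <- sweep(x, 2, colMeans(x), "-")
# Step 2: divide each column by its standard deviation
scaled <- sweep(centered, 2, apply(x, 2, sd), "/")

colMeans(scaled)      # each column mean is (numerically) 0
apply(scaled, 2, sd)  # each column standard deviation is 1

# scale() performs both steps, in the same order, by default
scaled2 <- scale(x)
```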
Hello Norm, there are some issues installing regtools from Github (v1.2.1). I already have regtools 1.1 on my machine (installed with no issues through CRAN) but it does not have day1 dataset available. Any plans on providing the latest regtools through CRAN?
See my preceding comment here as to how to install.
The package is certainly due for an update on CRAN. There is a lot involved there, but I’ll try to find time this evening for it.
Uninstalled the old version of regtools and installed the github version afresh. Installed successfully with no issues. Thanks.
So far, great book! Trying to run the code in 1.10.1.1
day1x <- day1[,1:5]
tot <- day1$tot
knnout <- knn(day1x,tot,c(1,12.0,11.8,0.23,5),5)
knnout
Receiving "Error in knn(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) :
'train' and 'class' have different lengths"
Can't find the problem.
Not sure I replied. Please update your regtools.
Tried updating regtools from GitHub. Got an error message to install Rtools. Tried to install Rtools but received a message that it is not available for R version 4.0.0.
I don’t have the updated regtools on CRAN yet. Please go to github.com/matloff/regtools
I installed regtools from GitHub and the installation went well. However, I am still receiving the same message: “Error in knn(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) : ‘train’ and ‘class’ have different lengths”
The correct function name is ‘kNN’.
Sorry, but when I write kNN I receive “Error in kNN(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) :
could not find function “kNN”
Sounds like you did not execute “library(regtools)”.
Of course I loaded regtools.
I ran lsf.str("package:regtools") and didn’t find a function called knn or kNN.
I am sorry, but I just don’t understand what is going on.
Please contact me by e-mail.
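For readers who hit the same wall, a small diagnostic sketch may help. The error text “'train' and 'class' have different lengths” appears to come from knn() in the class package rather than from regtools, and a missing kNN() suggests that an older regtools is the one being loaded. The helper name pkg_has_function below is hypothetical and is not part of regtools.

```r
# Hypothetical helper (not part of regtools): report whether a package is
# installed, which version it is, and whether it exports a given function.
pkg_has_function <- function(pkg, fn) {
  installed <- requireNamespace(pkg, quietly = TRUE)
  list(
    installed = installed,
    version   = if (installed) as.character(utils::packageVersion(pkg)) else NA_character_,
    exported  = installed && fn %in% getNamespaceExports(pkg)
  )
}

# Demonstrated on a package that ships with R
pkg_has_function("stats", "median")

# For the case in this thread one would call:
# pkg_has_function("regtools", "kNN")
```

If exported comes back FALSE for kNN, reinstalling from GitHub in a fresh R session is the next thing to try.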
Running the code at the top of page 20 produces the following error:
knnout <- kNN(day1x,tot,newx=day1x,8,allK=TRUE)
Error in hasFactors(newwx) : object 'newwx' not found
The problem may be on line 77 of Nonpar.R file:
if (is.factor(newx) || is.data.frame(newx) && hasFactors(newwx))
stop('change to dummies, factorsToDummies()')
The hasFactors(newwx) argument should be hasFactors(newx)?
Fixed. Thanks!
I ran the following code:
knnout <- kNN(x = day1x, y = tot, newx = day1x, kmax = 8, allK = TRUE)
after reinstalling regtools from github with the devtools package (Aug 2), the function still throws the following:
Error in hasFactors(newwx) : object 'newwx' not found.
Do I have to wait for the fix to make its way into regtools?
BTW, the book is the most lucid treatment of ML I've encountered. Looking forward to placing the printed version on my shelf.
Sorry, I thought I fixed this. Anyway, fixed now. Thanks for the nice comment.
Hi, I am looking forward to this book. I intend to use your other book, Probability and Statistics for Data Science / Math + R + Data (CRC Press), as one of the textbooks for a data science intro course this fall. Next summer, I want to follow up with an intro course on ML, so this book would be very welcome! Question: do you have sample answers to the exercises in the earlier “Probability and Statistics”? Would love to check if I got it right 😉 Also, really enjoyed your “TidyverseSceptic” essay. Most of the information was new to me, and I appreciated your open-source call to arms immensely! Sorry to hog your comment space here … Thanks and cheers from Berlin
Thanks very much for the comments and support. I do have some sample solutions to exercises at heather.cs.ucdavis.edu/PSDS, still only partially complete. If there is an exercise you are particularly interested in, please contact me.
Thank you! Will do. Good luck with finishing the other book. Looking forward to it.
Will there be any insights to tuning xgboost?
The primary goal of the book is for the reader to get a good practical UNDERSTANDING of ML concepts. Facility with any particular library package is NOT a goal, though the book does use major packages.
Thanks. Just did a project laying a minefield, then walking into it.
Google believes machine learning has helped companies a lot in integrating data into their core businesses, adding meaning and significant value. Interest has grown considerably, increasing more than fourfold in the last 5 years.
Love the ML book. Unfortunately I am only able to install regtools from CRAN, and that is version 1.1, so for example the kNN function is not found.
When I run …
library(devtools)
install_github("matloff/regtools")
… I receive the following error message (any direction would be appreciated):
Installing package into ‘C:/Users/greg.blevins/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package ‘regtools’ …
** using staged installation
** R
** data
** inst
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package ‘FNN’ was built under R version 4.0.3
Execution halted
ERROR: lazy loading failed for package ‘regtools’
* removing ‘C:/Users/greg.blevins/Documents/R/win-library/4.0/regtools’
Error: Failed to install ‘regtools’ from GitHub:
(converted from warning) installation of package ‘C:/Users/GREG~1.BLE/AppData/Local/Temp/RtmpOwaJbC/fileef476121a86/regtools_1.5.0.tar.gz’ had non-zero exit status
You apparently need to update FNN.
Dear Prof
Thank you for a wonderful book. It is logical, concise and reads very easily. I reckon it will remain my favourite ML book.
I am working through the examples. On page 20 you introduce the argument allK for the kNN() function. I receive the following message: “Error in kNN(day1x, tot, day1, 8, allK = TRUE) :
allK option currenttly disable”.
Please tell me how I should enable allK.
Thank you
Kind regards
Joe Lippert
Thanks for your support, very much appreciated!
The allK option became too cumbersome to maintain. But I’ve now added kNNallK(), a restored previous version of kNN().
Thank you for your response.
Kind regards
Dear Prof
I am steadily continuing to read and test your code. So far, so good.
I am stuck at 3.1.3.1, pages 51/52.
My code is:
## Load the dataset. Due to the size I chose fread() ####
library(data.table)  # provides fread()
library(regtools)    # provides kNN()
songs <- fread("YearPredictionMSD.txt", header = FALSE)
# Look at the dimensions of the dataset
dim(songs)
songs[1, ] # V1 is the outcome. The rest (V2:V91) are the features.
# How long does it take to predict one data point – the first row of data
system.time(kNN(songs[, -1], songs[[1]], songs[1, -1], 25))
# let's apply dimension reduction via PCA
newx <- songs[1, -1]
newx[1, ] <- 32.6
knn_songs <- kNN(songs[, -1], songs[[1]], newx = newx, 50, PCAcomps = 20)
The response I receive is:
Error in kNN(songs[, -1], songs[[1]], newx = newx, 50, PCAcomps = 20) :
PCA now must be done separately
Please assist
Kind regards
Joe Lippert
Yes, the kNN() function does not do its own PCA now. You must apply PCA yourself, say using prcomp(). However, regtools now has new functions in a qe*-series, including qeKNN(). I will be adding a PCA capability to the functions in that series within the next few days.
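Here is a minimal sketch of doing the PCA step separately with base R's prcomp(), on synthetic data standing in for the song data. The number of components is arbitrary. The final kNN() call is shown only as a comment, with the argument order taken from the commenter's code above; whether it matches the installed regtools version is an assumption.

```r
set.seed(1)
# Synthetic stand-in: 200 rows, 10 features, numeric outcome y
x <- matrix(rnorm(200 * 10), nrow = 200)
colnames(x) <- paste0("V", 2:11)
y <- rnorm(200)
newx <- matrix(rnorm(10), nrow = 1)
colnames(newx) <- colnames(x)

# Fit PCA on the training features (centered and scaled)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

ncomps <- 3  # number of principal components to keep (arbitrary here)
x_pc <- pca$x[, 1:ncomps, drop = FALSE]

# Project the new point with the SAME centering, scaling and rotation
newx_pc <- predict(pca, newdata = newx)[, 1:ncomps, drop = FALSE]

# Then pass the reduced data to the k-NN routine, e.g. (not run here):
# kNN(x_pc, y, newx_pc, 50)
```

The point of going through predict() is that the new case is transformed with the centering, scaling and rotation learned from the training data, rather than being rescaled on its own.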
the smoothingFtn=median option on page 17 results in regests being just the index of the first neighbour
regtools 1.5.1
Thanks, fixed now, both in the code and the man page.
For errata or corrections if still possible:
ms <- fread( … : It takes a little guessing to download and unzip the music file, but it is possible of course.
Page 109: "in spite" seems an illogical term.
Section 7.6.3: data = yr. yr is unknown to the reader; it is likely ms.
Same section 7.6.3: using y = yr[idxs, 1] results in an error if the data is still a data.table (from fread) and not a data.frame.
The listings.csv on the Airbnb site is not in line with the one used in the book; on the site it's a summary file. The better file is likely the gz one, which can easily be imported with fread.
None of the Airbnb price entries are ' ' (empty) as of October 2020, and square_feet has been removed on their side. There is no weekly_price or monthly_price. Some columns in the listings gz file are all NA.
For 7.7.1.3, also note that fread does not import characters as factors by default, so the reader should not be surprised when the lm call does not see factors.
It would therefore be advisable to keep a fixed copy of the dataset on the book site or GitHub, as Airbnb might be changing things. That goes for all datasets from the internet.
This is really valuable. I’ll see what we can do with Airbnb.
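On the factor point, here is a minimal base-R sketch with a made-up data frame (the column names and values are illustrative, not the actual Airbnb fields). Character columns can be converted to factors after import, so that functions expecting factors see them as categorical. fread() also accepts stringsAsFactors = TRUE, and as.data.frame() turns a data.table into a plain data frame.

```r
# Made-up data standing in for an imported listings file
listings <- data.frame(
  price     = c(120, 85, 200, 150),
  room_type = c("Entire home", "Private room", "Entire home", "Shared room"),
  city      = c("SF", "SF", "Oakland", "SF"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns alone
is_chr <- vapply(listings, is.character, logical(1))
listings[is_chr] <- lapply(listings[is_chr], factor)

str(listings)
```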
Have thoroughly enjoyed the draft. Do you plan to release a Python companion or has anyone reached out with their own? It’d be good to compare notes.
That would be nice, but as you see, the book relies heavily on R libraries written by others.
A slightly delaying factor, when I work with the material from here and from the previous book, is that the text does not specify arguments when calling a function. It is faster to comprehend a function call if the arguments are named/specified. E.g., xvalknn(data = mlb, ycol = 'Weight', predvars = c('Height', 'Age'), k = 25, p = 2/3) rather than xvalknn(mlb, 5, c(4, 6), 25, 2/3)
This is a really good idea! The book is being edited by the publisher right now, but hopefully I can make some changes along these lines.
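The difference is easy to see with a small stand-in function. The function cv_sketch() and its parameter names below are hypothetical, chosen only to mirror the commenter's example; they are not the actual signature of xvalknn().

```r
# Hypothetical stand-in function; parameter names are illustrative only
cv_sketch <- function(data, ycol, predvars, k, p) {
  n_train <- floor(p * nrow(data))  # size of the training portion
  list(outcome = ycol, predictors = predvars, k = k, n_train = n_train)
}

toy <- data.frame(Height = c(72, 74, 70), Age = c(25, 31, 28),
                  Weight = c(180, 215, 190))

# Positional call: the reader must remember what each slot means
res_pos <- cv_sketch(toy, "Weight", c("Height", "Age"), 25, 2/3)

# Named call: same result, but self-documenting, and order no longer matters
res_named <- cv_sketch(data = toy, ycol = "Weight",
                       predvars = c("Height", "Age"), p = 2/3, k = 25)

identical(res_pos, res_named)
```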
Hi, I like the draft and the approach and want to test this in a new undergrad ML course (next spring term). Participants are R-savvy. Any chance to pre-order or get an inspection copy beforehand? Cheers!
We are in the final editing stages. Please contact me by e-mail to see a PDF of the latest draft, which is very different from (and much better than) the posted one. matloff@cs.ucdavis.edu
Is there any sense of when/if regtools might be updated on CRAN? Maybe under another name, due to some lack of backward compatibility.
And might the author and maintainer group expand?
Working on that now.
ok, so qeML is now the spinoff from RegTools. And the book is out in August 2022. A bit of time to wait; will look into the github repos meanwhile but please do update/post if there is more documentation coming out in between.
Running as fast as I can to stay in place. 🙂 The doc for qeML is not very good at this point, but the README should be good enough to start with. Of course, contact me with any questions that arise. The book is currently in the editing stage, so there is light at the end of the tunnel. 🙂
Learning from books and online resources, and having immigrated three times (Denmark to Germany to UAE to Singapore), I realize a certification is needed in DS. Otherwise, one is likely not to be considered for a work visa. What certifications in DS might you recommend that mostly avoid tidyverse and Python, i.e. give the student freedom of choice in tools?
Quite a trek! Where will you go next? 🙂 Sadly, I haven’t seen any certifications that I consider worthy.
Sounds like the best method is to certify oneself and present the output succinctly.
Sad to see the book postponed again, to June 2023. Hope it will be all the better. Will check out the latest qeML.
There has been much progress recently. Hopefully it will be out earlier than that. Thanks for your support!