New Book on Machine Learning

I’m nearing completion of writing my new book, The Art of Machine Learning: Algorithms+Data+R, to be published by the whimsically named No Starch Press. I’m making a rough, partial draft available, and welcome corrections, suggestions and comments.

I’ve been considering doing such a project for some time, intending to write a book that would on the one hand serve as “machine learning for the masses” while ardently avoiding being of a “cookbook” nature. In other words, the book has two goals:

  • The math content is kept to a minimum. Readers need only be able to understand scatter plots and the like, and know the concept of the slope of a line. (For readers who wish to delve into the math, a Math Companion document will be available.)
  • There is strong emphasis on building a solid intuitive understanding of the methods, empowering the reader to conduct effective, penetrating ML analysis.

As I write in the preface (“To the Reader”),

“Those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight.”

The language of instruction is R, using standard CRAN packages. But as I also write,

“…this is a book on ML, not a book on using R in ML. True, R plays a major supporting role and we use prominent R packages for ML throughout the book, with code on almost every page. But in order to be able to use ML well, the reader should focus on the structure and interpretation of the ML models themselves; R is just a tool toward that end.”

So, take a look, and let me know what you think!


73 thoughts on “New Book on Machine Learning”

  1. Thanks for sharing the draft. I like what I see so far. Been a big fan of your “The art of R programming” book. When is this book going to be published? Thanks, Eddie.

    1. Thanks for the comments. Not sure about the timing. I expect to be done with the remaining chapters by the end of the month, after which the editing process begins. Maybe end of the year.

      1. I’m eagerly awaiting your new book. I see that you have invested considerable time in expanding the regtools package for machine learning, so I have to be patient. Do you have a new ETA for the book you can share Dr. Matloff?

        1. Yes, very sorry for the delay. Yes, I actually have redone all the examples using my new regtools functions. This will make things very easy for readers! I should be able to get the redone manuscript to the publisher within 4-5 weeks. Thanks so much for your interest and your patience!

  2. I started reading this and on the first chapter I get the following, after I installed regtools:
    > library(regtools)
    Error: package ‘mvtnorm’ required by ‘regtools’ could not be found
    > data(day1)
    Warning message:
    In data(day1) : data set ‘day1’ not found

    I also installed mvtnorm but I still got the same message afterwards.

      1. I tried to download from github but was not able to. If you want people to read the book you will need to fix these bugs. If people are having issues following the code on page 4, they will stop bothering with it, like I will.

          1. Well, when I get to the github, I see a regtools link, click on that and I see information about regtools but nothing about downloading the package. You have to understand, I have been using R for several years in school and all packages I have ever used I have installed via install.packages(), so I don't know what to do unless I have clear instructions.

              1. I would suggest putting this R code in the text. I also had trouble and downloaded it from CRAN, which had a different dataset name (day vs. day1) and some different values than the book
                Only just starting it, but really simple and easy to follow so far. Would suggest labelling the weather degrees in 1.6 example (28 degrees is very different C vs. F degrees)

              2. PS, I needed double apostrophes!
                this worked:
                Thanks and good luck with the book!

  3. Section 1.2.3 typo ‘marix’ instead of matrix.
    This contribution looks like just what I need to better understand the reason for using various techniques in ML, and to help me decide the best approach to the problems I want to solve. Thank you.

  4. Thanks for sharing this, I just began reading it but already looks good.

    One small comment, in section 1.10.4 Scaling, ‘ […] we divide each predictor/feature by its standard deviation. This gives everything a standard deviation of 1. We also subtract the mean[…] ‘.
    Even if the sentence doesn’t explicitly claim a sequence, mean subtraction is the first step, so maybe reordering the sentence would make it clearer?
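To make the ordering in this comment concrete, here is a minimal sketch in base R. The data values and column names are made up for illustration and are not taken from the book; the point is only that centering comes first, and that base R's scale() does both steps in that order.

```r
# Toy predictors on very different scales (made-up values, not from the book).
x <- cbind(temp = c(12.0, 18.5, 25.1, 30.2),
           windspeed = c(0.10, 0.23, 0.31, 0.18))

# Step 1: subtract each column's mean.
centered <- sweep(x, 2, colMeans(x), "-")

# Step 2: divide each column by its standard deviation.
scaled <- sweep(centered, 2, apply(x, 2, sd), "/")

# Base R's scale() performs the same two steps in the same order.
scaled2 <- scale(x)

colMeans(scaled)       # each column mean is now (numerically) zero
apply(scaled, 2, sd)   # each column standard deviation is now one
```

Dividing by the standard deviation before or after centering gives the same result, since subtracting a constant does not change the standard deviation, but describing the mean subtraction first matches how scale() is usually explained.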

  5. Hello Norm, there are some issues installing regtools from Github (v1.2.1). I already have regtools 1.1 on my machine (installed with no issues through CRAN) but it does not have day1 dataset available. Any plans on providing the latest regtools through CRAN?

    1. See my preceding comment here as to how to install.

      The package is certainly due for an update on CRAN. There is a lot involved there, but I’ll try to find time this evening for it.
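Until CRAN catches up, a reader can at least check which regtools version is installed before running the book's code. The sketch below is an assumption-laden illustration: needs_update() is a hypothetical helper, the 1.2.1 threshold is simply the GitHub version mentioned in this comment, and the install line assumes the package lives in the matloff/regtools repository and that devtools is available.

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `min_version`.
needs_update <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  utils::packageVersion(pkg) < min_version
}

# 1.2.1 is the GitHub version named in the comment above (an assumption that
# this is the first version shipping the day1 dataset).
if (needs_update("regtools", "1.2.1")) {
  message("regtools is missing or older than 1.2.1; one way to update it:")
  message('  devtools::install_github("matloff/regtools")')
}
```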

  6. So far, great book! Trying to run the code in
    day1x <- day1[,1:5]
    tot <- day1$tot
    knnout <- knn(day1x,tot,c(1,12.0,11.8,0.23,5),5)

    receiving "Error in knn(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) :
    'train' and 'class' have different lengths"

    Can't find the problem.

      1. Tried updating regtools from Github. Got an error message to install Rtools. Tried to install R tools but received a message that it is not available for R version 4.0.0

          1. I installed regtools from github and the installation went well. However, I am still receiving the same message “Error in knn(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) : ‘train’ and ‘class’ have different lengths”

              1. Sorry, but when I write kNN I receive “Error in kNN(day1x, tot, c(1, 12, 11.8, 0.23, 5), 5) :
                could not find function “kNN”

                1. Of course a loaded regtools.
                  I ran lsf.str("package:regtools") – and didn’t find a function called knn or kNN.
                  I am sorry, but I just don’t understand what is going on.
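A likely explanation for this thread, offered as a guess: R is case-sensitive, so knn and kNN are different functions. The message about 'train' and 'class' having different lengths matches the one raised by knn() in the class package, whose arguments are (train, test, cl, k), while kNN() exists only in newer regtools versions. The sketch below uses a hypothetical helper, where_is(), to report which package supplies a given function name.

```r
# Hypothetical helper: report which package (if any) supplies a function
# with exactly this name on the current search path.
where_is <- function(fname) {
  if (!exists(fname, mode = "function")) return(NA_character_)
  env <- environment(get(fname, mode = "function"))
  # Primitive functions such as sum() have no environment; they live in base.
  if (is.null(env)) "base" else environmentName(env)
}

# The two spellings are looked up separately.
where_is("knn")  # names the package providing knn(), or NA if none is attached
where_is("kNN")  # names the package providing kNN(), or NA if none is attached
```

If the first call names a package other than regtools and the second returns NA, the installed regtools predates kNN(), and updating the package is the next step.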

  7. Running the code at the top of page 20 produces the following error:

    knnout <- kNN(day1x,tot,newx=day1x,8,allK=TRUE)
    Error in hasFactors(newwx) : object 'newwx' not found

    The problem may be on line 77 of Nonpar.R file:

    if (is.factor(newx) || && hasFactors(newwx))
    stop('change to dummies, factorsToDummies()')

    The hasFactors(newwx) argument should be hasFactors(newx)?

      1. I ran the following code:
        knnout <- kNN(x = day1x, y = tot, newx = day1x, kmax = 8, allK = TRUE)

        after reinstalling regtools from github with the devtools package (Aug 2), the function still throws the following:
        Error in hasFactors(newwx) : object 'newwx' not found.

        Do I have to wait for the fix to make its way into regtools?

        BTW, the book is the most lucid treatment of ML I've encountered. Looking forward to placing the printed version on my shelf.

  8. Hi, I am looking forward to this book. I intend to use your other book, Probability and Statistics for Data Science / Math + R + Data (CRC Press), as one of the textbooks for a data science intro course this fall. Next summer, I want to follow up with an intro course on ML, so this book would be very welcome! Question: do you have sample answers to the exercises in the earlier “Probability and Statistics”? Would love to check if I got it right 😉 – Also, really enjoyed your “TidyverseSceptic” essay – most of the information was new to me & I appreciated your open-source call to arms immensely! Sorry to hog your comment space here …Thanks and cheers from Berlin

    1. Thanks very much for the comments and support. I do have some sample solutions to exercises, though they are still only partially complete. If there is an exercise you are particularly interested in, please contact me.

    1. The primary goal of the book is for the reader to get a good practical UNDERSTANDING of ML concepts. Facility with any particular library package is NOT a goal, though the book does use major packages.

  9. Google believes the Machine Learning pattern HELPED a lot in integrating data by the companies into their core businesses adds more meaning and significant value has grown considerably, with an increase in interest by more than four times in the last 5 years.

  10. Love the ML book. Unfortunately only able to install regtools from cran and that is version 1.1, so for example kNN function is not found.

    When I run …


    … I receive the following error message (any direction would be appreciated):

    Installing package into ‘C:/Users/greg.blevins/Documents/R/win-library/4.0’
    (as ‘lib’ is unspecified)
    * installing *source* package ‘regtools’ …
    ** using staged installation
    ** R
    ** data
    ** inst
    ** byte-compile and prepare package for lazy loading
    Error: (converted from warning) package ‘FNN’ was built under R version 4.0.3
    Execution halted
    ERROR: lazy loading failed for package ‘regtools’
    * removing ‘C:/Users/greg.blevins/Documents/R/win-library/4.0/regtools’
    Error: Failed to install ‘regtools’ from GitHub:
    (converted from warning) installation of package ‘C:/Users/GREG~1.BLE/AppData/Local/Temp/RtmpOwaJbC/fileef476121a86/regtools_1.5.0.tar.gz’ had non-zero exit status
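The "(converted from warning)" text in this log suggests that the install stopped because a warning (FNN having been built under a newer R patch release) was promoted to an error, not because regtools itself failed to build. One commonly used workaround, shown here as a sketch and not as the author's instructions, is to set the environment variable that the remotes package (used by devtools) reads for this behavior, then retry. The repository name matloff/regtools is assumed.

```r
# Ask remotes/devtools not to turn installation warnings into errors.
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")

# Then retry the install (left commented out because it needs network access):
# devtools::install_github("matloff/regtools")
```

Updating R itself to the version the dependency was built under is the other usual way to make this warning go away.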

  11. Dear Prof

    Thank you for a wonderful book. It is logical, concise and reads very easily. I reckon it will remain my favourite ML book.

    I am working through the examples. On page 20 you introduce the argument allK for the kNN() function. I receive the following message: “Error in kNN(day1x, tot, day1, 8, allK = TRUE) :
    allK option currenttly disable”.

    Please tell me how I should enable allK.

    Thank you
    Kind regards
    Joe Lippert

    1. Thanks for your support, very much appreciated!

      The allK option became too cumbersome to maintain. But I’ve now added kNNallK(), a restored previous version of kNN().

  12. Dear Prof

    I am steadily continuing to read and test your code. So far, so good.

    I am stuck at pages 51/52.

    My code is:
    ## Load the dataset. Due to the size I chose fread() ####

    songs <- fread("YearPredictionMSD.txt", header = FALSE)

    # Look at the dimensions of the dataset
    songs[1, ] # V1 is the outcome. The rest (V2:V91) are the features.

    # How long does it take to predict one data point – the first row of data
    system.time(kNN(songs[, -1], songs[[1]], songs[1, -1], 25))

    # let's apply dimension reduction via PCA
    newx <- songs[1, -1]
    newx[1, ] <- 32.6
    knn_songs <- kNN(songs[, -1], songs[[1]], newx = newx, 50, PCAcomps = 20)

    The response I receive is:
    Error in kNN(songs[, -1], songs[[1]], newx = newx, 50, PCAcomps = 20) :
    PCA now must be done separately

    Please assist

    Kind regards
    Joe Lippert

    1. Yes, the kNN() function does not do its own PCA now. You must apply PCA yourself, say using prcomp(). However, regtools now has new functions in a qe*-series, including qeKNN(). I will be adding a PCA capability to the functions in that series within the next few days.
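As a rough illustration of doing the PCA step separately with prcomp(), the sketch below uses synthetic data as a stand-in for the song file (the real data has 90 features; here there are 10 and only 4 components are kept). The final kNN() call is left as a comment because it simply mirrors the call in the question above and its exact behavior with the reduced data is an assumption.

```r
set.seed(1)

# Synthetic stand-in for the song data: column 1 is the outcome, the rest
# are features.
songs <- as.data.frame(matrix(rnorm(200 * 11), ncol = 11))
x <- songs[, -1]
y <- songs[[1]]

# Fit PCA on the features only, centering and scaling them first.
pcout <- prcomp(x, center = TRUE, scale. = TRUE)

# Keep the first few principal components (20 in the question above).
ncomps <- 4
xpc <- pcout$x[, 1:ncomps]

# New cases must be projected with the same centering, scaling and rotation;
# predict() on a prcomp object does this.
newx <- x[1, ]
newxpc <- predict(pcout, newdata = newx)[, 1:ncomps, drop = FALSE]

# The reduced data can then be passed on, mirroring the original call:
# knnout <- kNN(xpc, y, newx = newxpc, 50)
```

The key point is that the new data point is transformed with the PCA fitted on the training features, not with a separate PCA of its own.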

  13. the smoothingFtn=median option on page 17 results in regests being just the index of the first neighbour
    regtools 1.5.1

  14. For errata or corrections if still possible:

    ms <- fread( … : It takes a little guessing to download and unzip the music file, but possible of course.

    page 109 "in spite" seems an illogical term

    section 7.6.3. data = yr , yr is unknown to the reader; likely it is ms.

    same section 7.6.3: the use of y = yr[idxs, 1] results in an error if the data is still a data.table (from fread) and not a data.frame

    the listings.csv on airbnb site is not in line with the one used in the book. on the site it's a summary file. the better file is likely the gz one which can be easily imported with fread.

    none of the airbnb price entries are ' ' (empty) as of October 2020 and square_feet has been removed on their side. There is no weekly_price nor monthly_price. There are some columns in the listings gz file data that are all NA.

    Also note that fread does not import characters as factors by default, so the reader should not be surprised when the lm call does not see factors.

    It would therefore be advisable to host a fixed dataset on the book site or github, as airbnb might keep changing things. That goes for all data sets from the internet.
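The two fread-related points in this comment can be worked around in a few lines of base R. The sketch below uses a small made-up data frame in place of the real listings file; the column names and values are illustrative only. If the object came from fread(), calling as.data.frame() on it first gives the data.frame indexing behavior shown here, and fread() also accepts stringsAsFactors = TRUE.

```r
# Made-up stand-in for a table read by fread(): strings arrive as character.
listings <- data.frame(price = c(120, 85, 240),
                       room_type = c("Entire home", "Private room", "Entire home"),
                       stringsAsFactors = FALSE)

# Convert every character column to a factor, leaving other columns untouched.
is_chr <- vapply(listings, is.character, logical(1))
listings[is_chr] <- lapply(listings[is_chr], factor)

# On a data.frame, selecting a single column with [rows, col] returns a plain
# vector, which is what code such as yr[idxs, 1] relies on.
y <- listings[1:2, 1]
```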

  15. Have thoroughly enjoyed the draft. Do you plan to release a Python companion or has anyone reached out with their own? It’d be good to compare notes.

  16. A slightly delaying factor, when I work with the material from here and from the previous book, is that the text does not specify arguments when calling a function. It is faster to comprehend a function call if the arguments are named/specified. E.g., xvalknn(data = mlb, ycol = 'Weight', predvars = c('Height', 'Age'), k = 25, p = 2/3) rather than xvalknn(mlb, 5, c(4, 6), 25, 2/3)

      1. Hi, I like the draft and the approach and want to test this in a new undergrad ML course (next spring term). Participants are R savvy. Any chance to pre-order or get an inspection copy before? Cheers!

  17. Is there a feeling when/if RegTools might be updated on cran? Maybe under another name due to some lack of backward compatibility.
    And might the author and maintainer group expand?

      1. ok, so qeML is now the spinoff from RegTools. And the book is out in August 2022. A bit of time to wait; will look into the github repos meanwhile but please do update/post if there is more documentation coming out in between.

        1. Running as fast as I can to stay in place. 🙂 The doc for qeML is not very good at this point, but the README should be good enough to start with. Of course, contact me with any questions that arise. The book is currently in the editing stage, so there is light at the end of the tunnel. 🙂

          1. Learning from books and online resources, and having immigrated three times (Denmark to Germany to UAE to Singapore), I realize a certification is needed in DS. Otherwise, one is likely not to be considered for a work visa. What certifications in DS might you recommend that mostly avoid tidyverse and python, i.e. give the student freedom of choice in tools?
