Category Archives: Uncategorized

New Book on Machine Learning

I’m nearing completion of writing my new book, The Art of Machine Learning: Algorithms+Data+R, to be published by the whimsically named No Starch Press. I’m making a rough, partial draft available, and welcome corrections, suggestions and comments.

I’ve been considering doing such a project for some time, intending to write a book that would on the one hand serve as “machine learning for the masses” while ardently avoiding being of a “cookbook” nature. In other words, the book has two goals:

  • The math content is kept to a minimum. Readers need only be able to understand scatter plots and the like, and know the concept of the slope of a line. (For readers who wish to delve into the math, a Math Companion document will be available.)
  • There is strong emphasis on building a solid intuitive understanding of the methods, empowering the reader to conduct effective, penetrating ML analysis.

As I write in the preface (“To the Reader”),

“Those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight.”

The language of instruction is R, using standard CRAN packages. But as I also write,

“…this is a book on ML, not a book on using R in ML. True, R plays a major supporting role and we use prominent R packages for ML throughout the book, with code on almost every page. But in order to be able to use ML well, the reader should focus on the structure and interpretation of the ML models themselves; R is just a tool toward that end.”

So, take a look, and let me know what you think!

FOCI: a new method for feature selection

It’s been a while since I’ve posted here, not for lack of material but just having too long a TO DO list. I’ve added some important features to my regtools package, for instance, and hope to discuss them here soon.

For the present post, though, I will discuss FOCI, a highly promising new approach to feature selection, developed by Mona Azadkia and Sourav Chatterjee of the Stanford Stat Dept. There is now an R package of the same name on CRAN.

I am one of the authors of the package, but was not involved in developing the method or the associated theory. My connection here was accidental. A few months ago, Sourav gave a talk on FOCI at Stanford. I must admit that I had been skeptical when I read the abstract. I had seen so many proposals on this topic, and had considered them impractical, full of unverifiable assumptions and rather opaque indications regarding asymptotic behavior. So, I almost didn’t attend, even though I was on the Stanford campus that day.

But I did attend, and was won over. The method has only minimal assumptions, explicit asymptotic behavior, and lo and behold, no tuning parameters.

I tried FOCI on some favorite datasets that evening, and was pleased with the results, but the algorithm was slow even on moderate-sized datasets. So I wrote to Sourav, praising the method but suggesting that they parallelize the software. He then asked if I might get involved in that, which I did.

You can find Mona and Sourav’s arXiv paper at https://arxiv.org/pdf/1910.12327.pdf, which extends earlier work by Sourav. The theory is highly technical, I’d say very intricate even by math stat standards, thus not for the faint of heart. Actually, it was amusing during Sourav’s talk when one of the faculty said, “Wait, don’t leave that slide yet!”, as the main estimator (T_n, p.3) shown on that slide, is a challenge to interpret.

The population-level quantity, Eqn. (2.1), is more intuitive. Say we are predicting Y from a vector-valued X, and are considering extending the feature set to (X,Z). The expression is still a bit complex — and please direct your queries to Mona and Sourav for details, not to me 🙂 — but at its core it is comparing Var(Y | X,Z) to Var(Y | X). How much smaller is the former than the latter? Of course, there are expectations applied etc., but that is the intuitive heart of the matter.

So here is a concrete example, the African Soil remote sensor dataset from Kaggle. After removing ID, changing a character column to R factor then to dummies, I had a data frame with n = 1157 and p = 3578. So p > n, with p being almost triple the size of n, a nice test for FOCI. For Y, there are various outcome variables; I chose pH.

The actual call is simple:

fout <- foci(afrsoil[,3597],afrsoil[,1:3578])

This specifies the outcome variable and features. There are some other options available. The main part of the output is the list of selected variables, in order of importance. After that, the cumulative FOCI correlations are shown:

> fout$selectedVar
    index    names
 1:  3024 m1668.14
 2:  3078    m1564
 3:  2413 m2846.45
 4:  3173  m1380.8
 5:  3267 m1199.52
 6:  1542 m4526.16
 7:  2546 m2589.96
 8:  3537 m678.828
 9:  3469 m809.965
10:  2624 m2439.54
11:  1874  m3885.9
12:  2770 m2157.98
> fout$stepT[1:12]
 [1] 0.1592940 0.3273176 0.4201255 0.5271957 0.5642336 0.5917753 0.6040584
 [8] 0.6132360 0.6139810 0.6238901 0.6300294 0.6303839

Of course, those feature names, e.g. ‘m1564’, are meaningful only to a soil scientist, but you get the idea. Out of 3578 candidate features, FOCI chose 12, attaining very strong dimension reduction. Actually, the first 8 or so form the bulk of the selected feature set, with the remainder yielding only marginal improvements.

But is it a GOOD feature set? We don’t know in this case, but Mona and Sourav have a very impressive example in the paper, in which FOCI outperforms random forests, LASSO, SCAD and so on. That, combined with the intuitive simplicity of the estimator etc., makes FOCI a very promising addition to anyone’s statistical toolkit. I’m sure more empirical comparisons are forthcoming.

And what about the computation speed? Two types of parallelism are available, with best results generally coming from using both in conjunction: 1. At any step, the testing of the remaining candidate features can be done simultaneously, on multiple threads on the host machine. 2. On a cluster or in the cloud, one can ask that the data be partitioned in chunks of rows, with each cluster node applying FOCI to its chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that that last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).

Approach (2) also has the advantage that it gives the
user an idea of the degree of sampling variability in the FOCI
results.

So there you have it. Give FOCI a try!

R Journal July Issue

As the current Editor-in-Chief of the R Journal, I must apologize for the delay in getting the July issue online, due to technical and other matters. In the meantime, though, please take a look at the many interesting articles slated for publication in this and upcoming issues.

Various improvements in technical documentation, as well as the pending hire of the journal’s first-ever editorial assistant, should shorten the review and publication processes in the future.

By the way, I’ve made a couple of tweaks to the Instructions for Authors. First, I note that the journal’s production software really does require following the instructions carefully. For instance, the \author field in your .tex file must start with “by” in order to work properly; it’s not merely a matter of, say, aesthetics. And your submission package must not have more than one .bib file or more than one .tex file other than RJwrapper.tex.

Finally, a reminder to those whom we ask to review article submissions: Your service would be greatly appreciated, a valuable contribution to R. If you must decline, though, please respond to our e-mail query, stating so, so that we may quickly search for other possible reviewers.

Prob/Stat for Data Sci: Math + R + Data

My new book, Probability and Statistics for Data Science: Math + R + Data, pub. by the CRC Press, was released on June 24!

This book arose from an open-source text I wrote and have been teaching from. The open source version will still be available, though rather different from the published one.

This is a math stat book, but different from all others, as the subtitle states: Math + R + Data. Even the topic ordering is rather different, with a goal of bringing in data science-relevant material as early as possible.

I’ve placed an excerpt from the book at http://tinyurl.com/y6hf7x66. I believe it epitomizes the intent and style of the book. Also, I’ve placed the front matter at http://tinyurl.com/yy9hx6db 

r/finance, 1 year later

The prominent conference R/Finance, held annually in Chicago, had a great program yesterday and today. As I wrote following last year’s conference, the organizers were criticized for including no women in its speaker lineup. The problem was that no women had submitted papers for consideration; no input, thus no output.

I’m a member of the Editorial Board of the R Journal, and out of curiosity, yesterday I did a gender count among papers I reviewed during my first two years of service, 2017 and 2018. I considered only first-author status, and found that I had accepted 54% of the papers by men, and 67% of those by women.

That seems good, but only 20% of these papers were by women. I’m sure that most journals have low numbers of submissions by women. For instance, in the current issue of the Journal of Computational and Graphical Statistics, only 3 of 18 paper have women as first authors.

Thus I felt that the activists’ criticisms last year were unfair. Not only had there been no submissions by women, hence no women speakers, but also the conference organizers quickly made amends when the problem was pointed out. They quickly arranged a special talk by a woman who had presented in a previous year, and also made room in the schedule for a talk by R Ladies on improving conditions for women in conferences. They promised to be proactive in encouraging women to submit papers this year.

The organizers did take strong proactive measures to improve things this year, and the results were highly impressive. There were 12 women presenters by my count out of 50-something, including an excellent keynote by Prof. Genevera Allen of Rice University. In addition, there were two women on the Program Committee.

We all know that finance is a male-dominated field.  Thus it is not too surprising that the conference received no submissions by women last year (though, as noted, they had had women speakers in the past).  But they are to be highly commended for turning things around, and indeed should serve as a model.

(Phrasing clarified, November 30, 2019.)