Prob/Stat for Data Sci: Math + R + Data

My new book, Probability and Statistics for Data Science: Math + R + Data, published by CRC Press, was released on June 24!

This book arose from an open-source text I wrote and have been teaching from. The open-source version will still be available, though it is rather different from the published one.

This is a math stat book, but different from all others, as the subtitle states: Math + R + Data. Even the topic ordering is rather different, with a goal of bringing in data science-relevant material as early as possible.

I’ve placed an excerpt from the book at http://tinyurl.com/y6hf7x66. I believe it epitomizes the intent and style of the book. I’ve also placed the front matter at http://tinyurl.com/yy9hx6db.

My Free Online Tutorial on R

I’m continuing to add lessons to my free online R tutorial; there are 17 so far, with more to come. It is aimed specifically at nonprogrammers, though those with a C or Python background should find it helpful too. Comments and suggestions welcome!

r/finance, 1 year later

The prominent conference R/Finance, held annually in Chicago, had a great program yesterday and today. As I wrote following last year’s conference, the organizers were criticized for including no women in their speaker lineup. The problem was that no women had submitted papers for consideration; no input, thus no output.

I’m a member of the Editorial Board of the R Journal, and out of curiosity, yesterday I did a gender count among papers I reviewed during my first two years of service, 2017 and 2018. I considered only first-author status, and found that I had accepted 54% of the papers by men, and 67% of those by women. That seems good, but only 20% of these papers were by women. I’m sure the numbers for my fellow board members were similar, and indeed for other journals in data science. For instance, in the current issue of the Journal of Computational and Graphical Statistics, only 3 of 18 papers have women as first authors.

Thus I felt that the activists’ criticisms last year were unfair. Not only had there been no submissions by women, hence no women speakers, but also the conference organizers made amends promptly when the problem was pointed out. They quickly arranged a special talk by a woman who had presented in a previous year, and also made room in the schedule for a talk by R Ladies on improving conditions for women at conferences. They promised to be proactive in encouraging women to submit papers this year.

The organizers did take strong proactive measures to improve things this year, and the results were highly impressive. There were 12 women presenters by my count out of 50-something, including an excellent keynote by Prof. Genevera Allen of Rice University. In addition, there were two women on the Program Committee.

We all know that finance is a male-dominated field.  Thus it is not too surprising that the conference received no submissions by women last year (though, as noted, they had had women speakers in the past).  But they are to be highly commended for turning things around, and indeed should serve as a model.

nice student project

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use any C/C++, and try to increase speed by converting some of the code to C/C++.

Some of the project submissions were really excellent. I decided to place one on the course Web page, and chose this one. Nice usage of Rcpp and devtools (neither of which was covered in class), very nicely presented.

R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is “more CS-ish than Python,” just to provoke my fellow computer scientists. 🙂

But one aspect that I think is huge but probably gets lost when I cite it is R’s row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive.  My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:

 

> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5"  "12" "13" "9"  "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
 5 12 13  9  6
 1  2  3  4  5

So, from now on, to do the lookup, I just use as subscript the character form of the original ID, e.g.

> xLookup['12']
12
 2
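
Subscripting by name is vectorized, so the same table can translate the whole ID vector in one shot. Here is a minimal sketch that rebuilds x and xLookup as above; the name newIDs is just for illustration, and unname() is there only to drop the labels from the result.

```r
# rebuild the lookup table from the example above
x <- c(5,12,13,9,12,5,6)
xuc <- as.character(unique(x))
xLookup <- 1:length(xuc)
names(xLookup) <- xuc

# index by the character form of every ID at once;
# unname() strips the names, leaving just the new IDs
newIDs <- unname(xLookup[as.character(x)])
newIDs
# [1] 1 2 3 4 2 1 5
```

Repeated IDs such as 12 and 5 map to the same new ID each time they occur, which is exactly what a lookup table should do.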

Of course, I did all this within program code. So to change a column of IDs to the new ones, I wrote

 ul[as.character(testSet[,1])])

Lots of other ways to do this, of course, but it shows how handy the names can be.
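
As a rough sketch of that kind of column update, here is a self-contained version on made-up data. The contents of testSet, the helper makeLookup() and the variable names are all hypothetical, chosen only for illustration. The last lines show one of the "other ways," base R's match(), which gives the same new IDs without using names.

```r
# hypothetical helper: build a named lookup vector from raw IDs
makeLookup <- function(ids) {
   u <- unique(ids)
   lk <- seq_along(u)            # new IDs 1,2,...,number of distinct IDs
   names(lk) <- as.character(u)  # names are the old IDs, as strings
   lk
}

# made-up data; column 1 holds the nonconsecutive IDs
oldIDs <- c(5,12,13,9,12,5,6)
testSet <- data.frame(id = oldIDs,
                      y = c(2.1,0.3,1.8,4.4,0.9,3.2,2.7))

lk <- makeLookup(testSet[,1])
# replace the ID column by the new consecutive IDs
testSet[,1] <- unname(lk[as.character(testSet[,1])])

# alternative without names: position of each ID among the distinct values
viaMatch <- match(oldIDs, unique(oldIDs))
```

The named-vector version has the advantage that the table lk can be saved and applied later to new data with the same old IDs, whereas match() has to be given the distinct values again.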