R/Finance, 1 year later

The prominent conference R/Finance, held annually in Chicago, had a great program yesterday and today. As I wrote following last year’s conference, the organizers were criticized for including no women in their speaker lineup. The problem was that no women had submitted papers for consideration; no input, thus no output.

I’m a member of the Editorial Board of the R Journal, and out of curiosity, yesterday I did a gender count among papers I reviewed during my first two years of service, 2017 and 2018. I considered only first-author status, and found that I had accepted 54% of the papers by men, and 67% of those by women. That seems good, but only 20% of these papers were by women. I’m sure the numbers for my fellow board members were similar, and indeed for other journals in data science. For instance, in the current issue of the Journal of Computational and Graphical Statistics, only 3 of 18 papers have women as first authors.

Thus I felt that the activists’ criticisms last year were unfair. Not only had there been no submissions by women, hence no women speakers, but also the conference organizers quickly made amends when the problem was pointed out. They arranged a special talk by a woman who had presented in a previous year, and also made room in the schedule for a talk by R Ladies on improving conditions for women in conferences. They promised to be proactive in encouraging women to submit papers this year.

The organizers did take strong proactive measures to improve things this year, and the results were highly impressive. There were 12 women presenters by my count out of 50-something, including an excellent keynote by Prof. Genevera Allen of Rice University. In addition, there were two women on the Program Committee.

We all know that finance is a male-dominated field.  Thus it is not too surprising that the conference received no submissions by women last year (though, as noted, they had had women speakers in the past).  But they are to be highly commended for turning things around, and indeed should serve as a model.


Free online R course

Recently a young relative mentioned that the campus R course she hoped to attend was full. What online alternatives did she have? So, I decided to start one of my own! https://github.com/matloff/fasteR  Designed for complete beginners.

I now have six lessons up on the site. I hope to add one new lesson per week.

Nice student project

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use any C/C++, and try to increase speed by converting some of the code to C/C++.

Some of the project submissions were really excellent. I decided to place one on the course Web page, and chose this one. Nice usage of Rcpp and devtools (neither of which was covered in class), very nicely presented.

R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is “more CS-ish than Python,” just to provoke my fellow computer scientists. 🙂

But one aspect that I think is huge but probably gets lost when I cite it is R’s row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive.  My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:


> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5" "12" "13" "9" "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
 5 12 13  9  6
 1  2  3  4  5

So, from now on, to do the lookup, I just use as subscript the character form of the original ID, e.g.

> xLookup['12']

Of course, I did all this within program code. So to change a column of IDs to the new ones, I wrote
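
A minimal sketch of that step, not the original code: it assumes a hypothetical data frame `d` whose `id` column holds the raw IDs. The key detail is converting the IDs to character before subscripting, so that R looks them up by name rather than by position.

```r
# Rebuild the lookup table from the example above
x <- c(5, 12, 13, 9, 12, 5, 6)
xuc <- as.character(unique(x))
xLookup <- seq_along(xuc)
names(xLookup) <- xuc

# Hypothetical data frame; only the id column matters here
d <- data.frame(id = x, score = c(0.3, 1.2, 0.7, 2.1, 0.9, 1.5, 0.4))

# Subscript by name (character), then drop the names from the result
d$id <- unname(xLookup[as.character(d$id)])
d$id
```

For comparison, `match(x, unique(x))` yields the same mapping in a single call, but the named vector has the advantage that it can be kept around and reused on later data.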


Lots of other ways to do this, of course, but it shows how handy the names can be.

Example of Overfitting

I occasionally see queries on various social media about overfitting: what is it, how does one detect it, and so on. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as  it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:


> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]

As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98

Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) 
   - testset[,11]))
[1] 24249.83

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.
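
One way to account for the 46, offered as an assumption about the bookkeeping rather than a description of polyreg's internals: suppose 2 of the 10 predictors are continuous (age and column 4 of pe) and the other 8 are dummies. Squaring a dummy just reproduces the dummy, and the product of two mutually exclusive dummies (any two of occ1 through occ5, and, assuming they are exclusive, ms and phd) is identically zero, so none of those columns adds anything.

```r
# Assumed breakdown of the 10 predictors (not read from polyreg itself)
n_cont  <- 2   # age and column 4 of pe, assumed continuous
n_dummy <- 8   # sex, ms, phd, occ1..occ5

linear  <- n_cont + n_dummy      # degree-1 terms
squares <- n_cont                # dummy^2 equals the dummy, so only continuous squares are new
pairs   <- choose(linear, 2)     # all two-way cross products
zero_pairs <- choose(5, 2) + 1   # occ_i * occ_j pairs, plus ms * phd (assumed exclusive)

n_terms <- linear + squares + pairs - zero_pairs
n_terms
```

Under these assumptions the count comes to 10 + 2 + 45 - 11 = 46, matching the figure above.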

I kept going:

deg      MAPE  # terms
  3  23951.69      118
  4  23974.76      226
  5  24340.85      371
  6  24554.34      551
  7  36463.61      767
  8  74296.09     1019

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p < sqrt(n), where p is the number of terms and n, the number of data points, is about 20,000 here.
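
To make the rule of thumb concrete, the sketch below compares the term counts reported in this post with sqrt(n), taking n = 20000 as the rough figure quoted above (the exact training-set size is not stated, so this is an approximation).

```r
n <- 20000  # approximate sample size, as quoted in the text

# Number of terms at each polynomial degree, taken from the post
terms <- c("1" = 10, "2" = 46, "3" = 118, "4" = 226,
           "5" = 371, "6" = 551, "7" = 767, "8" = 1019)

limit <- sqrt(n)
round(limit, 1)

# Degrees whose term count stays under the limit
ok_degrees <- names(terms)[terms < limit]
ok_degrees
```

With sqrt(n) at roughly 141, degrees 1 through 3 pass and degree 4, with 226 terms, already does not.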

In any event, the effects of overfitting here are dramatic, starting at degree 7.
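
For readers without polyreg installed, the same kind of experiment can be sketched in base R on simulated data. Everything here (the sine curve, the noise level, the sample sizes) is made up for illustration and is unrelated to the prgeng numbers.

```r
set.seed(1)

# Simulated data: a smooth signal plus noise
n <- 140
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

# Small training set, so that high degrees have room to overfit
test_idx <- sample(n, 100)
train <- data.frame(x = x[-test_idx], y = y[-test_idx])
test  <- data.frame(x = x[test_idx],  y = y[test_idx])

# Mean absolute prediction error on the test set, by polynomial degree
degs <- 1:12
mape <- sapply(degs, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  mean(abs(predict(fit, test) - test$y))
})

data.frame(deg = degs, MAPE = round(mape, 3))
```

Typically the test error drops over the first few degrees and then climbs as the high-degree fits start chasing noise, though the exact pattern depends on the seed and the sample sizes.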

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.
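
The collinearity among raw powers is easy to see directly. The sketch below uses a simulated predictor on (0,1), an assumption made purely for illustration, and checks how strongly neighboring powers are correlated and how the condition number of the design matrix grows with degree.

```r
set.seed(1)
x <- runif(1000)

# Columns are x, x^2, ..., x^8
X <- outer(x, 1:8, "^")
colnames(X) <- paste0("x^", 1:8)

# Correlation between two neighboring high powers
r78 <- cor(X[, 7], X[, 8])
r78

# Condition number of the design matrix (with intercept) for degrees 2..8
cond <- sapply(2:8, function(d) kappa(cbind(1, X[, 1:d]), exact = TRUE))
signif(cond, 3)
```

A correlation very close to 1 and a rapidly growing condition number are the numerical symptoms behind warnings of the rank-deficient kind.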

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.

Manifold Visualization: Second Example

In last night’s post, I introduced prVis(), a new visualization tool which we have invented, available in our polyreg package. Recall that prVis() is intended as a simpler alternative to recent visualization tools like t-SNE and UMAP. Here I will post another example.

The dataset is prgeng, included in the package. It consists of wage income, age, gender, and so on, of Silicon Valley programmers and engineers, from the 2000 Census. We first load the data and then choose some of the variables (age, gender, education and occupation):

getPE()  # load the prgeng data into the data frame pe
pe1 <- pe[,c(1,2,6:7,12:16)]

So, let’s plot the graph:

The graph consists of streaks, about a dozen of them. What do they represent? To investigate that question, we call another polyreg function:


This will write the row numbers of 16 random points from the dataset onto the graph that I just plotted, which now looks like this:

Due to overplotting, the numbers are difficult to read, but are also output to the R console:

[1] "highlighted rows:"
[1] 2847
[1] 5016
[1] 5569
[1] 6568
[1] 6915
[1] 8604
[1] 9967
[1] 10113
[1] 10666
[1] 10744
[1] 11383
[1] 11404
[1] 11725
[1] 13335
[1] 14521
[1] 15462

Rows 2847 and 10666 seem to be on the same streak, so they must have something in common. Let’s take a look.

> pe1[2847,]
         age sex ms phd occ1 occ2 occ3 occ4 occ5
2847 32.3253   1  1   0    0    0    0    0    0
> pe1[10666,]
          age sex ms phd occ1 occ2 occ3 occ4 occ5
10666 45.36755  1  1   0    0    0    0    0    0

Aha! Except for age, these two workers are identical in terms of gender (male), education (Master’s) and occupation (occ. category 6). Now those streaks make sense; each one represents a certain combination of the categorical variables.
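
This check can be made mechanical. The sketch below rebuilds just the two rows printed above and pastes the categorical columns into a single label; under the interpretation given here, rows with equal labels lie on the same streak. Applying the same idea to all of pe1 (not done here) would count the distinct combinations.

```r
# The two rows shown above, re-entered by hand
two <- data.frame(
  age  = c(32.3253, 45.36755),
  sex  = c(1, 1), ms = c(1, 1), phd = c(0, 0),
  occ1 = c(0, 0), occ2 = c(0, 0), occ3 = c(0, 0),
  occ4 = c(0, 0), occ5 = c(0, 0),
  row.names = c("2847", "10666")
)

# Everything except age is categorical
cat_cols <- setdiff(names(two), "age")

# One label per row, e.g. "1-1-0-0-0-0-0-0"
combo <- do.call(paste, c(two[cat_cols], sep = "-"))
names(combo) <- rownames(two)
combo

# TRUE if both rows share the same categorical combination
same_combo <- length(unique(combo)) == 1
same_combo
```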

Well, then, let’s see what UMAP does:


The result is

The pattern here, if any, is not clear.

So in both examples, both last night’s and tonight’s, prVis() was not only simpler but also much more visually interpretable than UMAP.

In fairness, I must point out:

  • I just used the default values of umap() in these examples. It would be interesting to explore other values. On the other hand, it may be that UMAP simply is not suitable for partially categorical data, as we have in this second example.
  • For most other datasets I’ve tried, prVis() and UMAP give similar results.

Even so, these two points show the virtues of using prVis(). We are getting equal or better quality while not having to worry about settings for various hyperparameters.

Manifold Visualization: Polynomials to the Rescue

Our arXiv paper and the associated R package polyreg caused a bit of a stir, both pro and con, when we first announced them here in June. The discussion even spread as far as Twitter, Reddit and Hacker News. We’ll be announcing a revised paper, and various new features to the package, very soon.

But the purpose of this blog post is to focus on one particular new feature, a visualization tool. Over the years a number of “nonlinear” methods generalizing Principal Components Analysis (PCA) have been proposed, such as ICA and KPCA some time ago, and more recently t-SNE and UMAP.

I’ve long felt that applying PCA to “polynomial-ized” versions of one’s data should do well too, and of course much more simply, a major virtue. So, now  that we have machinery, the polyreg package, to conveniently build multivariate polynomial models — including for categorical variables, an important case — I developed a new function in polyreg, named prVis(). I’ll illustrate it in this blog post, and compare it to UMAP.
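
The underlying idea can be sketched in a few lines of base R. This is not the prVis() implementation; `polyPCA` is a hypothetical helper that expands continuous columns into all polynomial terms up to a given degree with poly() and then runs prcomp() on the result. It ignores the special handling of dummy variables that polyreg provides.

```r
polyPCA <- function(x, deg = 2) {
  # raw = TRUE gives plain powers and cross products, not orthogonal polynomials
  p <- poly(as.matrix(x), degree = deg, raw = TRUE)
  # Strip the "poly" class and attributes, keeping an ordinary numeric matrix
  p <- matrix(as.vector(p), nrow = nrow(p),
              dimnames = list(NULL, colnames(p)))
  pca <- prcomp(p, center = TRUE, scale. = TRUE)
  list(terms = p, scores = pca$x[, 1:2])
}

set.seed(1)
toy <- cbind(u = rnorm(200), v = rnorm(200))  # made-up data
out <- polyPCA(toy, deg = 2)

ncol(out$terms)   # 5 terms: u, v, u^2, u*v, v^2
head(out$scores)
```

Calling `plot(out$scores)` then gives the two-dimensional display. The appeal of the approach is that both steps are standard and easy to explain.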

A popular example in the “manifold visualization” (MV) business is the Swiss Roll model, which works as follows: A 4-component mixture of bivariate normals is generated, yielding a 2-column data frame whose column names are ‘x’ and ‘y’. Now derive from that a 3-column data frame, consisting of triples of the form (x cos (x), y, x sin(x)).
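
As a rough sketch of how such data could be generated (this is not the file shipped with polyreg): the component centers, the standard deviation, the sample size and the independence of the two coordinates are all assumptions chosen for illustration.

```r
set.seed(1)

n_per <- 100                       # points per mixture component (assumed)
centers <- rbind(c(7.5, 7.5),      # hypothetical component means
                 c(7.5, 12.5),
                 c(12.5, 7.5),
                 c(12.5, 12.5))
comp <- rep(1:4, each = n_per)     # component ID, 1-4

# 2-column "flat" data: a 4-component mixture of bivariate normals
flat <- data.frame(x = rnorm(4 * n_per, mean = centers[comp, 1], sd = 1),
                   y = rnorm(4 * n_per, mean = centers[comp, 2], sd = 1))

# 3-column rolled-up data, plus the component ID as a factor
roll <- data.frame(V1 = flat$x * cos(flat$x),
                   V2 = flat$y,
                   V3 = flat$x * sin(flat$x),
                   comp = factor(comp))
head(roll)
```

Since (x cos x)^2 + (x sin x)^2 = x^2, the first and third columns trace a spiral whose radius is |x|, which is where the swirl comes from.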

The original (i.e. 2-column) data looks like this:

Here is the goal:

Using one of the MV methods on the 3-column data, and not knowing that the 2-column data was a 4-component mixture, could a person discern that latter fact?

We’ll try UMAP (said to be faster than t-SNE) and prVis(). But what UMAP implementation to use? Several excellent ones are available for R, such as umapr, umap and uwot. I use the last one, as it was the easiest for me to install, and especially because it includes a predict() method.

Here is the code. We read in the data from a file included in the polyreg package into sw, and change the mode of the last column to an R factor. (Code not shown.) The data frame sw actually has 4 columns, the last being the mixture component ID, 1-4.

We first try PCA:


That swirl is where the Swiss Roll model gets its name, and clearly, the picture gives no hint at all as to how many components were in the original data. The model was constructed in such a way that PCA would fail.

So let’s try UMAP.


We get this:

So, how many components are there? (Remember, we’re pretending we don’t know it’s 4.) On the left, for instance, does that loop consist of just 1 component? 2? 3? We might go for 1. Well, let’s un-pretend now, and use the component labels, color-coded:


Wow, that “loop” actually contained 2 of the original components, not just 1. We were led astray.

Let’s try prVis().


Still rather swirly, but it definitely suggests 4 components. Let’s check, revealing the color-coded actual components:


Very nice! Each of the 4 apparent components really did correspond to an actual component.

Of course, usually the difference is not this dramatic, but in applying prVis() to a number of examples, I’ve found that it does indeed do as well as, or better than, UMAP and t-SNE, and in a simpler, more easily explained manner, which as noted is a major virtue.

I’ll post another example, showing an interesting interpretation of a real dataset, in the next day or so.