Our arXiv paper and the associated R package polyreg caused a bit of a stir, both pro and con, when we first announced them here in June. The discussion even spread as far as Twitter, Reddit and Hacker News. We’ll be announcing a revised paper, and various new features to the package, very soon.
But the purpose of this blog post is to focus on one particular new feature, a visualization tool. Over the years a number of “nonlinear” methods generalizing Principal Components Analysis (PCA) have been proposed, such as ICA and KPCA some time ago, and more recently t-SNE and UMAP.
I’ve long felt that applying PCA to “polynomial-ized” versions of one’s data should do well too, and of course much more simply, a major virtue. So, now that we have machinery, the polyreg package, to conveniently build multivariate polynomial models — including for categorical variables, an important case — I developed a new function in polyreg, named prVis(). I’ll illustrate it in this blog post, and compare it to UMAP.
A popular example in the “manifold visualization” (MV) business is the Swiss Roll model, which works as follows: A 4-component mixture of bivariate normals is generated, yielding a 2-column data frame whose column names are ‘x’ and ‘y’. Now derive from that a 3-column data frame, consisting triples of the form (x cos (x), y, x sin(x)).
The original (i.e. 2-column) data looks like this:
Here is the goal:
Using one of the MV methods on the 3-column data, and not knowing that the 2-column data was a 4-component mixture, could a person discern that latter fact?
We’ll try UMAP (said to be faster than t-SNE) and prVis(). But what UMAP implementation to use? Several excellent ones are available for R, such as umapr, umap and uwot. I use the last one, as it was the easiest for me to install, and especially because it includes a predict() method.
Here is the code. We read in the data from a file included in the polyreg package into sw, and change the mode of the last column to an R factor. (Code not shown.) The data frame sw actually has 4 columns, the last being the mixture component ID, 1-4.
We first try PCA:
That swirl is where the Swiss Roll model gets its name, and clearly, the picture gives no hint at all as to how many components were in the original data. The model was constructed in such a way that PCA would fail.
So let’s try UMAP.
So, how many components are there? (Remember, we’re pretending we don’t know it’s 4.) On the left, for instance, does that loop consists of just1 component? 2? 3? We might go for 1. Well, let’s un-pretend now, and use the component labels, color-coded:
Wow, that “loop” actually contained 2 of the original components, not just 1. We were led astray.
Let’s try prVis().
Still rather swirly, but it definitely suggests 4 components. Let’s check, revealing the color-coded actual components:
Very nice! Each of the 4 apparent components really did correspond to an actual component.
Of course, usually the difference is not this dramatic, but in applying prVis() to a number of examples, I’ve found that it does indeed do as well as, or better than, UMAP and t-SNE, but in a simpler, more easily explained manner, as noted a major virtue.
I’ll post another example, showing an interesting interpretation of a real dataset, in the next day or so.
7 thoughts on “Manifold Visualization: Polynomials to the Rescue”
Hi, I have tried to manually install your package, directly from R-Studio by loading the polyreg-master.zip file, or by copying the folders & files to the R user directory. Neither results in a working set up of the package. The error I get from loading the zip is: (as ‘lib’ is unspecified)
I appreciate your advise.
Hard to say given this little information. If you run R CMD INSTALL on the unzipped package and e-mail the messages, I can probably fix your problem.
Hi, thank you for this very interesting blog and the package. I would like to use prVis to add new coordinates to my data. What would be a way to predict with the output og prVis? Thank you!
The output of prVis() include the output of prcomp() after forming polynomials. Thus you can use the predict() method of prcomp().
Hi, thank you for your response. What I would like to do is to use your approach for feature engineering. I would like to introduce additional features based on your approach. I can not apply prVis() to the whole data set as my memory is not large enough. Thus I apply it to a subset. However, the resulting object contains the rotation matrix based on the polynomial expansion. In my case I have 50 original features. This gives 1325 columns after the expansion (of degree 2) and thus the dimension of the rotation matrix is 1325 x 1325. If I want to apply this to the remaining data then I first have to expand this data to dimension 1325. How can I do this in a way that matches the logic of prVis()? I can’t see how I could use the predict of the prout object method directly. Which function can I use to form the 1325 polynomial columns based on the 50 columns of the input data compatible to the output of the package? Thank you!
We are in the process of making prVis more memory-friendly, should release the new version within a couple of weeks.