You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a post doc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.

A summary of the paper is:

We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.

NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.

One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.

NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.

We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.

We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.

Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.

Couldn’t you say this is a specific example for the Universal Approximation Theorem? As it says NN can approximate any smooth function and we know any smooth function can be also approximated well with a Polynomial.

No, that is not the same. We are saying that NNs actually produce PR, which is quite different from saying that both NNs and PR are universal approximators.

Well,
Since NN generate analytic function once used with Analytic Activation I’d not be surprised you treat is as a Polynomial.
Hence you can say the problem is searching for the coefficients of the Polynomial, which is Regression.

If you could build the problem as PR and show that given the PR coefficients you fins the function is uniformly close (In extrapolations zones as well) then it will be a game changer for the way NN are trained.

If I’m not mistaken, it is not shown in the paper.

Our paper of course must be treated as preliminary. Hence our cautious phrasing, e.g. “NNAEPR suggests that.”

The uniformity issue is rather nuanced. The S-W Theorem does that continuous functions can be uniformly approximated on a compact set, and in the real world we ALWAYS have a compact set, since all variables are bounded. (No one is 90 feet tall.) So our argument could be made mathematically rigorous. But of course, the larger the bound, the longer it may take to reach a given level of approximation.

The more serious problem, though, is extrapolation itself. One simply cannot count on predicting well in regions in which we have no data. This is true for PR, NNs and any other method.

“The more serious problem, though, is extrapolation itself. One simply cannot count on predicting well in regions in which we have no data. This is true for PR, NNs and any other method.”

decades ago (alas), one of my professors warned that one cannot predict beyond the range of the data. time series folks do it all the time, of course. with purely God driven data, the problem is less critical, since God doesn’t change the rules of the data generation process (although we may not know all of them at a point in time, of course). with human driven data, all bets are off. the rule makers change the rules as often as need be to retain the advantage of being the rule maker.

I recently had an interesting conversation with a very prominent time series guy on this very topic. He conceded that of course extrapolation bias is a problem, but maintained that variance is actually the worse problem. As we all know, the variance of a predicted value generally increases as one moves away from the center of the data, and he said that in time series this becomes a problem much more quickly than the bias.

(Industrial mathematician here who’s been doing deep learning stuff the last year or so) I notice that you used MNIST as an example when talking about multicolinearity but didn’t give any numbers in the experimental results section. MNIST by itself isn’t a very interesting dataset these days given how incredibly well all sorts of NNs do on it, but for that very reason if you’re going to argue against NNs you should address how alternatives perform on it.

The other thing that struck me as odd was when you talk about ‘any reasonable activation function’. I agree that yes, of course, polynomials are dense w/r/t/ continuous functions, but I don’t think that this is particularly useful in describing common activation functions such as RELU (non-differentiable at 0) and softmax (all those denominator terms!) and sigmoid functions in general (bounded range).

(Also I glanced over https://github.com/matloff/polyreg and didn’t immediately see anything about working in an orthogonal basis (vs the monomial one) or how it’d be different than lm+poly )

The MNIST example was there simply to illustrate the multicollinearity issue. The author who worked on this section found it convenient to use an example built-in to the kerasformula package. There is no actual classification being done, in terms of what we were interested in.

The non-differentiability of RELU is not really an issue. Just approximate it with a smooth function. E.g. (x^2+c)^0.5 is a good fit to |x| for small c; adapt this example to RELU. For classification, we use glm() with the logit family and a polynomial argument. One could do this in regression cases too.

Yes, orthogonal polynomials are something we intend to try.

I’m trying to test the polyreg approach on the classic mnist data set. I am only trying to predict 1’s to make it a logistic regression. I cannot find a way to fit the model due to the combinational explosion of terms (even for degree 2). My code is below, am I doing something wrong?

I did read the paper and purposely used the phrase explosion of terms because that’s that how you phrased it in the paper. In section A.3 you give some tips on reducing the number of terms, mainly using principal components. I have tried this approach for the MNIST data set, with code shown above, and cannot get it to work. Maybe I’m making a coding error or approaching it wrong?

In the bigger picture, if you have more than 1000 features (which is the norm for deep learning applications), and PCA doesn’t drastically reduce this number, then I don’t see how this approach is feasible. In section 7, where you present accuracy results, for some of the data sets you list the number of features used for some you don’t. When listed, the number of features is less than 50 (unless I missed one). This would not be a typical application of deep learning.

You and we are in full agreement. There is indeed a combinatorial explosion, which you correctly note that we point out, and it indeed is an important issue to be addressed in future work. One member of our team, for instance, has suggested using PR+LASSO.

Note, though, that due to NNAEPR, whatever problem exists in PR has a counterpart in NNs. It is illusory to think that NNs solve a problem that exists with PR.

Note by the way that many applications of NNs that have thousands of features really don’t. With CNNs applied to images, the “C” layer performs huge dimension reduction.

The reason I asked you to read our paper is that you said that we had tried our methods on the MNIST data, which we did not. Sorry if my tone came across as snide.

From a quick glance at the paper, it seems to be seriously misleading, at least about the theory.

Any linear combination of polynomials of degree n or less will be a polynomial of degree n or less. Hence polynomial regression can approximate arbitrary functions only as higher and higher degree polynomials are used. In practice, this can lead to serious problems, especially with extrapolation, for which completely ridiculous results are easily obtained as the high-order terms of the polynomial blow up.

In contrast, neural networks with even a single hidden layer can approximate any function as the number of hidden units increases, as long as the unit activation function is anything EXCEPT a polynomial (I think this is proved by Hornik, Stinchcombe, and White). Increasing the number of hidden units to improve the approximation has much less serious effects than using high order polynomials, for typical activation functions such as tanh.

More than one hidden layer can be desirable, however, and can sometimes further improve the generalizability (compared to one hidden layer with many units).

Only an academic (I am one too) would make such strongly negative remarks in spite of having only just glanced at a document. 🙂

You have completely missed the main point of the paper. But I will make some comments anyway.

The central point of the paper is that, in essence NN fits are polynomial fits, i.e. NN is actually doing PR.

If the activation function is a polynomial, remove the qualifier “in essence.” Let’s discuss that case first. Here, the more layers one has, the higher the degree of the polynomial that NN generates, so one can, given enough data (n observations) to fit high-degree models, approximate the true regression function as close as desired. (I am treating classification as a special case of regression.) For a fixed number of layers, as one increases the number of neurons per layer, one quickly reaches a point in which the system is overdetermined, for the given degree that arises from the number of layers and the degree of the activation function. This is why in the case of polynomial activation function the number of layers must go to infinity as n does.

Now consider a nonpolynomial activation function that has a Taylor series (for convenience here). It can be approximated by its first k degrees, with the approximation getting better and better as k increases. We thus do not really have the problem of an overdetermined system in this case, and this is why a single layer suffices, per Kurt Hornik as you point out. (This also shows the reason for your recollection above concerning “EXCEPT.”) But the bottom line is that we still have that NN is, in essence, doing polynomial regression (PR).

As we point out in the paper, in addition to suggesting one might use PR in lieu of NNs, our finding that NNs, in essence, ARE in fact PR, can give us insights into the behavior of NNs. For instance, consider the extrapolation problem that you raise here. The fact that PR fits are unbounded then implies that NN fits ARE UNBOUNDED AS WELL, if one uses an unbounded activation function such as the very popular RELU. In the classification case, NN in the end generally uses a bounded activation function BUT SO DOES OUR PR PACKAGE, since we use glm() with the logistic link and with our PR argument.

In the general regression case, in which we use lm(), one could truncate the polynomial fit if one were worried about the extrapolation issue, BUT AGAIN WE COULD DO THIS FOR NNs AS WELL, say by using a bounded function rather than RELU at the last stage.

Again, the central point is this informal equivalence between NNs and PR.

One could say that if a neural network uses an activation function that is real analytic, then the function computed by the network will be real analytic, hence will have a Taylor series expansion, and will therefore “really” be just a polynomial (albeit of infinite order). But this characterization of the situation is completely unhelpful from any practical perspective.

In reality, the extrapolation characteristics of a neural network with tanh hidden units (a very common setup) are not anything like what you would get using polynomial regression of high-enough order to fit the training data comparably well.

Concerning your “analytic in, analytic out” point, what we are saying is a lot stronger than that. Note for instance that our argument shows that the “NN polynomial” has a higher and higher degree with each successive hidden layer. We use that to predict and then confirm that NNs are prone to multicollinearity, and that the multicollinearity should get worse and worse with each successive layer.

I fully agree with you Radford Neal. There is a lot about practibility and huge misconceptions between Theory and Practice in ML community.

Let just consider traditional fully connected Multi-Layer against PolyReg:

Claiming, as the author keeps, that:

a- we can polynomially approximate any continuous function on bounded sets: True

b- real data are commonly bounded sets: True

c- Common NN’s activation functions are continuous: True

then,

d- NN can be seen as compositions of polynomials: This is a very great fallacy.

Proof:

0- Analogy: This fallacy is similar to claiming that, since real data are the most time bounded, a Turing machine with a length-bounded tape can reasonably approximate the computation power of a TM with length-unbounded tape.
Answer: The tape length is not bounded because the data are unbounded, because it is not clear how much finite amount of tape is needed for intermediate results.

Similarly,

Polynomial regression is very very very bound-sensitive with growing degree.
1- Knowing that real data are bounded, even by knowing the bounds, does not give you insights into how the bounds are affected during signal propagation in the networks[composition]

2-Given a fix value-bounded set of data SD0, you can claim that a NN is realizing a polynomial regression and then use PolyReg indeed. HOWEVER, SLIGHTLY changing SD0 completely falsifies this CLAIM. SINCE LEARNING(Vs REPRESENTING) is about going from few to general, SD0(train) should reasonably be part of data domain DD only,

The claim that PolyReg approaches is, strictly spoken, true iff SD0=DD[make no sense regarding the concept of learning]

I think it is helpful to consider cases of exact, rather than approximate polynomial activation functions (AFs). We’ve mentioned that, for example, this is the case if one uses a transcendental AF such as tanh and one’s software uses a finite Taylor series to implement it. But even better, consider the ever-popular reLU. In that case, NNs are EXACTLY piecewise-polynomial regression. Then everything we say in our paper goes through exactly, with “polynomial” replaced by “piecewise-polynomial.” Same issues. (Someone actually mentioned splines earlier in this discussion.)

The point you bring up about generalization at the edges of the training data is of course correct, and is always a problem with any method. But we maintain that these problems are especially similar between NNs and PR, due to the correspondences we’ve pointed out.

In any case, as the old saying goes, “The proof of the pudding is in the eating.” So far, at least, we’ve found that PR is as good as or better than NNs in the datasets we’ve tried. We believe that in cases in which PR outperforms NNs, it is because the latter are stuck on local minima. We will be reporting on new datasets, and improved results on our old datasets, very soon.

We’ll be releasing a new version of polyreg later this month, with improved speed and new features.

Informal equivalence notwithstanding, Choon et al. (2008) stated that NNs are better than PR for g5(x,y) = 1.9*(1.35+exp(x)*sin(13*(x-0.6)^2)*exp(-y)*sin(7*y)). So maybe it’s not entirely right to “abandon using NNs altogether and simply use PR instead”?

I’m not sure that you noticed, but we do cite the Choon work in our paper. Actually, I have a mental note to myself to contact them, as their methods are not clear. E.g., it’s not clear that their FVU values are computed for predictions on new data or on the data on which the model was fitted.

Note that we certainly did not claim there is no dataset anywhere in the world in which PR does not meet or beat NN performance.
And in your quote of my blog post, you have omitted the qualifier “NNAEPR suggests that.” No fair! 🙂

Hi, I took a look at your paper–interesting stuff! It confirms some of the thoughts I had about NN’s, so of course the paper is great. 😉 One thought I had about it. Rather than comparing accuracies, it would be even better to compare predictions. If PR approximates a NN, it’s not enough that the accuracies are similar; you should get same or similar predictions for most or almost every sample. Did you take a look at that comparison?

Interesting paper, thanks. There is an R package RAMP for Regularized Generalized Linear Models with Interaction Effects, with a related paper Regularized Generalized Linear Models with Interaction Effects by Ning Hao, Yang Feng, and Hao Helen Zhang. The basic idea is that you won’t choose the product x_i*x_j as a predictor unless x_i and x_j are chosen as predictors. It would be interesting to compare the predictions of RAMP with neural networks.

Yes, I am aware of that work. At present, we don’t do any “choosing” at all (other than with an optional dropout feature). The problem of variable selection in regression has a long history, and though a multitude of methods have been proposed, none has truly solved the problem. It is thus easier to simply fit the full model. But with a lot of features, this becomes infeasible, and we will likely have to resort to doing some “choosing.” As I mentioned in response to another reader here today, one member of our team would like to try PR+LASSO.

Ha, ha! I just had an exchange on Twitter with Frank. He had claimed that logistic regression is NOT a classification tool, and someone else said it IS. I sided with the other guy, hopefully without indirectly causing any harm to kittens. 🙂

Kidding aside, I agree with Frank Harrell that if your algorithm predicts a probability, which is a continuous outcome, then you should use a continuous proper (or at least semi-proper) scoring rule to evaluate the performance of the model in comparison to other competing models. I believe this for all of the very good reasons that Harrell usually provides. Is the codebase for this project available in a GitHub repository somewhere? I would like to reproduce the analysis, but using proper and semi-proper scoring rules (possibly in addition to analysis of the confusion matrix, which I don’t think either your neural networks or logistic regressions are optimizing).

Our code is on GitHub. See the paper, and see my other comments here to day on details of the arguments used.

Using proportion of correct classification as a performance criterion is pretty standard, and corresponds to what ordinary people (e.g. consulting clients) want to see. They may also find the confusion matrix useful.

I agree with you somewhat about what ordinary people usually want to see, but your paper is about the comparison of the performance of different types of algorithms, and when maximizing over that set, you should use proper scoring rules not improper ones. There are also semi-proper scoring rules (e.g. concordance index and its relative, AUC) that are easy to explain to ordinary folk, so it is a good compromise. I have also successfully convinced executive-level people in industry that probability predictions are the right model output to look at in many business cases, and helped them understand how they can use those predictions to greater effect than a simple classification.

Anyway, I’ll re-read the sections of your paper that talk about polyreg. I was unaware that the package is not just for polynomial regression, but for comparison to neural networks as well! Sorry about that, I was focusing on the results in my first read.

I also agree with Frank Harrell that logistic regression is a probability prediction algorithm. I also agree with you that it can be used for classification, but disagree with anyone who says that models should be optimized with respect to a discontinuous improper scoring rule.

It’s not that clearcut. There are various duality properties that tie together the continuous and categorical aspects.

First of all, the logistic model is actually implied by the conditions for Fisher discriminant analysis — multivariate normal X, equal covariance matrices — i.e. a continuous model. I assume you would accept the fact that Fisher LDA is a model for classification, despite its continuous nature, right?

Second, if one’s goal is maximizing the overall probability of correct classification, then the mathematically optimal rule (e.g. 2-class case) is to guess Y = 1 or 0, depending on whether P(Y = 1 | X) > 0.5.

By these are in my book, Statistical Regression and Classification: from Linear Models to Machine Learning, published by CRC last year. (Couldn’t resist putting in a plug. 🙂 )

Again I disagree, both for the savvy B2B case and for the B2C case, although of course it depends on the product. Like a facial recognition app should just tell you whose face it most likely is, since faces have a very high signal-to-noise ratio. But say we’re talking about email spam. Sure you can have the spam classifier, but you can also have an email retrieval system whereby someone could search their junk folder for an email they lost with a given set of keywords, and the spam score could be used in the algorithm for retrieving the emails and sorting them, maybe into multiple buckets of measured spam risk. So then the user could start with low risk emails that got “mis-classified”, then if they don’t find it there, they could search deeper, or just give up. Another setting where probabilistic outcomes helps a lot of people is, weather forecasts. Sure you see the icon that calls for stormy weather, but almost everyone I know, and not just super-educated quants, also looks at the chance of rain and other continuous measures of inclement weather risk. Another scenario in which I’d much prefer probability prediction rather than classification is an algorithm that estimates the probability of recidivism or the probability of skipping bail. Using the probability, the justice system could estimate the cost to the state of taking a particular action. Another area where probability predictions could create better products than simple classifications is gaming apps.

(Sorry the above was supposed to be a reply to “I like the KISS principle (Keep It Simple, Stupid). I have a feeling I would be a harder sell for your sales pitch than your business clients.”)

“Second, if one’s goal is maximizing the overall probability of correct classification, then the mathematically optimal rule (e.g. 2-class case) is to guess Y = 1 or 0, depending on whether P(Y = 1 | X) > 0.5.”

One last thing! Since the results for your models of a continuous input use proper scoring rules, I definitely believe in those results, and this is a very interesting paper for me because it has huge implications for the use of Bayesian polynomial regression models with principled regularizing priors as substitutes for neural networks. I look forward to the sequels.

What is the interpretation of large accuracy values, those greater than 1? From the relevant explanation it seems _lower_ is better in those cases, but I’m curious to get a more absolute understanding of performance there. For the taxi problem in particular, how do these approaches compare to the Kaggle winners, and why not use the metric from the competition?

As noted in the text of the paper — it must be very boring prose, since so many people here seem to have avoided reading it 🙂 — some examples involve regression and others involve classification. In the former case, there is no bound.

We also note in the paper that some of the datasets are pretty dirty, and that no effort was made on our part to clean it. In the taxi data, for instance, there is one trip that claims to be of 3 days’ duration and another trip that claims the passengers were picked up in Antartica 🙂 . I don’t know how Kaggle and/or its users handle this.

I did indeed miss that bit; relabeling those columns “MAPE” might help others avoid this confusion.

For the taxi one, or any other for which prior art exists, it may also be worth the hassle of coding up the standard metric; Kaggle used RMSLE for that one (https://www.kaggle.com/c/nyc-taxi-trip-duration#evaluation). You could also submit your predictions and let them compute it for you!

Personally, I don’t like the S (and thus the R) in RMSLE.

It would be great if anyone out there were to run our examples with other parameter values etc. As I said earlier, any tuning parameter not mentioned had the default values, and we did no data cleaning. So it would be easy for anyone to replicate our experiments, say under other conditions.

Was the only tuning of the Keras models the shown variation in layer size, e.g. no change in dropout, no other layer arrangements, no variation in learning rate or batch size?

How did the training/test loss curves look, were all keras models overfit or was there underfitting in some cases? How does this compare to the balance struck by the polyreg models?

We really should have provided more detail. Basically, anything that was not mentioned in the tables took the default values. We’ll put this information in our revised paper, but if you would like to know sooner, just type args(kerasformula) etc. We have not yet done further detailed analysis of the type you suggest.

That makes sense; I see the defaults, but I suspect they aren’t appropriate for these problems.

If the deep models you’re comparing to are not well-balanced ones, would that affect your conclusions? E.g. if your MNIST model is overfit, it may be less surprising to observe severe and increasing VIF scores. I suspect that observation may not hold as strongly in a well-specified/well-trained deep model, even those well below SotA performance.

One of our team members is actually the author of kerasformula, and did more extensive experiments. See the CrossFit and cancer examples. But once again, we welcome and encourage all of you out there to try some other configurations of the tuning parameters. That is one of the reasons we put our paper on arXiv.

My own view is that many people have become overly-mesmerized by all the various bells and whistles that people have developed over the years for handling NNs. My gut feeling is that all that tweaking is at best moving one away from one bad local minimum to another (possibly better) local minimum.

Another problem is that many people are unaware of the limitations of cross-validation. If one separates the data into training and test sets, the latter set playing the role of new data, but then tries hundreds of configurations of the tuning parameters on that new data, it is NOT new data anymore. One really ought to split the data into THREE sets.

The number of predictors (sections 7.1.1 to 7.17) are: 16, 90, 8, 16, 54, 20. Only for two of the data sets they are > 20 but even those are relatively small compared to the intended use of NN. To be clear, NN is not supposed to be a good out of the box algorithm; there’s extensive research that bagging or boosting algorithms are. NNs are good at extracting information out of large number of noisy dependent variables. Think count of the pixels in an image (not the fake 16 geometric features “extracted” from the image), or words in dictionary, that are ORDERS of magnitude larger than the sample problems analyzed in the article. So, to begin with, the comparison with NN is used in cases that were not meant for NN.

But let’s go ahead and point some fair criticism to the paper:

1. Polynomial regressions assume quadratic penalty for errors, even for higher order cross terms. Shouldn’t there be less penalty for higher order coefficients? NN have way more flexible way of determining the loss function.

2. How do you deal with both categorical and continuous dependent variables. NN with multiple heads as well as custom loss functions works.

3. When I was learning tensor flow, I did a simple example with simulated data: X1, X2, X3 normally distributed N(0, 1). The true model has a CONDITIONAL dependency:
Y = X1+X2 if X3<0 else X1-X2 with noise added. All standard ML algorithms (SVM, forests, NN) learned this relation perfectly (i.e. out of sample prediction mse equal to noise variance) for large number of data points. Regression failed miserably: beta_1=1, beta_2=0, beta_3=? I didn’t try polynomial regression but I bet it won’t find the perfect out of sample prediction no matter what degree of polynomials you include.

4. There’s more to NN than weights and global minimization of loss. There’s batch training for auto correlated data, stochastic gradient descent for large number of dependent variables, convolution when there’s a relation between adjacent regressors, etc.

5. In section 6, MNIST is used for a criticism of NN regarding co-linearity of NN weights, but in the next section that claims PR is as effective as NN (if not better), there’s no mention of MNIST instead a 16 feature letter recognition data is used. Nice try, but if you are indeed serious about this research, I challenge you to beat the out of sample prediction error of 0.21% of MNIST using PR. https://en.wikipedia.org/wiki/MNIST_database#Classifiers
I tried to to use multivariate logistic regression on MNIST: 20% error rate, 100 times worse than NN.

6. For me this is the most important aspect of NN: Alpha go zero showed very strong evidence that when neural networks predict several things at once they are better than predicting each thing separately. In that case predicting the next move as well as the winner of the game is much stronger than predicting the next move and the winner separately. No regression, no boosting, no svm, no forests, NO OTHER algorithm has this property, which is a really BIG deal.

I look forward to the day when poly regression will beat alpha go zero. There's a famous Murphy law: every complicated problem has a simple solution. Which is wrong. 🙂

Your point that we have not yet explored large-p settings is correct (though now new, since we emphasize it in the paper, along with pointing out the need to solve some technical obstacles to going to large-p). Your claim that dimension reduction does not reduce p is obviously incorrect.

I really haven’t thought much about the case of vector-valued dependent variables, but I see no reason why PR shouldn’t do well there. Note that any classification problem is such a case, and a number of our examples were of that type.

I object to statements about NNs etc. along the lines of “the evidence shows…” Unless there is some theoretical justification for a claim, one is doing not much more than speculation. There are lots of examples in which boosting does NOT work out of the box, contrary to your statement. Our section on NNAEPR shows that theoretical justification for PR (an informal account, but it could be made mathematically rigorous).

What you say about penalty functions in PR is incorrect. The penalty only concerns prediction error, not errors in the coefficients. Frankly, such misinformation is common among ML people.

I’m pretty sure that PR would do fine on your contrived example. Why don’t you try it?

You’ve even got Murphy’s Law wrong; it’s “Anything that can go wrong will go wrong.”

“I’m pretty sure that PR would do fine on your contrived example. Why don’t you try it?”

I tried it; I went up to fourth order both in the polynomial exponent and in the interactions. The model had 794 coefficients; I fit it against 1000 simulated data points. I used noise with sd = 0.03. The range of the residuals was +/- 1.

I also fit a linear model with X1 and X2 but with signum(X3) instead of X3 and with second-order interactions. The true data-generating process is in the space of models: Y = X1 + signum(X3)*X2 + err. The coefficients were correct up to noise and the range of residuals was +/- 0.1.

1. “I really haven’t thought much about the case of vector-valued dependent variables, but I see no reason why PR shouldn’t do well there”: If you use one-hot encoding for categorical variables (days of the week for example), then all polynomial regressors that contain the zero encoded variable will be zero, and higher order polynomials of 1 are idempotent, so it looks like your polynomial basis is not complete anymore.

2. “I object to statements about NNs etc. along the lines of “the evidence shows…” The reality exists even if there’s no rigorous theory for it, pretty much like gravity existed even before Newton formulated a theory for it. So, if alpha go zero beats the previous version 100-0 when you merge the two networks AND learn from scratch, I would say euphemistically that: “there is strong evidence that…” but in reality I mean: “let’s face it, it actually works”. As a matter of fact I see this as a statement of academic impotence: discarding experimental evidence because there’s no theoretical explanation for it.

If you read the paper, you will see that we recognize the idempotence issue in the software. In terms of the theory, actually it is even easier for this case. You are failing to consider the interaction terms.

Part of why NN are powerful is that they give valuable inductive biases. All of the “progress” in NN is not in deep densely connected networks, such as the ones you are looking at. What is so powerful is that non-trivial DAGs (such as convolutions, attention, highway/gating) bias certain functional forms during training. PR also bias certain forms, but it is unclear to me that you could write down a PR that replicates some complicated DAGs like WaveNet/etc. If we had an optimizer that didn’t need these biases the perhaps you could replace NNs with PRs, but by that token you can replace it with any universal approximator. Your results about collinearity for deeper layers also seems like an issue of overtraining/too many fitting parameters and are also common in computer vision. For resnet, for example, later layers tend to have weak and collinear activations (which is somewhat rectified by dense convnets). One technique that allows refinement of large networks is knowledge distillation (see “distilling the knowledge in a NN” Hinton, Vinyals, Dean), but I don’t see how it can be applied to PR. Your suggestion of PCA is generally not very good in practice since we most care about manifolds that are not linear projections.

Yes, we briefly address the idea of these more complex NNs in our paper. By the way, we’ll have a revision up in a couple of days, including an image classification example.

PCA is useful in most situations in which the relations between predictors and response is monotone; it need not be linear. What is more interesting is the nonmonotone cases; there, applying PCA to the second- or maybe third-degree polynomial terms should typically be enough, but again we won’t know until we try it out.

It seems that in quite a few of the regression test cases, PR 1 outperformed KF and DN. But isn’t PR1 simply linear regression? For example, table 4 (regression of engineer income), table 6 (million song year regression), table 8 (letter recognition), table 9 (taxi time regression).

It seems a little odd that chosen benchmark methods can’t match linear regression.

It reminds me of “extreme learning machines” where you take a (fixed) vector to vector random projection of the input data and pass that through non-linear activation functions. Then you train a linear readout layer to give the response wanted for the input. Basically then you get an associative memory.
Anyway an ordinary neural network layer can be viewed as an associative memory as well. And you see techniques like direct feedback alignment being used. Where you are creating an multilayer associative memory with error correction effects and avoidance of “crosstalk” between memories. Is this the same as a deep neural network trained by back-propagation? I would guess not.
You can have a look at my code for associative memory: https://github.com/S6Regen/Associative-Memory-and-Self-Organizing-Maps-Experiments

And the implied equivalence is more structurally precise. If you believe the logic of the article, a polynomial regression has to be equivalent to neural network, and the number of layers in the neural network is equal to the degree of the polynomial in the equivalent polynomial regression.

Read it 1st to last page. Had been thinking this (in far hand wavier terms) myself for a while. Was thinking that feed forward nns might mostly be heuristic Taylor series in disguise. Would love to see the “could be made mathematically rigorous” part fleshed out so I can print it out, post it above my desk and point to it next time a colleague scoffs at my preference for GLM and simpler models for non specialized (eg cnn) problems. In any case as you say the c part of cnn is primarily dimensionality reduction with application beyond nn.

I am impressed at your dedication to responding to comments! I was wondering, will NNs essentially be equivalent in the long run to any ways of constructing functions such that you are doing ‘universal function approximation’? I am also curious as to what other ways besides neural networks there would be to maintain being able to universally approximate any function, or maybe if there are canonical resources for diving more into this specific topic out there?

Hello, I’m wondering why you compared PR to NN with polynomial activation functions. I’m not too surprised that overall you get polynomial regression functions by using such activations… My intuition is that NN work as a series of linearities followed by non linearities. In fact, I remember having seen some (generalisation or expressivity, I don’t remember exactly) results specific to NN with non polynomial activation functions. In your experiments, what kind of activation functions did you use for NNs?

We will be releasing an updated and more details version of our arXiv paper soon, along with putting an updated version of polyreg on CRAN. Please bear with us until then.

I liked the paper. I think you could make the same argument about things like gradient boosting trees. Actually, what ReLu does is the same as a bifurcation in a decision tree. ReLu defines a region relative to the domain of the previous region in which downstream computations affect the solution. Hence, ReLu and gradient boosting etc all are doing exactly local polynomial fits, with order determined by depth of the network. I think it is the locality rather than the nonlinearity makes neural networks universal in a useful way.

Well, anything smooth is “local polynomial” in the approx. sense. I think you’ll find from a small example that it is not exact, though.

As to gradient boosting, we don’t cover it, as it doesn’t have the “mysterious black box” quality of NNs. Finding an asymptotic distribution should be doable, and I assume people have done so, but I don’t see a polynomial argument. With NNs, the polynomial comes naturally from the number of layers; there seems to be no obvious analog in GB.

Would you be willing to comment on the GMDH methods as compared to polynomial regression and the relevance of your work to the topic? GMDH sounds at least superficially related, but it seems that people who promote GMDH are somewhat walled off from the rest of the statistical modelling community, making it difficult for an outsider like me to judge if GMDH is to statistics as cold fusion is to physics, or if it’s a mundane thing made to sound profound, or if it’s something meriting serious further study.

Don’t know, hadn’t heard of it before you brought it up. Your “cold fusion” comment is spot on for the machine learning field 🙂 , in my opinion, and my first reaction is the GMDH may be the same, but I really need to lookinto this.

I have a comment about your heuristic argument in Chapter 6.1 “The Basic Argument”:
Taylor series for activation functions like tanh or the sigmoid have a finite radius of convergence (in fact: the radius is very small). So I cannot see why your argument with the Taylor series approximation holds true.
Could you please clarify.

For large samples, which are typical in NN usage, one will be within the radius of convergence. That’s how class stat asymptotic theory works. But your point probably does apply to non-convergene of the iteration process.

I like your paper.
The radius of convergence of tanh is pi/2 (for the sigmoid the radius is a bit bigger). With a reasonable low order polynomial a good approximation is only valid for the linear part of tanh, resp. the sigmoid. I get the impression that in this case your argument essentially justifies the approximation of a linear NN = linear regression with a polynomial regression. In the case of a highly non-linear NN (the activation is used where it is curved) it might need a more rigorous argument with lots of estimations and inequalities. Do you know of such a research?
The possible non-convergence of the iteration process is another issue.

I believe we understand each other and do not really disagree. There are two senses of convergence here, as we both have said, but they do interact, as I said.

If you look at standard asymptotic analysis, say for MLE, as the sample size n goes to infinity eventually theta-hat is close enough to theta so that a first-degree Taylor approximation is valid with high probability. This is true no matter what the radius of convergence of that Taylor series is, or for that matter even whether there are infinitely many derivatives. The same argument could be constructed for the poly NNs setting.

I thank you back. I had previously argued with my coauthors that we don’t need a formal proof, but after this exchange with you, I now think otherwise.

Couldn’t you say this is a specific example for the Universal Approximation Theorem? As it says NN can approximate any smooth function and we know any smooth function can be also approximated well with a Polynomial.

No, that is not the same. We are saying that NNs actually produce PR, which is quite different from saying that both NNs and PR are universal approximators.

Well,

Since NN generate analytic function once used with Analytic Activation I’d not be surprised you treat is as a Polynomial.

Hence you can say the problem is searching for the coefficients of the Polynomial, which is Regression.

If you could build the problem as PR and show that given the PR coefficients you fins the function is uniformly close (In extrapolations zones as well) then it will be a game changer for the way NN are trained.

If I’m not mistaken, it is not shown in the paper.

Our paper of course must be treated as preliminary. Hence our cautious phrasing, e.g. “NNAEPR suggests that.”

The uniformity issue is rather nuanced. The S-W Theorem does that continuous functions can be uniformly approximated on a compact set, and in the real world we ALWAYS have a compact set, since all variables are bounded. (No one is 90 feet tall.) So our argument could be made mathematically rigorous. But of course, the larger the bound, the longer it may take to reach a given level of approximation.

The more serious problem, though, is extrapolation itself. One simply cannot count on predicting well in regions in which we have no data. This is true for PR, NNs and any other method.

“The more serious problem, though, is extrapolation itself. One simply cannot count on predicting well in regions in which we have no data. This is true for PR, NNs and any other method.”

decades ago (alas), one of my professors warned that one cannot predict beyond the range of the data. time series folks do it all the time, of course. with purely God driven data, the problem is less critical, since God doesn’t change the rules of the data generation process (although we may not know all of them at a point in time, of course). with human driven data, all bets are off. the rule makers change the rules as often as need be to retain the advantage of being the rule maker.

I recently had an interesting conversation with a very prominent time series guy on this very topic. He conceded that of course extrapolation bias is a problem, but maintained that variance is actually the worse problem. As we all know, the variance of a predicted value generally increases as one moves away from the center of the data, and he said that in time series this becomes a problem much more quickly than the bias.

(Industrial mathematician here who’s been doing deep learning stuff the last year or so) I notice that you used MNIST as an example when talking about multicolinearity but didn’t give any numbers in the experimental results section. MNIST by itself isn’t a very interesting dataset these days given how incredibly well all sorts of NNs do on it, but for that very reason if you’re going to argue against NNs you should address how alternatives perform on it.

The other thing that struck me as odd was when you talk about ‘any reasonable activation function’. I agree that yes, of course, polynomials are dense w/r/t/ continuous functions, but I don’t think that this is particularly useful in describing common activation functions such as RELU (non-differentiable at 0) and softmax (all those denominator terms!) and sigmoid functions in general (bounded range).

(Also I glanced over https://github.com/matloff/polyreg and didn’t immediately see anything about working in an orthogonal basis (vs the monomial one) or how it’d be different than lm+poly )

The MNIST example was there simply to illustrate the multicollinearity issue. The author who worked on this section found it convenient to use an example built-in to the

kerasformulapackage. There is no actual classification being done, in terms of what we were interested in.The non-differentiability of RELU is not really an issue. Just approximate it with a smooth function. E.g. (x^2+c)^0.5 is a good fit to |x| for small c; adapt this example to RELU. For classification, we use

glm()with the logit family and a polynomial argument. One could do this in regression cases too.Yes, orthogonal polynomials are something we intend to try.

I’m trying to test the polyreg approach on the classic mnist data set. I am only trying to predict 1’s to make it a logistic regression. I cannot find a way to fit the model due to the combinational explosion of terms (even for degree 2). My code is below, am I doing something wrong?

# DL setup for MNIST ——————————————————

library(keras)

# Input image dimensions

img_rows <- 28

img_cols <- 28

# The data, shuffled and split between train and test sets

mnist <- dataset_mnist()

x_train <- mnist$train$x

y_train <- mnist$train$y

x_test <- mnist$test$x

y_test <- mnist$test$y

# Redefine dimension of train/test inputs

x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))

x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))

input_shape <- c(img_rows, img_cols, 1)

# Transform RGB values into [0,1] range

x_train <- x_train / 255

x_test <- x_test / 255

# Polyreg —————————————————————–

library(polyreg)

num_to_predict=1

x_mat_train=matrix(nrow=dim(x_train)[1],ncol=28*28,x_train[1:length(x_train)],byrow=TRUE)

idx=which(y_train==num_to_predict)

y=rep(0,nrow(x_mat_train))

y[idx]=1

xy=cbind(x_mat_train,y)

# degree 3, error memory exhausted

# fit=polyFit(xy, deg=3, maxInteractDeg=3, use = "glm", pcaMethod=TRUE, pcaPortion = 0.9)

# try degree 2

# still errors out due to memory exhausted

fit=polyFit(xy, deg=2, maxInteractDeg=2, use = "glm", pcaMethod=TRUE, pcaPortion = 0.9)

Yep, absolutely, there is a combinatorial explosion, which must be dealt with. This is discussed in a fair amount of depth in the paper.

So it doesn’t work on a simplified version of the MNIST data set but you claim it has better performance for every data set you tested?

Please read the paper. We never said our method “doesn’t work on a simplified version of the MNIST data set.”

I did read the paper and purposely used the phrase explosion of terms because that’s that how you phrased it in the paper. In section A.3 you give some tips on reducing the number of terms, mainly using principal components. I have tried this approach for the MNIST data set, with code shown above, and cannot get it to work. Maybe I’m making a coding error or approaching it wrong?

In the bigger picture, if you have more than 1000 features (which is the norm for deep learning applications), and PCA doesn’t drastically reduce this number, then I don’t see how this approach is feasible. In section 7, where you present accuracy results, for some of the data sets you list the number of features used for some you don’t. When listed, the number of features is less than 50 (unless I missed one). This would not be a typical application of deep learning.

You and we are in full agreement. There is indeed a combinatorial explosion, which you correctly note that we point out, and it indeed is an important issue to be addressed in future work. One member of our team, for instance, has suggested using PR+LASSO.

Note, though, that due to NNAEPR, whatever problem exists in PR has a counterpart in NNs. It is illusory to think that NNs solve a problem that exists with PR.

Note by the way that many applications of NNs that have thousands of features really don’t. With CNNs applied to images, the “C” layer performs huge dimension reduction.

The reason I asked you to read our paper is that you said that we had tried our methods on the MNIST data, which we did not. Sorry if my tone came across as snide.

Thanks for the clarification.

From a quick glance at the paper, it seems to be seriously misleading, at least about the theory.

Any linear combination of polynomials of degree n or less will be a polynomial of degree n or less. Hence polynomial regression can approximate arbitrary functions only as higher and higher degree polynomials are used. In practice, this can lead to serious problems, especially with extrapolation, for which completely ridiculous results are easily obtained as the high-order terms of the polynomial blow up.

In contrast, neural networks with even a single hidden layer can approximate any function as the number of hidden units increases, as long as the unit activation function is anything EXCEPT a polynomial (I think this is proved by Hornik, Stinchcombe, and White). Increasing the number of hidden units to improve the approximation has much less serious effects than using high order polynomials, for typical activation functions such as tanh.

More than one hidden layer can be desirable, however, and can sometimes further improve the generalizability (compared to one hidden layer with many units).

Only an academic (I am one too) would make such strongly negative remarks in spite of having only just glanced at a document. 🙂

You have completely missed the main point of the paper. But I will make some comments anyway.

The central point of the paper is that, in essence NN fits are polynomial fits, i.e. NN is actually doing PR.

If the activation function is a polynomial, remove the qualifier “in essence.” Let’s discuss that case first. Here, the more layers one has, the higher the degree of the polynomial that NN generates, so one can, given enough data (n observations) to fit high-degree models, approximate the true regression function as close as desired. (I am treating classification as a special case of regression.) For a fixed number of layers, as one increases the number of neurons per layer, one quickly reaches a point in which the system is overdetermined, for the given degree that arises from the number of layers and the degree of the activation function. This is why in the case of polynomial activation function the number of layers must go to infinity as n does.

Now consider a nonpolynomial activation function that has a Taylor series (for convenience here). It can be approximated by its first k degrees, with the approximation getting better and better as k increases. We thus do not really have the problem of an overdetermined system in this case, and this is why a single layer suffices, per Kurt Hornik as you point out. (This also shows the reason for your recollection above concerning “EXCEPT.”) But the bottom line is that we still have that NN is, in essence, doing polynomial regression (PR).

As we point out in the paper, in addition to suggesting one might use PR in lieu of NNs, our finding that NNs, in essence, ARE in fact PR, can give us insights into the behavior of NNs. For instance, consider the extrapolation problem that you raise here. The fact that PR fits are unbounded then implies that NN fits ARE UNBOUNDED AS WELL, if one uses an unbounded activation function such as the very popular RELU. In the classification case, NN in the end generally uses a bounded activation function BUT SO DOES OUR PR PACKAGE, since we use

glm()with the logistic link and with our PR argument.In the general regression case, in which we use

lm(), one could truncate the polynomial fit if one were worried about the extrapolation issue, BUT AGAIN WE COULD DO THIS FOR NNs AS WELL, say by using a bounded function rather than RELU at the last stage.Again, the central point is this informal equivalence between NNs and PR.

One could say that if a neural network uses an activation function that is real analytic, then the function computed by the network will be real analytic, hence will have a Taylor series expansion, and will therefore “really” be just a polynomial (albeit of infinite order). But this characterization of the situation is completely unhelpful from any practical perspective.

In reality, the extrapolation characteristics of a neural network with tanh hidden units (a very common setup) are not anything like what you would get using polynomial regression of high-enough order to fit the training data comparably well.

Actually, one of the surprises was that low-order polynomials worked so well.

Concerning your “analytic in, analytic out” point, what we are saying is a lot stronger than that. Note for instance that our argument shows that the “NN polynomial” has a higher and higher degree with each successive hidden layer. We use that to predict and then confirm that NNs are prone to multicollinearity, and that the multicollinearity should get worse and worse with each successive layer.

I fully agree with you Radford Neal. There is a lot about practibility and huge misconceptions between Theory and Practice in ML community.

Let just consider traditional fully connected Multi-Layer against PolyReg:

Claiming, as the author keeps, that:

a- we can polynomially approximate any continuous function on bounded sets: True

b- real data are commonly bounded sets: True

c- Common NN’s activation functions are continuous: True

then,

d- NN can be seen as compositions of polynomials: This is a very great fallacy.

Proof:

0- Analogy: This fallacy is similar to claiming that, since real data are the most time bounded, a Turing machine with a length-bounded tape can reasonably approximate the computation power of a TM with length-unbounded tape.

Answer: The tape length is not bounded because the data are unbounded, because it is not clear how much finite amount of tape is needed for intermediate results.

Similarly,

Polynomial regression is very very very bound-sensitive with growing degree.

1- Knowing that real data are bounded, even by knowing the bounds, does not give you insights into how the bounds are affected during signal propagation in the networks[composition]

2-Given a fix value-bounded set of data SD0, you can claim that a NN is realizing a polynomial regression and then use PolyReg indeed. HOWEVER, SLIGHTLY changing SD0 completely falsifies this CLAIM. SINCE LEARNING(Vs REPRESENTING) is about going from few to general, SD0(train) should reasonably be part of data domain DD only,

The claim that PolyReg approaches is, strictly spoken, true iff SD0=DD[make no sense regarding the concept of learning]

Thanks for the comments.

I think it is helpful to consider cases of exact, rather than approximate polynomial activation functions (AFs). We’ve mentioned that, for example, this is the case if one uses a transcendental AF such as tanh and one’s software uses a finite Taylor series to implement it. But even better, consider the ever-popular reLU. In that case, NNs are EXACTLY piecewise-polynomial regression. Then everything we say in our paper goes through exactly, with “polynomial” replaced by “piecewise-polynomial.” Same issues. (Someone actually mentioned splines earlier in this discussion.)

The point you bring up about generalization at the edges of the training data is of course correct, and is always a problem with any method. But we maintain that these problems are especially similar between NNs and PR, due to the correspondences we’ve pointed out.

In any case, as the old saying goes, “The proof of the pudding is in the eating.” So far, at least, we’ve found that PR is as good as or better than NNs in the datasets we’ve tried. We believe that in cases in which PR outperforms NNs, it is because the latter are stuck on local minima. We will be reporting on new datasets, and improved results on our old datasets, very soon.

We’ll be releasing a new version of polyreg later this month, with improved speed and new features.

Informal equivalence notwithstanding, Choon et al. (2008) stated that NNs are better than PR for g5(x,y) = 1.9*(1.35+exp(x)*sin(13*(x-0.6)^2)*exp(-y)*sin(7*y)). So maybe it’s not entirely right to “abandon using NNs altogether and simply use PR instead”?

I’m not sure that you noticed, but we do cite the Choon work in our paper. Actually, I have a mental note to myself to contact them, as their methods are not clear. E.g., it’s not clear that their FVU values are computed for predictions on new data or on the data on which the model was fitted.

Note that we certainly did not claim there is no dataset anywhere in the world in which PR does not meet or beat NN performance.

And in your quote of my blog post, you have omitted the qualifier “NNAEPR suggests that.” No fair! 🙂

Hi, I took a look at your paper–interesting stuff! It confirms some of the thoughts I had about NN’s, so of course the paper is great. 😉 One thought I had about it. Rather than comparing accuracies, it would be even better to compare predictions. If PR approximates a NN, it’s not enough that the accuracies are similar; you should get same or similar predictions for most or almost every sample. Did you take a look at that comparison?

Interesting idea, definitely something to try.

Interesting paper, thanks. There is an R package RAMP for Regularized Generalized Linear Models with Interaction Effects, with a related paper Regularized Generalized Linear Models with Interaction Effects by Ning Hao, Yang Feng, and Hao Helen Zhang. The basic idea is that you won’t choose the product x_i*x_j as a predictor unless x_i and x_j are chosen as predictors. It would be interesting to compare the predictions of RAMP with neural networks.

Yes, I am aware of that work. At present, we don’t do any “choosing” at all (other than with an optional dropout feature). The problem of variable selection in regression has a long history, and though a multitude of methods have been proposed, none has truly solved the problem. It is thus easier to simply fit the full model. But with a lot of features, this becomes infeasible, and we will likely have to resort to doing some “choosing.” As I mentioned in response to another reader here today, one member of our team would like to try PR+LASSO.

Why did you use predictive accuracy instead of a continuous proper scoring rule? Frank Harrell kills a kitten every time someone does that, I heard.

Ha, ha! I just had an exchange on Twitter with Frank. He had claimed that logistic regression is NOT a classification tool, and someone else said it IS. I sided with the other guy, hopefully without indirectly causing any harm to kittens. 🙂

Kidding aside, I agree with Frank Harrell that if your algorithm predicts a probability, which is a continuous outcome, then you should use a continuous proper (or at least semi-proper) scoring rule to evaluate the performance of the model in comparison to other competing models. I believe this for all of the very good reasons that Harrell usually provides. Is the codebase for this project available in a GitHub repository somewhere? I would like to reproduce the analysis, but using proper and semi-proper scoring rules (possibly in addition to analysis of the confusion matrix, which I don’t think either your neural networks or logistic regressions are optimizing).

Our code is on GitHub. See the paper, and see my other comments here to day on details of the arguments used.

Using proportion of correct classification as a performance criterion is pretty standard, and corresponds to what ordinary people (e.g. consulting clients) want to see. They may also find the confusion matrix useful.

I agree with you somewhat about what ordinary people usually want to see, but your paper is about the comparison of the performance of different types of algorithms, and when maximizing over that set, you should use proper scoring rules not improper ones. There are also semi-proper scoring rules (e.g. concordance index and its relative, AUC) that are easy to explain to ordinary folk, so it is a good compromise. I have also successfully convinced executive-level people in industry that probability predictions are the right model output to look at in many business cases, and helped them understand how they can use those predictions to greater effect than a simple classification.

Anyway, I’ll re-read the sections of your paper that talk about polyreg. I was unaware that the package is not just for polynomial regression, but for comparison to neural networks as well! Sorry about that, I was focusing on the results in my first read.

Ah… it was in Appendix A the whole time!

I like the KISS principle (Keep It Simple, Stupid). I have a feeling I would be a harder sell for your sales pitch than your business clients. 🙂

I also agree with Frank Harrell that logistic regression is a probability prediction algorithm. I also agree with you that it can be used for classification, but disagree with anyone who says that models should be optimized with respect to a discontinuous improper scoring rule.

It’s not that clearcut. There are various duality properties that tie together the continuous and categorical aspects.

First of all, the logistic model is actually implied by the conditions for Fisher discriminant analysis — multivariate normal X, equal covariance matrices — i.e. a continuous model. I assume you would accept the fact that Fisher LDA is a model for classification, despite its continuous nature, right?

Second, if one’s goal is maximizing the overall probability of correct classification, then the mathematically optimal rule (e.g. 2-class case) is to guess Y = 1 or 0, depending on whether P(Y = 1 | X) > 0.5.

By these are in my book,

Statistical Regression and Classification: from Linear Models to Machine Learning,published by CRC last year. (Couldn’t resist putting in a plug. 🙂 )Again I disagree, both for the savvy B2B case and for the B2C case, although of course it depends on the product. Like a facial recognition app should just tell you whose face it most likely is, since faces have a very high signal-to-noise ratio. But say we’re talking about email spam. Sure you can have the spam classifier, but you can also have an email retrieval system whereby someone could search their junk folder for an email they lost with a given set of keywords, and the spam score could be used in the algorithm for retrieving the emails and sorting them, maybe into multiple buckets of measured spam risk. So then the user could start with low risk emails that got “mis-classified”, then if they don’t find it there, they could search deeper, or just give up. Another setting where probabilistic outcomes helps a lot of people is, weather forecasts. Sure you see the icon that calls for stormy weather, but almost everyone I know, and not just super-educated quants, also looks at the chance of rain and other continuous measures of inclement weather risk. Another scenario in which I’d much prefer probability prediction rather than classification is an algorithm that estimates the probability of recidivism or the probability of skipping bail. Using the probability, the justice system could estimate the cost to the state of taking a particular action. Another area where probability predictions could create better products than simple classifications is gaming apps.

(Sorry the above was supposed to be a reply to “I like the KISS principle (Keep It Simple, Stupid). I have a feeling I would be a harder sell for your sales pitch than your business clients.”)

Fine, but your arguments apply just as much to the regression case. Nothing special about classification.

“Second, if one’s goal is maximizing the overall probability of correct classification, then the mathematically optimal rule (e.g. 2-class case) is to guess Y = 1 or 0, depending on whether P(Y = 1 | X) > 0.5.”

What is that isn’t your goal?

Sorry, can’t parse your question?

One last thing! Since the results for your models of a continuous input use proper scoring rules, I definitely believe in those results, and this is a very interesting paper for me because it has huge implications for the use of Bayesian polynomial regression models with principled regularizing priors as substitutes for neural networks. I look forward to the sequels.

What is the interpretation of large accuracy values, those greater than 1? From the relevant explanation it seems _lower_ is better in those cases, but I’m curious to get a more absolute understanding of performance there. For the taxi problem in particular, how do these approaches compare to the Kaggle winners, and why not use the metric from the competition?

As noted in the text of the paper — it must be very boring prose, since so many people here seem to have avoided reading it 🙂 — some examples involve regression and others involve classification. In the former case, there is no bound.

We also note in the paper that some of the datasets are pretty dirty, and that no effort was made on our part to clean it. In the taxi data, for instance, there is one trip that claims to be of 3 days’ duration and another trip that claims the passengers were picked up in Antartica 🙂 . I don’t know how Kaggle and/or its users handle this.

I did indeed miss that bit; relabeling those columns “MAPE” might help others avoid this confusion.

For the taxi one, or any other for which prior art exists, it may also be worth the hassle of coding up the standard metric; Kaggle used RMSLE for that one (https://www.kaggle.com/c/nyc-taxi-trip-duration#evaluation). You could also submit your predictions and let them compute it for you!

Personally, I don’t like the S (and thus the R) in RMSLE.

It would be great if anyone out there were to run our examples with other parameter values etc. As I said earlier, any tuning parameter not mentioned had the default values, and we did no data cleaning. So it would be easy for anyone to replicate our experiments, say under other conditions.

Was the only tuning of the Keras models the shown variation in layer size, e.g. no change in dropout, no other layer arrangements, no variation in learning rate or batch size?

How did the training/test loss curves look, were all keras models overfit or was there underfitting in some cases? How does this compare to the balance struck by the polyreg models?

We really should have provided more detail. Basically, anything that was not mentioned in the tables took the default values. We’ll put this information in our revised paper, but if you would like to know sooner, just type

args(kerasformula)etc. We have not yet done further detailed analysis of the type you suggest.That makes sense; I see the defaults, but I suspect they aren’t appropriate for these problems.

If the deep models you’re comparing to are not well-balanced ones, would that affect your conclusions? E.g. if your MNIST model is overfit, it may be less surprising to observe severe and increasing VIF scores. I suspect that observation may not hold as strongly in a well-specified/well-trained deep model, even those well below SotA performance.

One of our team members is actually the author of

kerasformula, and did more extensive experiments. See the CrossFit and cancer examples. But once again, we welcome and encourage all of you out there to try some other configurations of the tuning parameters. That is one of the reasons we put our paper on arXiv.My own view is that many people have become overly-mesmerized by all the various bells and whistles that people have developed over the years for handling NNs. My gut feeling is that all that tweaking is at best moving one away from one bad local minimum to another (possibly better) local minimum.

Another problem is that many people are unaware of the limitations of cross-validation. If one separates the data into training and test sets, the latter set playing the role of new data, but then tries hundreds of configurations of the tuning parameters on that new data, it is NOT new data anymore. One really ought to split the data into THREE sets.

“Another problem is that many people are unaware of the limitations of cross-validation.” Amen, dude.

The number of predictors (sections 7.1.1 to 7.17) are: 16, 90, 8, 16, 54, 20. Only for two of the data sets they are > 20 but even those are relatively small compared to the intended use of NN. To be clear, NN is not supposed to be a good out of the box algorithm; there’s extensive research that bagging or boosting algorithms are. NNs are good at extracting information out of large number of noisy dependent variables. Think count of the pixels in an image (not the fake 16 geometric features “extracted” from the image), or words in dictionary, that are ORDERS of magnitude larger than the sample problems analyzed in the article. So, to begin with, the comparison with NN is used in cases that were not meant for NN.

But let’s go ahead and point some fair criticism to the paper:

1. Polynomial regressions assume quadratic penalty for errors, even for higher order cross terms. Shouldn’t there be less penalty for higher order coefficients? NN have way more flexible way of determining the loss function.

2. How do you deal with both categorical and continuous dependent variables. NN with multiple heads as well as custom loss functions works.

3. When I was learning tensor flow, I did a simple example with simulated data: X1, X2, X3 normally distributed N(0, 1). The true model has a CONDITIONAL dependency:

Y = X1+X2 if X3<0 else X1-X2 with noise added. All standard ML algorithms (SVM, forests, NN) learned this relation perfectly (i.e. out of sample prediction mse equal to noise variance) for large number of data points. Regression failed miserably: beta_1=1, beta_2=0, beta_3=? I didn’t try polynomial regression but I bet it won’t find the perfect out of sample prediction no matter what degree of polynomials you include.

4. There’s more to NN than weights and global minimization of loss. There’s batch training for auto correlated data, stochastic gradient descent for large number of dependent variables, convolution when there’s a relation between adjacent regressors, etc.

5. In section 6, MNIST is used for a criticism of NN regarding co-linearity of NN weights, but in the next section that claims PR is as effective as NN (if not better), there’s no mention of MNIST instead a 16 feature letter recognition data is used. Nice try, but if you are indeed serious about this research, I challenge you to beat the out of sample prediction error of 0.21% of MNIST using PR. https://en.wikipedia.org/wiki/MNIST_database#Classifiers

I tried to to use multivariate logistic regression on MNIST: 20% error rate, 100 times worse than NN.

6. For me this is the most important aspect of NN: Alpha go zero showed very strong evidence that when neural networks predict several things at once they are better than predicting each thing separately. In that case predicting the next move as well as the winner of the game is much stronger than predicting the next move and the winner separately. No regression, no boosting, no svm, no forests, NO OTHER algorithm has this property, which is a really BIG deal.

I look forward to the day when poly regression will beat alpha go zero. There's a famous Murphy law: every complicated problem has a simple solution. Which is wrong. 🙂

Your point that we have not yet explored large-p settings is correct (though now new, since we emphasize it in the paper, along with pointing out the need to solve some technical obstacles to going to large-p). Your claim that dimension reduction does not reduce p is obviously incorrect.

I really haven’t thought much about the case of vector-valued dependent variables, but I see no reason why PR shouldn’t do well there. Note that any classification problem is such a case, and a number of our examples were of that type.

I object to statements about NNs etc. along the lines of “the evidence shows…” Unless there is some theoretical justification for a claim, one is doing not much more than speculation. There are lots of examples in which boosting does NOT work out of the box, contrary to your statement. Our section on NNAEPR shows that theoretical justification for PR (an informal account, but it could be made mathematically rigorous).

What you say about penalty functions in PR is incorrect. The penalty only concerns prediction error, not errors in the coefficients. Frankly, such misinformation is common among ML people.

I’m pretty sure that PR would do fine on your contrived example. Why don’t you try it?

You’ve even got Murphy’s Law wrong; it’s “Anything that can go wrong will go wrong.”

There’s an entire collection of Murphy’s laws, not just the one you quote.

“I’m pretty sure that PR would do fine on your contrived example. Why don’t you try it?”

I tried it; I went up to fourth order both in the polynomial exponent and in the interactions. The model had 794 coefficients; I fit it against 1000 simulated data points. I used noise with sd = 0.03. The range of the residuals was +/- 1.

I also fit a linear model with X1 and X2 but with signum(X3) instead of X3 and with second-order interactions. The true data-generating process is in the space of models: Y = X1 + signum(X3)*X2 + err. The coefficients were correct up to noise and the range of residuals was +/- 0.1.

Thanks, I’ll take a look.

1. “I really haven’t thought much about the case of vector-valued dependent variables, but I see no reason why PR shouldn’t do well there”: If you use one-hot encoding for categorical variables (days of the week for example), then all polynomial regressors that contain the zero encoded variable will be zero, and higher order polynomials of 1 are idempotent, so it looks like your polynomial basis is not complete anymore.

2. “I object to statements about NNs etc. along the lines of “the evidence shows…” The reality exists even if there’s no rigorous theory for it, pretty much like gravity existed even before Newton formulated a theory for it. So, if alpha go zero beats the previous version 100-0 when you merge the two networks AND learn from scratch, I would say euphemistically that: “there is strong evidence that…” but in reality I mean: “let’s face it, it actually works”. As a matter of fact I see this as a statement of academic impotence: discarding experimental evidence because there’s no theoretical explanation for it.

If you read the paper, you will see that we recognize the idempotence issue in the software. In terms of the theory, actually it is even easier for this case. You are failing to consider the interaction terms.

The paper is being discussed at Hacker News https://news.ycombinator.com/item?id=17385383 .

Thanks for the link. Unfortunately, more people who comment without reading the paper.

Part of why NN are powerful is that they give valuable inductive biases. All of the “progress” in NN is not in deep densely connected networks, such as the ones you are looking at. What is so powerful is that non-trivial DAGs (such as convolutions, attention, highway/gating) bias certain functional forms during training. PR also bias certain forms, but it is unclear to me that you could write down a PR that replicates some complicated DAGs like WaveNet/etc. If we had an optimizer that didn’t need these biases the perhaps you could replace NNs with PRs, but by that token you can replace it with any universal approximator. Your results about collinearity for deeper layers also seems like an issue of overtraining/too many fitting parameters and are also common in computer vision. For resnet, for example, later layers tend to have weak and collinear activations (which is somewhat rectified by dense convnets). One technique that allows refinement of large networks is knowledge distillation (see “distilling the knowledge in a NN” Hinton, Vinyals, Dean), but I don’t see how it can be applied to PR. Your suggestion of PCA is generally not very good in practice since we most care about manifolds that are not linear projections.

Yes, we briefly address the idea of these more complex NNs in our paper. By the way, we’ll have a revision up in a couple of days, including an image classification example.

PCA is useful in most situations in which the relations between predictors and response is monotone; it need not be linear. What is more interesting is the nonmonotone cases; there, applying PCA to the second- or maybe third-degree polynomial terms should typically be enough, but again we won’t know until we try it out.

It seems that in quite a few of the regression test cases, PR 1 outperformed KF and DN. But isn’t PR1 simply linear regression? For example, table 4 (regression of engineer income), table 6 (million song year regression), table 8 (letter recognition), table 9 (taxi time regression).

It seems a little odd that chosen benchmark methods can’t match linear regression.

It reminds me of “extreme learning machines” where you take a (fixed) vector to vector random projection of the input data and pass that through non-linear activation functions. Then you train a linear readout layer to give the response wanted for the input. Basically then you get an associative memory.

Anyway an ordinary neural network layer can be viewed as an associative memory as well. And you see techniques like direct feedback alignment being used. Where you are creating an multilayer associative memory with error correction effects and avoidance of “crosstalk” between memories. Is this the same as a deep neural network trained by back-propagation? I would guess not.

You can have a look at my code for associative memory:

https://github.com/S6Regen/Associative-Memory-and-Self-Organizing-Maps-Experiments

Teuvo’s book is the only place I’ve seen repetition code error correction discussed in relation to associative memory:

https://archive.org/details/SelfOrganizationAndAssociativeMemory

I thought about your paper again when coming across the article below at kdnuggets, which is essentially an admission of this equivalence:

https://www.kdnuggets.com/2018/07/beginners-ask-how-many-hidden-layers-neurons-neural-networks.html

And the implied equivalence is more structurally precise. If you believe the logic of the article, a polynomial regression has to be equivalent to neural network, and the number of layers in the neural network is equal to the degree of the polynomial in the equivalent polynomial regression.

I don’t see how they are saying that.

Read it 1st to last page. Had been thinking this (in far hand wavier terms) myself for a while. Was thinking that feed forward nns might mostly be heuristic Taylor series in disguise. Would love to see the “could be made mathematically rigorous” part fleshed out so I can print it out, post it above my desk and point to it next time a colleague scoffs at my preference for GLM and simpler models for non specialized (eg cnn) problems. In any case as you say the c part of cnn is primarily dimensionality reduction with application beyond nn.

I am impressed at your dedication to responding to comments! I was wondering, will NNs essentially be equivalent in the long run to any ways of constructing functions such that you are doing ‘universal function approximation’? I am also curious as to what other ways besides neural networks there would be to maintain being able to universally approximate any function, or maybe if there are canonical resources for diving more into this specific topic out there?

Sorry for NOT being dedicated to replying! It is not a universal approx. theorem; our paper explains why not.

Hello, I’m wondering why you compared PR to NN with polynomial activation functions. I’m not too surprised that overall you get polynomial regression functions by using such activations… My intuition is that NN work as a series of linearities followed by non linearities. In fact, I remember having seen some (generalisation or expressivity, I don’t remember exactly) results specific to NN with non polynomial activation functions. In your experiments, what kind of activation functions did you use for NNs?

We will be releasing an updated and more details version of our arXiv paper soon, along with putting an updated version of polyreg on CRAN. Please bear with us until then.

I liked the paper. I think you could make the same argument about things like gradient boosting trees. Actually, what ReLu does is the same as a bifurcation in a decision tree. ReLu defines a region relative to the domain of the previous region in which downstream computations affect the solution. Hence, ReLu and gradient boosting etc all are doing exactly local polynomial fits, with order determined by depth of the network. I think it is the locality rather than the nonlinearity makes neural networks universal in a useful way.

Well, anything smooth is “local polynomial” in the approx. sense. I think you’ll find from a small example that it is not exact, though.

As to gradient boosting, we don’t cover it, as it doesn’t have the “mysterious black box” quality of NNs. Finding an asymptotic distribution should be doable, and I assume people have done so, but I don’t see a polynomial argument. With NNs, the polynomial comes naturally from the number of layers; there seems to be no obvious analog in GB.

Would you be willing to comment on the GMDH methods as compared to polynomial regression and the relevance of your work to the topic? GMDH sounds at least superficially related, but it seems that people who promote GMDH are somewhat walled off from the rest of the statistical modelling community, making it difficult for an outsider like me to judge if GMDH is to statistics as cold fusion is to physics, or if it’s a mundane thing made to sound profound, or if it’s something meriting serious further study.

Don’t know, hadn’t heard of it before you brought it up. Your “cold fusion” comment is spot on for the machine learning field 🙂 , in my opinion, and my first reaction is the GMDH may be the same, but I really need to lookinto this.

I have a comment about your heuristic argument in Chapter 6.1 “The Basic Argument”:

Taylor series for activation functions like tanh or the sigmoid have a finite radius of convergence (in fact: the radius is very small). So I cannot see why your argument with the Taylor series approximation holds true.

Could you please clarify.

For large samples, which are typical in NN usage, one will be within the radius of convergence. That’s how class stat asymptotic theory works. But your point probably does apply to non-convergene of the iteration process.

I like your paper.

The radius of convergence of tanh is pi/2 (for the sigmoid the radius is a bit bigger). With a reasonable low order polynomial a good approximation is only valid for the linear part of tanh, resp. the sigmoid. I get the impression that in this case your argument essentially justifies the approximation of a linear NN = linear regression with a polynomial regression. In the case of a highly non-linear NN (the activation is used where it is curved) it might need a more rigorous argument with lots of estimations and inequalities. Do you know of such a research?

The possible non-convergence of the iteration process is another issue.

I believe we understand each other and do not really disagree. There are two senses of convergence here, as we both have said, but they do interact, as I said.

If you look at standard asymptotic analysis, say for MLE, as the sample size n goes to infinity eventually theta-hat is close enough to theta so that a first-degree Taylor approximation is valid with high probability. This is true no matter what the radius of convergence of that Taylor series is, or for that matter even whether there are infinitely many derivatives. The same argument could be constructed for the poly NNs setting.

Thank you! Regards from Covid-free NZ.

I thank you back. I had previously argued with my coauthors that we don’t need a formal proof, but after this exchange with you, I now think otherwise.