Knowing Something vs. Knowing the Name of Something: Some Points about Causal Analysis

The famed physicist Richard Feynman once said, “I learned very early the difference between knowing the name of something and knowing something,” a lesson from his father. I think too often we in the statistics/machine learning field are guilty of “only knowing the name of something.” Well, in most cases, we may know a bit more than the name, but not as much as we should to be able to use the concept usefully.

As in most of my stat posts, I hope there is something here for everyone. Some of this will be new and thought provoking for those who are already familiar with causal analysis, but also useful as an introduction for those who don’t have such background.

The post is aimed at developing insight beyond “the name of something” in causal analysis (CA), a data science topic that is not new but has become much more prominent in recent years. As you will see, I am something of a skeptic on CA, and hope to dispel some common misunderstandings regarding it. Make no mistake–I do believe CA is a useful tool–but I hope here to demystify some of the ideas, and bring it down to Earth in terms of the extent of its value.

We will begin by dispelling a common myth about regression analysis (not solely about causal analysis, just ordinary stat, but strongly related). This is such a common myth that some readers will have trouble fully accepting it. How is that for a tantalizing lure?

A quick bit of notation:

Let m(t) = E(Y|X=t) denote the conditional mean of Y given X = t. This is the regression function of Y on X, a function of t in our notation here. X is a vector of predictors/features. If say X is height and Y is weight, then m(70) is the mean (population) weight among people of height 70. If X is height and age, m(70,28) is the mean weight among people of height 70 and age 28. Note that we are not necessarily assuming a linear model for m(t) here, though we often will.

Dispelling a very common myth:

It will be absolutely crucial here to have a firm understanding of what a regression model really does.

One often hears statements like, “We’d like to use a regression model to investigate the impact of family income on such-and-such, but our data is skewed toward the middle class or higher, so our estimated coefficients will be skewed as well.”

For instance, is a student’s performance on the SAT related to family income? We might try to investigate that by fitting a linear model in which we predict SAT by high school grades HGPA and family income FI. The coefficient of FI might give us an idea of how much, if any, income affects SAT scores.

But if in our dataset FI is skewed toward the higher incomes, many people would say such an analysis is invalid.

Actually, there is no validity problem. Let’s take a look at some simulated data.

m <- matrix(rnorm(10000*2),ncol=2)
x <- m[,1]
y <- m[,1] + m[,2]   # true regression function is m(t) = t
coef(lm(y ~ x))
# (Intercept)           x
# 0.002190729 1.003438820
y <- y + 2   # shift both X and Y to the right
x <- x + 2
coef(lm(y ~ x))
#  (Intercept)            x
# -0.004686912  1.003438820

What happened here? We simulated a simple example in which m(t) = t, and of course the estimated slope and intercept came out close to 1 and 0. Then we shifted the data to the right, and sure enough, got essentially the same answer.

How about this?

m <- matrix(rnorm(10000*2),ncol=2)
x <- m[,1]
y <- m[,1] + m[,2]
whichBig <- which(x > 1.5)   # restrict to the cases with X > 1.5
x <- x[whichBig]
y <- y[whichBig]
coef(lm(y ~ x))
# (Intercept)         x
#   0.2525674 0.8474718
m <- matrix(rnorm(100000*2),ncol=2)   # now a much larger sample
x <- m[,1]
y <- m[,1] + m[,2]
whichBig <- which(x > 1.5)
x <- x[whichBig]
y <- y[whichBig]
coef(lm(y ~ x))
# (Intercept)           x
# -0.06571407  1.04364323

Here we said, let's look at only part of our data, the part in which X > 1.5. Do the estimated coefficients still come out to be about 1 and 0? At first they did not, but because I knew they “should” be about 1 and 0, I suspected a sample size issue. I thus tried a much larger sample, and sure enough, we did indeed get approximately 1 and 0.

To be sure, if we fit a linear, logistic or other parametric model, the above points depend on the model being correct (or nearly so; no parametric model is exact in practice). To eliminate this issue, let's look at nonparametric methods, which make no such model assumption. Let's try k-Nearest Neighbors (kNN).

library(qeML)
data(svcensus)
names(svcensus)
# [1] "age" "educ" "occ" "wageinc" "wkswrkd" "gender"
median(svcensus$age)
# [1] 38.21515
# a young person for our 't'
t0 <- data.frame(age=28,educ='16',occ='102',wkswrkd=52,gender='male')
# estimated m(t), using k-Nearest Neighbors
knnOut <- qeKNN(svcensus,'wageinc',holdout=NULL)
predict(knnOut,t0)
# [,1]
# [1,] 74068
# skew the data toward older
svc1 <- svcensus
which25plus <- which(svc1$age > 25)
svc1 <- svc1[which25plus,]
knnOut1 <- qeKNN(svc1,'wageinc',holdout=NULL)
predict(knnOut1,t0)
# [,1]
# [1,] 74068

No change! The skewing had no effect.

What is going on here? Mathematically, if t is in A, then

E(Y | X = t, X in A) = E(Y | X = t)

since X = t is more restrictive than X in A.

Bottom line: The common belief that “the distribution of X is skewed, so our linear regression analysis is distorted” is incorrect. True, as is always the case with linear models, one must check for linearity (see below) and pay attention to sample size (via a look at the estimated coefficients’ standard errors), but there is nothing inherently distortionary about a skewed X distribution.

A specifically causal-analytic example of this idea:

Think of a university and its admissions policy, under which a student is admitted if either their SAT score or high school grade point average HGPA are above a certain threshold, s and h, respectively. Suppose we are interested in the individual effects of SAT and HGPA on university grade point average UGPA, and are considering using a linear model.

Causal analysis enthusiasts who are reading this would immediately recognize this situation as involving a collider. This is taken as a danger signal, for a good reason in one particular sense to be described below, but on the other hand it does NOT invalidate our analysis of the individual effects of SAT and HGPA.

We are modeling E(UGPA | SAT=t1 and HGPA=t2), i.e. Y = UGPA and X = (X1,X2) = (SAT,HGPA). The structure of the admissions process is such that our data has been restricted to the set {X1 >= s or X2 >= h}. Due to the considerations discussed above, we see that this restriction does NOT change our estimate of m(t), i.e. our estimated regression coefficients in a linear model. Our ordinary linear model is still valid.

On the other hand, the situation is different if the object of our interest involves the distribution of X itself. For example, we may be interested in the correlation between X1 and X2 here. This should be substantially positive in the general population, but will probably be weaker in the subpopulation of those admitted to the university.

Comparing classical and causal analyses:

Do coaching schools increase a student’s SAT test score? Much has been written on this issue pro and con, but here we will be interested less in the answer than the types of analysis that could lead to an answer.

Classical approach: regression analysis

Let’s again consider a linear model, for simplicity, using HGPA as our explanatory variable, in addition to C, an indicator variable for whether the student has had coaching. One concern is that only the more financially well-off families can afford coaching, so let’s use family income FI as another predictor. (We could of course consider some other factors as well, but let’s keep things simple.)

Our linear model then would be

E(S | HGPA, C, FI) = b0 + b1 HGPA + b2 C + b3 FI

where C is 1 or 0, depending on whether the student had coaching.

This would be the classical approach to assessing the effect of coaching. The quantity b2 represents the average extra points on the SAT, for students of fixed grades and family income. Comparing two students who are identical in grades and income, but one having the coaching and the other not, b2 is the expected difference in SAT. It would be the centerpiece of our analysis of the effects of coaching.

This is the traditional notion of ceteris paribus, “everything else being equal.” We might consider adding interaction terms (see below), but traditionally this would be the standard approach.
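In R, this is just an ordinary lm call. Here is a minimal sketch, assuming a hypothetical data frame satData with columns SAT, HGPA, C (the 0/1 coaching indicator) and FI:

lmOut <- lm(SAT ~ HGPA + C + FI, data=satData)
summary(lmOut)    # the estimated b2 is the coefficient of C
coef(lmOut)['C']  # expected SAT gain from coaching, with HGPA and FI held fixed
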

This model also allows one to investigate whether family income is a factor in SAT scores (directly, rather than through C). Studies on this have gone both ways, and my own analysis has suggested that the effect is small, but in any case, the above model could be used for such analysis.

What about interaction terms? Studies indicate that the higher-income kids are more likely to have been reading for pleasure since an early age. They thus have good reading comprehension and vocabulary, so the SAT coaching is not very useful to them. But the lower-income kids, perhaps reading less for pleasure, might benefit substantially from the coaching.

We would thus add an interaction term:

E(S | HGPA, C, FI) = b0 + b1 HGPA + b2 C + b3 FI + b4 C x FI

The above scenario would be reflected in an estimated value of b4 that is negative: for coached students (C = 1), the coaching effect b2 + b4 FI shrinks as family income FI grows.

Note that now there is no single effect for the coaching, but rather one effect for each income level. A similar issue would arise if we forego a linear model, and turn to some nonparametric regression method, say kNN. This would involve estimating m(t) for various levels of t = (HGPA,C,FI) deemed of interest, and then comparing those estimates.
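Here is a sketch of the interaction model, again using the hypothetical satData; the C:FI term corresponds to b4, and the estimated coaching effect at a given income level fi0 is then b2 + b4 fi0:

lmOut2 <- lm(SAT ~ HGPA + C + FI + C:FI, data=satData)
b <- coef(lmOut2)
fi0 <- 50000                # some income level of interest
b['C'] + b['C:FI'] * fi0    # estimated coaching effect at income fi0
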

A causal approach: covariate matching

Note first, once again, that in the above linear model, it is NOT a problem if, say, our sample data skews toward higher incomes. As with any regression problem, we do have to consider sample size and assess linearity, but there is nothing inherently problematic about the skewed X distribution.

What saves us here? The answer is that we are taking covariates, HGPA and FI, into account. Raw comparison of SATs of the coached and noncoached groups could be very misleading in this setting, because the coached group might be richer and thus have, e.g., more experience reading for pleasure. It then would misleadingly look like the coaching led to higher SATs, rather than the difference being due to differential backgrounds between the richer and poorer students.

Well, we can account for differences between covariates for the coached and noncoached groups in another way, by reconfiguring the data points in the two groups in such a way that the data is now balanced. Many ways have been devised for this, and a number of R packages have been developed along these lines.

For example, consider the function Matching::Match(y,treat,x). In our SAT example, y would be SAT, treat would be C, and x would be (HGPA,FI). The function forms pairs of students from the coached and noncoached groups, where in each pair the coached and noncoached students have the same or similar values of x. Those students who are selected (some may not be chosen) now give us two balanced groups, one coached and the other noncoached, who are similar in all respects other than coaching. For instance, the mean HGPA will be similar in the coached and noncoached groups, and so on. (After forming the groups, the pairing no longer is used.) We can now evaluate coaching fairly.
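Here is a minimal sketch for our hypothetical satData (as I recall, Match’s arguments are named Y, Tr and X):

library(Matching)
xMat <- as.matrix(satData[,c('HGPA','FI')])
mOut <- Match(Y=satData$SAT, Tr=satData$C, X=xMat)
summary(mOut)   # mOut$est is the estimated effect of coaching
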

One drawback to covariate matching is that the larger the dimension of the X vector, the harder it is to do accurate covariate matching. An alternative is to match on just one summary quantity, the propensity score, which is the probability that the treatment will be “assigned.” In our SAT coaching example, it could be the probability that the student had coaching, given his/her family income. We can compute this using a logistic model, say, or a nonparametric technique such as kNN.

We then match on the propensity score alone; the third argument in Match would be that score, rather than the covariate vector X. The thinking is that by matching people with score 0.62, say, one of them who had coaching and one who didn’t, we’ve accounted for covariates.
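A sketch of that two-step process, again on the hypothetical satData: fit a logistic model for the probability of coaching, then hand the fitted probabilities to Match in place of the covariate matrix.

psModel <- glm(C ~ HGPA + FI, data=satData, family=binomial)
pscore <- fitted(psModel)    # estimated P(coached | HGPA, FI)
mOutPS <- Match(Y=satData$SAT, Tr=satData$C, X=pscore)
summary(mOutPS)
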

A key point: the regression and matching approaches are fundamentally incomparable

Once again, the regression approach analyzes E(Y | X), a conditional mean, unaffected by the distribution of X. By contrast, covariate matching estimates E(Y), an unconditional mean, which is very strongly related to the distribution of X. If, in the coaching example above, the distribution of X is skewed toward the more well-off families, our analysis of the effect of coaching will still be fair, but the magnitude of the effect will tell us mainly how well coaching works among well-off families. Note that the same concern arises if the original distribution of X was not skewed but the data points that survive the selection process are skewed.

Directed Acyclic Graphs (DAGs):

Much of the surge in the use of causal analysis in recent years can be attributed to the attractiveness of DAGs; we all like pictures. Yet DAGs are a prime example of the importance of “knowing about something rather than just knowing the name of something.”

There are some extremely important points to keep in mind about DAGs. First, the word causal should not be taken literally; there is nothing in the framework about actual causation.

Second, DAGs cannot be generated by fitting from data; techniques for causal discovery require very strong assumptions. One can assess whether a DAG fits the data well, but not much more.

Third, DAGs are not unique. For any given set of probabilities, many different DAGs might fit those numbers equally well. Carnegie Mellon University statistics professor Cosma Shalizi has commented that it is highly misleading, indeed unethical, for a researcher to present a DAG without presenting other very different DAGs that are nevertheless mathematically equivalent.

Is causal analysis worth the extra trouble? Why not just use a linear regression model?

Linear models are very attractive here, easy to run and understand. One can add meaningful interaction terms, and if linearity is in question, we can try adding polynomial terms.

Another causal technique, structural equation models (SEMs), runs a series of regression models, in which not only is Y expressed in terms of the components of X, but also those components are expressed in terms of each other. Here again, we need not worry about skewing of distributions, providing we verify our linearity assumptions.

On the other hand, SEMs require that we have the DAG relating all the variables, and this cannot be determined from data; it essentially must be set from domain expertise in the field of study.

Why m(t)?

The above discussion regarding the irrelevance of whether X has a skewed distribution in some direction assumes that the quantity of interest is m(t) = E(Y | X=t), the regression function. Let’s take a closer look at this.

From a simple application of calculus, we know that the function f that minimizes the quantity E[(Y – f(X))^2 | X = t] is m. So basically any prediction method that minimizes a sum of squares will take the form of a conditional mean. That includes linear models, of course, and also neural networks.

Note that the predicted value for a new case having X = t is the regression function value, m(t). Thus m is optimal in the sense of minimizing Mean Squared Prediction Error (at the population level).

Some predictive methods directly estimate m(t), such as k-NN and random forests.

What about predicting a dichotomous Y? Code Y to be 1 or 0. Then m(t) reduces to P(Y = 1 | X=t), and one can predict 1 or 0 depending on whether this quantity is greater or less than 0.5. (Or use another threshold for unequal error costs.) It can be shown that in this way, m(t) minimizes Overall Misclassification Error.
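A quick simulated illustration of that threshold idea (the logistic form chosen for the true P(Y = 1 | X = t) here is just for convenience):

n <- 10000
x <- rnorm(n)
y <- rbinom(n,1,plogis(x))    # true P(Y=1|X=x) is the logistic function of x
glmOut <- glm(y ~ x, family=binomial)
prob <- predict(glmOut, data.frame(x=0.5), type='response')   # estimated m(0.5)
as.numeric(prob > 0.5)        # predicted class at x = 0.5
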

What about L1 models, say quantile regression? Here we minimize

E(|Y – f(X)| | X = t),

which has the solution f(t) = conditional median of Y given X = t. The same argument then goes through as before:

E[|Y – f(X)| | X = t, X in A] = E[|Y – f(X)| | X = t]

for the same reasons as before.

The same analysis works for any nonnegative loss function.

Assessing linearity:

One way to visually assess linearity (I frown on hypothesis testing in general, especially Goodness of Fit testing) is to fit both a linear model and a nonparametric one, then plot the predictions of the former against those of the latter. Here’s an example:

library(qeML)
data(mlb1)
mlb <- mlb1[,-1]  # height, weight, age
# predict weight from height and age, first using kNN, then a linear model
knnout <- qeKNN(mlb,'Weight',holdout=NULL)
lmout <- qeLin(mlb,'Weight',holdout=NULL)
lmpreds <- predict(lmout,mlb[,-2])
plot(knnout$regests,lmpreds)

Looks pretty good overall, only a handful of outliers.
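To make any departures easier to spot, one might add a 45-degree reference line to that plot:

abline(0,1)   # points near this line mean the linear model tracks the kNN fit well
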

Torch for R Now in the qeML Package

I’ve added a new function, qeNeuralTorch, to the qeML package, as an alternative to the package’s qeNeural. It is experimental at this point, but usable, and I urge everyone to try it out. In this post, I will (a) state why I felt it desirable to add such a function, (b) show a couple of examples, (c) explain how the function works, thereby giving you an introduction to Torch, and finally (d) explain what at this point appears to be a fundamental problem with Torch.

If you need an introduction to neural networks, see the ML overview vignette in qeML.

Why an alternative to qeNeural?

The qeNeural function is a wrapper for regtools::krsFit, which is based on the R keras package. In turn, that calls functions in the tensorflow package, which calls the original Python version of the same name.

Though there has been much progress in interfacing R to Python, the interface is difficult to set up, especially in terms of the environment variables. That makes the “Torch for R” package, torch, highly attractive, as it contains no Python. Torch itself was developed (in Python) as a clearer alternative to Tensorflow, and is now probably more popular than the latter. So, it’s nice that a version for R has been produced, especially one that is Python-free.

Another advantage is easy access to GPUs, notably Mac GPUs, which are generally problematic. There is a companion R package, luz, that can be used to speed up Torch computation.

Examples

Here we use the svcensus data from qeML.

lyrsReg <- list(
   list('linear',0,100),
   list('relu'),
   list('linear',100,100),
   list('relu'),
   list('linear',100,1))

z <- qeNeuralTorch(svcensus,'wageinc',layers=lyrsReg,
   learnRate=0.05)

The call format is standard qeML. We need to specify the network structure via its layers, hence an argument of that name. (The qeNeural function does this a little differently.) We see a linear, i.e. fully-connected, layer of “0” inputs and 100 outputs, then a ReLU activation function, then another hidden layer and activation, and finally another linear layer with 1 output, our predicted values. The “0” is filled in at runtime with the number of features.
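Prediction then follows the usual qeML pattern; a quick sketch (recall from the earlier output of names(svcensus) that wageinc is column 4, so we drop it from the new case):

predict(z, svcensus[1,-4])   # predicted wage income for the first case
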

The above is for regression problems; here is code for classification problems:

lyrsClass <- list(
   list('linear',0,100),
   list('relu'),
   list('linear',100,100),
   list('relu'),
   list('linear',100,1),
   list('sigmoid'))

z <- qeNeuralTorch(svcensus,'gender',yesYVal='male',
   layers=lyrsClass,learnRate=0.003)

That last entry in the layers formation squeezes the result to the interval (0,1), to make probabilities. Note that that list-of-lists code is just defining the network, not creating it.

How is Torch used in qeNeuralTorch?

One still must work with tensors, which are essentially multidimensional arrays. So there are lines in the function like

xT <- torch_tensor(x)

where the matrix of features is converted to a tensor. The major work, though, is done in first setting up the network, and then running the data through it in an iterative manner. Note that one does not need to be aware of any of this to use qeNeuralTorch, but as usual, understanding the innards of a function sharpens one’s ability to use it well.

The network formation code within qeNeuralTorch works on the above list of lists to form nnSeqArgs, which again simply defines the network for Torch. Then the network is created:

model <- do.call(nn_sequential,nnSeqArgs)

A side note is that torch::nn_sequential has the formal argument ‘…’. That is no problem for the ordinary author of Torch code; if they have, say, 4 arguments in their particular application, they just state them as arguments in a call to nn_sequential. But qeNeuralTorch must allow for a variable number of network layers, hence the use of R’s do.call.

Here is the code that runs the network:

for (i in 1:nEpochs) {
   preds <- model(xT)
   loss <- nnf_mse_loss(preds,yT,reduction = "sum")
   optimizer$zero_grad()
   loss$backward()
   optimizer$step()
}

First, our feature data xT is run through the network, which has been initialized with random weights. That produces predictions, preds. The current L2 loss is then computed, then the gradient of the loss is determined and the weights are updated. This goes through the number of iterations specified by the user, nEpochs.
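For context, here is a sketch of the kind of setup that precedes such a loop (names such as learnRate are those used above; treat this as illustrative rather than the exact qeNeuralTorch code):

xT <- torch_tensor(x)                    # feature matrix as a tensor
yT <- torch_tensor(matrix(y,ncol=1))     # Y as an n x 1 tensor
optimizer <- optim_adam(model$parameters, lr=learnRate)
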

Our implementation is rather primitive; we use that same L2 loss function even in the classification case (actually, this can be justified), and for now we are limited to the 2-class case.

Torch for R uses R6 class structure, rather different from the more common S3 and S4. An example above is the line

loss$backward()

Here loss is an object, mainly containing the current loss value but also containing a function backward. The latter is called on the former, as those who’ve used, e.g., Python or Java will recognize.

Again, you need not know this in order to use qeNeuralTorch.

Performance

The package seems to be very sensitive to the learning rate.

Also, it turns out, at least in my implementation, that Torch’s accuracy is generally much weaker than that of other qeML functions in regression cases, but similar in classification cases.

I surmised that this was due to Torch producing overly large weights, and investigated by comparing the L2 norms of its predictions with those of other functions. Sure enough, Torch was producing larger predictions.

Torch has a parameter to control this via regularization, weight_decay. However, this did not appear to help.

My current guess–maybe someone who reads this will have suggestions–is that since ML applications tend more toward the classification case, the problem of large weights never really arose. Having predictions that are too extreme may not hurt in classification problems, as this simply brings them closer to 0 or 1 when run through a sigmoid function or similar. Since qeNeuralTorch rounds the sigmoid output to 0 or 1 to produce class predictions, it all may work out well in the end.

Note that this also means one should be cautious if one takes the unrounded output of the network to be the actual probabilities.

Quantile Regression with Random Forests

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

My goals in this post, as in the previous one, are to introduce the capabilities of qeML and to point out some general ML issues. The key example of the latter here is the fact that leaves in an RF tree are very similar to neighborhoods in k-NN, which implies that in principle one should be able to do QR in an RFs context, just as we did last time with k-NN.

However, as the saying goes, “Easier said than done.” What was key in the kNN case last time was that the qeKNN function argument smoothingFtn gives the user access to the neighborhoods, in that it allows the user to specify a function that performs a user-requested operation in each neighborhood; smoothingFtn offers a local-linear option, for instance, and in the last post I showed how one could achieve QR via a user-written function.

The situation for RFs is not so simple. The problem is that typical RF software does not provide “hooks” directly analogous to smoothingFtn. Some implementations do provide some useful hooks that could play a role, such as randomForest::getTree, but putting them together for the desired result may not be easy, especially given ambiguities in the documentation.

Fortunately, the grf package includes a QR capability. The qeML function qeRFgrf originally wrapped the “ordinary” and local linear options in grf, and I’ve now added QR in v.1.2.

The name ‘grf’ stands for “Generalized Random Forests,” the main generalization being similar in spirit to smoothingFtn, i.e. allowing functions other than the mean to be applied to the data in the leaves. A second generalization is to tailor the node-splitting process to the type of smoothing done in the leaves.

In particular, grf includes the function quantile_forest, providing just what our reader inquired about. One specifies the quantiles of interest in an argument quantiles, and later calls the paired predict function to obtain the estimated quantiles of “Y” at requested values of the “X” variables.
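For readers who want to call grf directly, the pattern is roughly as follows, where X and Y are a hypothetical numeric feature matrix and outcome vector, and Xnew holds the new cases:

library(grf)
qf <- quantile_forest(X, Y, quantiles=c(0.2,0.4,0.6,0.8))
qpreds <- predict(qf, Xnew, quantiles=c(0.2,0.4,0.6,0.8))
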

The qeML package has an interface to grf, as the function qeRFgrf. To access the QR option (qeML v.1.2), set the qeRFgrf argument quantls to a nonnull value. Here is an example using the North American major league baseball players data (included in qeML with the permission of the UCLA Stat Dept.). We find the 20th, 40th, 60th and 80th percentiles of weight, for each height.

library(qeML)
data(mlb1)

z <- qeRFgrf(mlb1[,2:3],'Weight',quantls=c(0.2,0.4,0.6,0.8),holdout=NULL)
w <- predict(z,mlb1[,2,drop=F])
df1 <- data.frame(x=mlb1[,2,drop=F],y=w[,1],z='0.20')
df2 <- data.frame(x=mlb1[,2,drop=F],y=w[,2],z='0.40')
df3 <- data.frame(x=mlb1[,2,drop=F],y=w[,3],z='0.60')
df4 <- data.frame(x=mlb1[,2,drop=F],y=w[,4],z='0.80')
dfall <- rbind(df1,df2,df3,df4)
qeML:::qePlotCurves(dfall,xlab='ht',ylab='wt')

The convenience function qePlotCurves is essentially the code I used in the previous post, now added to v.1.2.

I highly recommend the grf package. My attention was immediately drawn to it when it first came out, as I was pleased to see that I could now do analysis in RFs using non-mean smoothing, as I had been doing with qeKNN. It was written by some top researchers, who also developed the supporting theory.

qeML Example: Nonparametric Quantile Regression

In this post, I will first introduce the concept of quantile regression (QR), a powerful technique that is rarely taught in stat courses. I’ll give an example from the quantreg package, and then will show how qeML can be used to do model-free QR estimation. Along the way, I will also illustrate the use of closures in R.

Notation: We are predicting a scalar Y (including the case of dummy/one-hot variables) from a feature vector X.

In its simplest form, QR estimates the conditional median of Y given X, as opposed to the usual conditional mean, using a linear model. As we all know, the median is less affected by outliers than is the mean, so QR is giving us outlier robustness. As a bonus, we dispense with the homoskedasticity assumption, i.e. constant Var(Y|X).

But it’s more than that. We can model any conditional quantile, e.g. estimate the 80th percentile weight for each human height. Quantile analysis has a variety of applications.

One can conduct QR in R with the quantreg package, written by Prof. Roger Koenker, one of the major names in the QR field. Here is an example, using the qeML dataset mlb:

> data(mlb)
> library(quantreg)
> z <- rq(Weight ~ Height,data=mlb,tau=0.80)
> summary(z)

Call: rq(formula = Weight ~ Height, tau = 0.8, data = mlb)

tau: [1] 0.8

Coefficients:
             Value       Std. Error  t value    Pr(>|t|)
(Intercept)  -201.66667   17.58797   -11.46617  0.00000
Height          5.66667    0.23856    23.75376  0.00000

As you can see, the call form here is like that of the R linear model function lm, and we could have had multiple predictors, e.g. age in addition to height.

But what if we don’t believe a linear model is appropriate? Of course, as usual we may consider adding polynomial terms, and there is also a package quantreg.nonpar. But we obtain model-free estimates easily using qeKNN in qeML.

Standard k-Nearest Neighbors estimation is simple. Say we wish to predict the weight of someone 70 inches tall and 28 years old: we find the k closest data points in our training data to the vector (70,28). We then compute the mean weight among those k people, and that is our predicted weight for the new person, who has known height and age but unknown weight.

But qeKNN offers the user more flexibility, via an argument smoothingFtn. Instead of computing mean Y among the neighbors, we can specify the median, or even specify that a small linear model be fit to the neighboring data. The latter may be useful if the new person to be predicted is either very short or very tall, as things tend to be biased near the edges of a dataset. If the new person is 77 inches tall, most or all people in our neighboring data will be shorter than this, thus lighter, so our prediction based on the mean will be biased downward.

But we can also specify our own smoothingFtn, perfect for QR. We simply define a function that gives us the desired Y quantile among the neighbors.

The call form is

smoothingFtn(nearIdxs,x,y,predpt)

Here x and y are our X and Y training data, predpt is the new X value at which we wish to predict (redundant in most cases), and nearIdxs are the indices in x and y of the nearest neighbors to predpt. Note that at the time kNN calls smoothingFtn, the indices have already been computed.

Our code is then

sftn <- function(nearIdxs,x,y,predpt)
{
nearYs <- y[nearIdxs]
quantile(nearYs,0.80)
}

u <- mlb[c('Height','Age','Weight')]
set.seed(9999) # qeML ftns do random holdout
z <- qeKNN(u,'Weight',smoothingFtn=sftn)
predict(z,c(70,28)) # prints 200

It would be nice, though, to run this for a general quantile level q, rather than the special case 0.80. But we can’t do that directly, because the smoothingFtn argument to qeKNN must be a function object, no provision there for an argument to smoothingFtn. But we can accomplish what we want via R closures.

makeSmFtn <- function(q) function(nearIdxs,x,y,predpt) quantile(y[nearIdxs],q)

To understand this, one must first know more about the R reserved word function. Consider this simple example:

f <- function(x) x^2

Here we are saying, “R, please create a function for me. Its formal argument will be named x, and it will compute and return the square of that quantity. After you create that function–an object, just like other R entities–assign it to f.” In other words, function creates functions. As I like to tell my students,

The function of the function named function is to create functions!

Now, going back to makeSmFtn above, it creates and returns a function object (whose body is the call to quantile), just as with f above, but the key point is that here the value of q will be “baked in” to that object.

So our call to qeKNN for general q would be

z <- qeKNN(u,'Weight',smoothingFtn=makeSmFtn(q))
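
For instance, to estimate the 90th percentile instead:

z90 <- qeKNN(u,'Weight',smoothingFtn=makeSmFtn(0.90))
predict(z90,c(70,28))   # estimated 90th percentile weight at height 70, age 28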

A Comparison of Several qeML Predictive Methods

Is machine learning overrated, with traditional methods being underrated these days? Yes, ML has had some celebrated successes, but these have come after huge amounts of effort, and it’s possible that similar effort with traditional methods may have produced similar results.

A related issue concerns the type of data. Hard core MLers tend to divide applications into tabular and nontabular data. The former consists of the classical observations in rows, variables in columns format, while the latter means image processing, NLP and the like. The MLers’ prescription: use XGBoost for tabular data, deep learning (in some form) for the nontabular apps.

In this post, I’ll compare the performance of four predictive methods on a year 2000 census dataset svcensus, included with the qeML package. The comparison is just for illustration, not comprehensive by any means. For each method, I vary only one hyperparameter: k for qeKNN; minimum leaf size for qeRFranger; polynomial degree for qePolyLin; and max tree depth for qeXGBoost.

I used only the first 5000 rows of the dataset. There were 5 predictors, becoming 10 after categoricals were converted (internally) to dummies. The variable being predicted is wage income. For each hyperparameter value, 100 runs were done. (Recall: by default, qeML predictive functions form a random holdout set.)

15 hyperparameter values were tried for each method. In the case of qeKNN and qeRFranger, these were seq(5,75,5). For the other two methods, the values were 1:15.
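As a rough sketch of how one such experiment might be run (shown here for kNN; the k argument of qeKNN and the testAcc component of its return value, the mean absolute prediction error on the random holdout set, are assumed):

data(svcensus)
svc <- svcensus[1:5000,]
kVals <- seq(5,75,5)
meanMAPE <- sapply(kVals, function(k)
   mean(replicate(100, qeKNN(svc,'wageinc',k=k)$testAcc)))
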

Here are the results:

The winner here turned out to be good ol’ polynomial regression. Obviously overfitting occurs, but somewhat surprisingly, only after degree 6 or so. Random forests seems to have leveled off at a minimum leaf size of 60 or so. All might do better by tweaking other default hyperparameters, especially KNN.

Of course, all the methods here could be considered traditional statistical methods, as all had significant statistician contribution. But only polynomial regression is truly traditional, and it’s interesting to see that it prevailed in this case.

From now on, I plan to make code for my blog posts available in my qeML repo, https://github.com/matloff/qeML in the inst/blogs subdirectory. So, readers can conveniently try their own experiments.

The “Secret Sauce” Used in Many qeML Functions

In writing an R package, it is often useful to build up some function call in string form, then “execute” the string. To give a really simple example:

> s <- '1+1'
> eval(parse(text=s))
[1] 2

Quite a lot of trouble to go to just to find that 1+1 = 2? Yes, but this trick can be extremely useful, as we’ll see here.

data(svcensus)
z <- qePCA(svcensus,'wageinc','qeKNN',pcaProp=0.5)

This says, “Apply Principal Component Analysis to the ‘svcensus’ data, with enough PCs to get 0.5 of the total variance. Then do k-Nearest Neighbor Analysis, fitting qeKNN to the PCs to predict wage income.”

So we are invoking a qeML function that the user requested, on the user’s requested data. Fine, but the requested qeML function is called using its default values, in this case k = 25, the number of neighbors. What if we want a different value of k, say 50? We can run

z <- qePCA(svcensus,'wageinc','qeKNN',pcaProp=0.5,
   opts=list(k=50))

But how does the internal code of qePCA handle this? Here is where eval comes in. Typing qePCA at the R > prompt shows us the code, two key lines of which are

cmd <- buildQEcall(qeName,"newData",yName, 
   opts=opts,holdout=holdout)
qeOut <- eval(parse(text = cmd))

So qeML:::buildQEcall builds up the full call to the user-requested function, including optional arguments, in a character string. Then we use eval and parse to execute the string. Inserting a call in qePCA to browser (not shown), we can take a look at that string:

Browse[1]> cmd
[1] "qeKNN(data = newData,yName=\"wageinc\",holdout = 1000,k=50))"

So qeML:::buildQEcall pieced together the call to the user’s requested function, qeKNN, on the user’s requested data–and with the user’s requested optional arguments. In handling the latter, it among other things called names(opts), which got the argument names, ‘k’ here, in string form, exactly what we need.
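A toy illustration of the same pattern, with a hypothetical opts list (not the actual buildQEcall code):

opts <- list(k=50)
optsStr <- paste(names(opts), unlist(opts), sep='=', collapse=',')
cmd <- paste0("qeKNN(data=newData, yName='wageinc', ", optsStr, ")")
cmd   # "qeKNN(data=newData, yName='wageinc', k=50)"
# qeOut <- eval(parse(text=cmd))   # would execute it, given newData
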

Note too R’s call function, which similarly creates a function call. (Yes, call produces a call, just like function produces a function.) E.g.,

> sqrt(2)
[1] 1.414214
> sqcall <- call('sqrt',2)
> class(sqcall)
[1] "call"
> eval(sqcall)
[1] 1.414214

All of this is an example of R’s metaprogramming capabilities–code that produces code–a really cool feature of R that computer science people tend to be ignorant of. Next time you encounter a CS Pythonista who dismisses R as “not a real language,” reply “R is built on functional programming and OOP models, with powerful metaprogramming facilities…”

qeML Example: Issues of Overfitting, Dimension Reduction Etc.

What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?

Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.

Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.

The call qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10) predicts trip time using qeML's linear model. (The latter wraps lm, but adds some things and sets the standard qeML call form.) Since the test set is random (as is our data), we’ll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.

> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
         full trip_distance  PULocationID  DOLocationID     DayOfWeek
     238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.

But let’s take a closer look. There were 224 pickup locations. (Run levels(nyctaxi$PULocationID) to see this.) That’s 223 dummy (“one-hot”) variables; are some more predictive than others? To explore that with qeLeaveOut1Var, we could make the dummies explicit, so that each dummy can be removed one at a time:

nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)

This function is actually from the regtools package, included in qeML. Then we could try, say,

nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)

But with so many dummies, this would take a long time to run. We could instead directly look at mean trip times for each pickup location, to get at least some idea of their individual predictive power:

tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.

> dim(nyct)
[1] 10000  479

An old rule of thumb is to use under sqrt(n) variables, 100 here. Just a guide, but much less than 479. (Note: Even our analysis using the original factors still converts to dummies internally; nyctaxi has 4 columns, but lm will expand them as in nyct.)

We may wish to delete pickup location entirely. Or possibly use PCA for dimension reduction:

z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)

This qeML call says, “Compute PCA on the predictors, retaining enough of them for 0.75 of the total variance, and then run qeLin on the resulting PCs.”

But…remember those warning messages? Running warnings() we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the latter, thus impossible to predict. So, many of the columns in the training set are all 0s, thus 0 variance, thus problems with PCA. We then might run qeML::constCols to find out which columns have 0 variance, then delete those, and try qePCA again.
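A hand-rolled version of that last idea, applied to the full dummy-coded data frame nyct from above (qeML::constCols serves the same purpose):

xCols <- setdiff(names(nyct), 'tripTime')
zeroVarCols <- xCols[sapply(nyct[,xCols], var) == 0]
nyct2 <- nyct[, !(names(nyct) %in% zeroVarCols)]
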

And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.

New Package, New Book!

Sorry I haven’t been very active on this blog lately, but now that I have more time, that will change. I’ve got myriad things to say.

To begin with, then, I’ll announce a major new R package, and my new book.

qeML package (“quick and easy machine learning”)

Featured aspects:

  • Now on CRAN, https://cran.r-project.org/package=qeML.
  • See GitHub README for intro, https://github.com/matloff/qeML.
  • Extremely simple, “one liner” user interface.
  • Ideal for teaching: very simple interface, lots of included datasets, included ML tutorials.
  • Lots for the advanced MLer too: special ML methods, advanced graphics, etc.
  • Lots of related “nuts and bolts” infrastructure functions, e.g. involving R factor conversion.

The Art of Machine Learning, N. Matloff, NSP, 2023

  • For those without prior ML background. (Experienced MLers may learn a thing or two as well.)
  • Anti-“cookbook”: Aims to empower readers to actually use ML in real, non-“sanitized” applications.
  • Really gets into the How? and Why? (and for that matter, the Why Not?).
  • Uses the qeML package.
  • See my writeup on the philosophy of the book, https://github.com/matloff/ArtOfML.

New Statistics Tutorial

I’ve recently completed fastStat, https://github.com/matloff/fastStat, a quick introduction to statistics for those who’ve had a calculus-based probability course. Many such people later need to do statistics, and this will give them quick access. It is modeled after my R tutorial, https://github.com/matloff/fasteR, a quick introduction to R.

It is not just a quick introduction, but a REAL one, a practical one. Even those who do already know statistics will find that they learn from this tutorial.

I write at the top of the tutorial,

…many people know the mechanics of statistics very well, without truly understanding at an intuitive level what those equations are really doing, and this is our focus…For example, consider estimator bias. Students in a math stat course learn the mathematical definition of bias, after which they learn that the sample mean is unbiased and that the sample variance can be adjusted to be unbiased. But that is the last they hear of the issue…most estimators are in fact biased, and lack ‘fixes’ like that of the sample variance. Does it matter? None of that is discussed in textbooks and courses.

The tutorial begins with sampling, with a realistic view of parametric models, estimation, standard errors, statistical inference, the Bias-Variance Tradeoff, and multivariate distributions.

It then moves to a major section on prediction, using both classical parametric and machine learning methods. Emphasis is again on the Bias-Variance Tradeoff, with a view toward overfitting. A fresh view of the latter is presented.

Finally, there is an overview of data privacy methods, of major importance today.

Take a look! Comments welcome.
