qeML Example: Issues of Overfitting, Dimension Reduction Etc.

What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?

Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.

Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.

The call qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10) predicts trip time using qeML’s linear model. (The latter wraps lm, but adds some things and sets the standard qeML call form.) Since the test set is random (as is our data), we’ll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.

> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
         full trip_distance  PULocationID  DOLocationID     DayOfWeek
     238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.
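To make the leave-one-out idea concrete, here is a minimal base-R sketch of the same kind of computation. It is not qeML's implementation: the data are synthetic, and the names holdoutMAPE and leaveOut1 are hypothetical helpers made up for this illustration. It fits lm on a random training portion, measures mean absolute prediction error on the held-out rows, and averages over several random splits.

```r
# Hypothetical synthetic data; none of these names come from qeML.
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 3 * d$x1 + 0.5 * d$x2 + rnorm(n)   # x3 is pure noise

# Mean absolute prediction error of lm on one random holdout split
holdoutMAPE <- function(dat, yName, nHold = 100) {
  idx <- sample(nrow(dat), nHold)                      # test-set rows
  fit <- lm(reformulate(".", response = yName),
            data = dat[-idx, , drop = FALSE])          # fit on training rows
  preds <- predict(fit, dat[idx, , drop = FALSE])
  mean(abs(preds - dat[idx, yName]))
}

# Average over nReps random splits: full model, then each predictor removed
leaveOut1 <- function(dat, yName, nReps = 10) {
  xNames <- setdiff(names(dat), yName)
  full <- mean(replicate(nReps, holdoutMAPE(dat, yName)))
  drops <- sapply(xNames, function(v) {
    reduced <- dat[, setdiff(names(dat), v), drop = FALSE]
    mean(replicate(nReps, holdoutMAPE(reduced, yName)))
  })
  c(full = full, drops)
}

res <- leaveOut1(d, "y")
print(res)
```

A predictor matters to the extent that its entry is larger than the "full" entry. Here the x1 entry should be much larger, while the x3 entry should differ from "full" only by sampling noise.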

But let’s take a closer look. There are 224 pickup locations (run levels(nyctaxi$PULocationID) to see this). That’s 223 dummy (“one-hot”) variables; are some more predictive than others? To explore that in qeLeaveOut1Var, we could make the dummies explicit, so each dummy is removed one at a time:

nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)

This function is actually from the regtools package, included in qeML. Then we could try, say,
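For readers who want to see what such an expansion looks like without loading any packages, here is a small base-R sketch using model.matrix on a made-up data frame. This is an analogy, not the regtools code: model.matrix drops the first level of each factor by default, whereas omitLast=TRUE drops the last one, but either way a factor with k levels yields k-1 dummy columns.

```r
# Hypothetical toy data; 'loc' and 'dist' are made-up names
toy <- data.frame(
  loc  = factor(c("a", "b", "c", "a", "c")),
  dist = c(1.2, 0.7, 3.1, 2.2, 0.9)
)

# Expand factors into 0/1 dummies; [, -1] drops the intercept column
mm <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

print(colnames(mm))   # dummy columns for 'loc', then the numeric 'dist'
print(dim(mm))
```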

nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)

But with so many dummies, this would take a long time to run. We could directly look at the mean trip time for each pickup location, along with the number of trips behind each mean, to get at least some idea of their individual predictive power:

tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.
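One possible way to deal with sparse locations (my illustration here, not something qeML prescribes) is to lump rarely seen levels into a single "other" category before modeling. The sketch below uses synthetic data, and the cutoff minCount is an arbitrary assumption.

```r
set.seed(2)
# Hypothetical data: a location factor with a few rare levels
loc <- factor(sample(c("A", "B", "C", "D", "E"), 200, replace = TRUE,
                     prob = c(0.50, 0.30, 0.15, 0.03, 0.02)))
tripTime <- rexp(200, rate = 1 / 600)

counts <- table(loc)
minCount <- 20                                 # arbitrary cutoff (assumption)
rare <- names(counts)[counts < minCount]       # levels with too little data

# Recode the rare levels into one catch-all level
locLumped <- as.character(loc)
locLumped[locLumped %in% rare] <- "other"
locLumped <- factor(locLumped)

print(table(locLumped))
print(tapply(tripTime, locLumped, mean))
```

The trade-off is that the lumped locations can no longer be distinguished from one another, so this only makes sense if their individual effects could not be estimated reliably anyway.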

> dim(nyct)
[1] 10000  479

An old rule of thumb is to use fewer than sqrt(n) variables, 100 here. It’s just a guide, but 100 is far below the 479 columns we have. (Note: Even our analysis using the original factors still converts to dummies internally; nyctaxi has only 4 predictor columns, but lm will expand them as in nyct.)
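The comparison can be automated. The sketch below counts the columns a linear model would effectively use (k-1 dummies per k-level factor, 1 per numeric column) and sets that against sqrt(n). The helper effectiveP and the toy data are hypothetical, made up for this illustration.

```r
# Hypothetical toy data; column names are made up for illustration
set.seed(5)
locNames <- sprintf("L%02d", 1:30)
toy <- data.frame(
  y    = rnorm(400),
  dist = runif(400),
  loc  = factor(sample(locNames, 400, replace = TRUE), levels = locNames),
  day  = factor(sample(1:7, 400, replace = TRUE), levels = 1:7)
)

# Effective number of predictors after dummy expansion
effectiveP <- function(dat, yName) {
  x <- dat[, setdiff(names(dat), yName), drop = FALSE]
  sum(vapply(x, function(col) if (is.factor(col)) nlevels(col) - 1 else 1,
             numeric(1)))
}

p <- effectiveP(toy, "y")     # 1 numeric + 29 loc dummies + 6 day dummies
limit <- sqrt(nrow(toy))      # the rule-of-thumb ceiling
print(c(p = p, limit = limit))
```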

We may wish to delete pickup location entirely, or possibly use PCA for dimension reduction:

z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)

This qeML call says, “Compute PCA on the predictors, retaining enough principal components to account for 0.75 of the total variance, and then run qeLin on the resulting PCs.”
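Here is a rough base-R sketch of that recipe, using prcomp and lm on synthetic data. It is only meant to show the logic. How qePCA handles centering, scaling and factor columns internally is not assumed here, and all variable names are made up.

```r
set.seed(3)
n <- 300
z1 <- rnorm(n); z2 <- rnorm(n)
# Hypothetical correlated predictors: a,b track z1; c,d track z2; e is noise
x <- data.frame(a = z1 + 0.1 * rnorm(n), b = z1 + 0.1 * rnorm(n),
                c = z2 + 0.1 * rnorm(n), d = z2 + 0.1 * rnorm(n),
                e = rnorm(n))
y <- 2 * z1 - z2 + rnorm(n)

pca <- prcomp(x)                                    # centered, unscaled PCA
cumProp <- cumsum(pca$sdev^2) / sum(pca$sdev^2)     # cumulative variance share
k <- which(cumProp >= 0.75)[1]                      # smallest k reaching 0.75

# Regress y on the first k principal components only
pcs <- as.data.frame(pca$x[, 1:k, drop = FALSE])
fit <- lm(y ~ ., data = cbind(pcs, y = y))

print(round(cumProp, 3))
print(k)
print(coef(fit))
```

To predict new cases, the same rotation must be applied to the new predictors first, e.g. via predict(pca, newdata), before calling predict on the lm fit.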

But…remember those warning messages? Running warnings(), we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the test set, so the fitted model has no basis for predicting them. As a result, many of the dummy columns in the training set are all 0s, thus have 0 variance, thus cause problems with PCA. We might then run qeML::constCols to find out which columns have 0 variance, delete those, and try qePCA again.
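A tiny contrived example shows both symptoms at once. It uses base R only, with var standing in for qeML::constCols, and the split is fixed by hand so that the lone "C" row lands in the test set. All names here are hypothetical.

```r
# Hypothetical data: level "C" occurs exactly once
d <- data.frame(loc = factor(c(rep("A", 5), rep("B", 4), "C")), y = 1:10)

testIdx <- c(1, 10)        # hand-picked so the single "C" row is in the test set
trn <- d[-testIdx, ]
tst <- d[testIdx, ]

# Symptom 1: levels seen in the test set but never in the training set
newLevels <- setdiff(unique(as.character(tst$loc)),
                     unique(as.character(trn$loc)))

# Symptom 2: the factor still carries level "C", so its training dummy is all 0s
mm <- model.matrix(~ loc, data = trn)[, -1, drop = FALSE]
zeroVar <- colnames(mm)[apply(mm, 2, var) == 0]

# Remedy sketched in the text: drop the constant columns before PCA
mmClean <- mm[, setdiff(colnames(mm), zeroVar), drop = FALSE]

print(newLevels)
print(zeroVar)
print(colnames(mmClean))
```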

And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.
