Why Are We Still Teaching t-Tests?

My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot.  One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula.  Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.

I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.”   A precise mathematical understanding of the concepts is crucial to good applications.  But stat curricula are not realistic.

I’ll use Student t-tests to illustrate.  (This is material from my open-source book on probablity and statistics.)  The t-test is an exemplar for the curricular ills in three separate senses:

  • Significance testing has long been known to be under-informative at best, and highly misleading at worst.  Yet it is the core of almost any applied stat course.  Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
  • We prescribe the use of the t-test in situations in which  the sampled population has an exact normal distribution — when we know full well that there is no such animal.  All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family).  [Clarification, added 9/17:  I advocate skipping the t-distribution,  and going directly to inference based on the Central Limit Theorem.  Same for regression.  See my book.]
  • Going hand-in-hand with the t-test is the sample variance. The classic quantity s2 is an unbiased estimate of the population variance σ2, with s2 defined as 1/(n-1) times the sum of squares of our data relative to the sample mean.  The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n.  Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s2, is a BIASED (downward) estimate of σ.  So that n-1 factor is much ado about nothing.

Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.

Sadly, the R language largely caters to this old-fashioned, unwarranted thinking.  The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.

Much more importantly, R’s statistical procedures are centered far too much on significance testing.  Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf.  Or consider log-linear models:  The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors.  (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)

I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output.  A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made.  To be sure, R in fact is modern in many ways, but there are some problems nevertheless.

In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students.  Well, any thoughtful student can see the folly of claiming the t-test to be “exact.”  And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions.  If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?

Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics.  Unfortunately, I’ve never been able to find it.  But a good cleanup along those lines of the way statistics is taught is long overdue.

Good for TI, Good for Schools, Bad for Kids, Bad for Stat

In my last post, I agreed with Prof. Xiao-Li Meng that Advanced Placement (AP) Statistics courses turn off many students to the statistics field, by being structured in a manner that makes for a boring class.  I cited as one of the problems the fact that the course officially requires TI calculators.  This is a sad waste of resources, as the machines are expensive while R is free, and R is capable of doing things that are much more engaging for kids.

Interestingly, this week the Washington Post ran an article on the monopoly that TI calculators have in the schools.  This was picked up by a Slashdot poster, who connected it to my blog post on AP Stat.  The Post article has some interesting implications.

As the article notes, it’s not just an issue of calculators vs. R.  It’s an issue of calculators in general vs. the TI calculator.  Whether by shrewd business strategy or just luck, TI has attained a structural monopoly.  The textbooks and standardized exams make use of TI calculators, which forces all the teachers to use that particular brand.

Further reinforcing that monopoly are the kickbacks, er, donations to the schools.  When my daughter was in junior high school and was told by the school to buy a TI calculator, I noticed at the store that Casio calculators were both cheaper and had more capabilities.  I asked the teacher about this, and she explained that TI makes donations to the schools.

All this shows why Ms. Chow, the Casio rep quoted in the article, is facing an uphill battle in trying to get schools to use her brand. But there is also something very troubling about Chow’s comment, “That is one thing we do struggle with, teachers worried about how long it is going to take them to learn [Casio products].”  Math teachers would have trouble learning to use a calculator?  MATH teachers?!  I am usually NOT one to bash the U.S. school system, but if many math teachers are this technically challenged, one must question whether they should be teaching math in the first place.  This also goes to the point in my last blog post that kids generally are not getting college-level instruction in the nominally college-level AP Stat courses.

Chow’s comment also relates to my speculation that, if there were a serious proposal to switch from TI to R, the biggest source of resistance would be the AP Stat teachers themselves.  Yet I contend that even they would find that it is easy to learn R to the level needed, meaning being able to do what they currently do on TIs—and to go further, such as analyzing large data sets that engage kids, producing nice color graphics.  This is not hard at all; the teachers don’t need to become programmers.

The Post article also brings up the issue of logistics.  How would teachers give in-class tests in an R-based AP Stat curriculum?  How would the national AP Stat exam handle this?

Those who dismiss using R for AP Stat on such logistical grounds may be shocked to know that the AP Computer Science exam is not conducted with a live programmable computer at hand either. It’s all on paper, with the form of the questions being designed so that a computer is not needed.  (See the sample test here.)  My point is that, if even a test that is specifically about programming can be given without a live computer present, certainly the AP Stat course doesn’t need one either.  For that matter, most questions on the AP Stat exam  concentrate on concepts, not computation, anyway, which is the way it should be.

The teachers should demand a stop to this calculator scam, and demand that the textbooks, AP Stat exam etc. be based on R (or some other free software) rather than on expensive calculators. The kids would benefit, and so would the field of statistics.

Statistics: Losing Ground to CS, Losing Image Among Students

The American Statistical Association (ASA)  leadership, and many in Statistics academia. have been undergoing a period of angst the last few years,  They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

  • The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
  • Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidson write a plaintive editorial titled, “Aren’t We Data Science?”

Good, the ASA is taking action, I thought.  But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics:  Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become.  Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R enthusiasist.

CS vs. Statistics

Let’s consider the CS issue first.  Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly.  To many of us, though, this is just  “old wine in new bottles,” with the “wine” being Statistics.  But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps.  I’ve spent most of my career in the Computer Science Dept. at the University of California, Davis, but I began my career in Statistics at that institution.  My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology.  I was one of the seven charter members of the Department of Statistics.   Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature.  With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups.  (A friend who read a draft of this post joked it should be titled “J’accuse”  but of course this is not my intention.)   However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric:  What is poor Statistics to do?

Well then, how did CS come to annex the Stat field?  The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI).  Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data.  No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days.  Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects.  Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another.  Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas.  This is dramatically demonstrated by statements that are made like,  “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics.  ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that.  The problem is not that CS people are doing Statistics, but rather that they are doing it poorly:  Generally the quality of CS work in Stat is weak.  It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented.  Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:

  • CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals.  This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work.  The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
  • Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is heavy pressure to bring in lots of research funding, and produce lots of PhD students.  Large amounts of time is spent on trips to schmooze funding agencies and industrial sponsors,  writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand.  This is made even worse by the rapid change in the fashionable research topic de jour, making it difficult to go into a topic in any real depth.  Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
  • There is rampant “reinventing the wheel.”  The above-mentioned  lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature.  This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of.  For instance, consider a paper on the use of mixed labeled and unlabeled training data in classification.  (I’ll omit names.)   One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
  • Again for historical reasons, CS research is largely empirical/experimental in nature.  This causes what in my view is one of the most serious problems plaguing CS research in Stat–lack of rigor.  Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine.  But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on.  For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; the paper really stressed this point, yet actually, one can add quadratic terms and so on to model this.
  • This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.  Most empirical work in CS doesn’t have any models to worry about.  That’s entirely  appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work.  A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments  in the world.  She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution.  She couldn’t answer–she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
  • Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword.   On the one hand, this is a  huge plus, leading to highly impressive feats such as recognizing faces in a crowd.  But this mentality leads to  an oversimplified view of things,  with everything being viewed as a paradigm shift.  Neural networks epitomize this problem.  Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification.   (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.)  Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem–not because fundamentally new technology has been invented.

All this matters–a LOT.  In my opinion, the above factors result in highly lamentable opportunity costs.   Clearly, I’m not saying that people in CS should stay out of Stat research.  But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them.   This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

Making Statistics Attractive to Students

This of course is an age-old problem in Stat.  Let’s face it–the very word statistics sounds hopelessly dull.  But I would argue that a more modern development is making the problem a lot worse–the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat.  He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.”  That says it all, doesn’t it?  And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students.  No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics.  It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter.  A typical example is that a student complained to me that even though he had attended a top-quality high school in the heart of Silicon Valley, his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s2 .  But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on.  AP courses are ostensibly college level, but the students are not getting college-level instruction.  The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle.  The machines are expensive, and after all we are living in an age in which R is free!  Moreover, the calculators don’t have the capabilities of dazzling graphics and analyzing of nontrivial data sets that R provides–exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually  can be fixed reasonably simply.  If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program.   Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and is multi platform, with outstanding graphical capabilities.  There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: an Introduction Using R, and Peter Dalgaard’s Introductory Statistics Using R.  But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry.  Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.

This is not a complete solution by any means.  There still is the issue of AP Stat being taught by people who lack depth in the field, and so on.  And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right?  Switching to R would be doable–and should be done.

A Matrix Powers Package, and Some General Edifying Material on R

Here I will introduce matpow, a package to flexibly and conveniently compute matrix powers.  But even if you are not interested in matrices, I think many of you will find that this post contains much general material on R that you’ll find useful.  Indeed, most of this post will be about general R issues, not so much about matrices per se.  So, bear with me.

Why matrix powers?  

Sadly, most university beginning linear algebra courses say very little about why the material is important.  Worse, if some motivation is presented, it tends to be physics-oriented.  Indeed, the current trend in U.S. universities is to fold the introductory linear algebra curriculum into the differential equations course.

Most readers of this blog know better, as they are (to various extents) aware of the major roles that matrices play in statistics, e.g. in linear models and principal components analysis.  But did you know that matrices play a major role in the analysis of social networks?  We’re not in Physicsland anymore, Toto.

Matrix powers are useful here.  For instance, suppose we wish to determine whether a network (or graph) is connected, meaning that every node leads to every other node in one or more steps.  (The famous Six Degrees of Separation notion comes from this context.) It turns out that can be determined by taking powers of the adjacency matrix A of the network, whose (i,j) element is 1 if there is a one-step link from i to j, 0 otherwise.  (For technical reasons, we will set the diagonal of A to all 1s.)  If some power of A has all its elements nonzero, then the graph is connected; it’s disconnected if this situation is never reached.  (One need calculate only up to the power n-1, where n is the number of rows and columns of the matrix.)  Moreover, eigenanalysis of A yields other important information about the network, and one way to compute eigenvectors of a matrix involves finding its powers too.

But why have a package to compute the powers?

Isn’t this already straightforward in R?  For instance, the third power of A is computed via the code

m %*% m %*% m

and higher powers can be coded with a simple loop.  But we needed to be much more flexible and versatile than this in developing matpow.

1.   Generality:   Our matpow package accommodates various classes of matrices.

There are many of these in the R world.  In addition to the basic “matrix” class, there is also the “Matrix” class (from the Matrix package), the “big.matrix” class (from bigmemory) and so on.  Syntax can vary; with bigmemory, for instance, one must use brackets in assignment, e.g.

m1[,]

And most important, not all of these use %*% as their multiplication operator.  For example, in gputools one uses the function gpuMatMult() for this purpose.

2.  Accommodation of parallel computation:  Today’s matrices can be huge, so parallel computation would be nice.  Our matpow package does not itself include facilities for parallel computation, but it is designed to integrate well with external parallel methods, as mentioned above for gputools, an R package that interfaces to GPUs.

3.  Capability to include callback functions, to perform application-specific operations after each new power is calculated.

For instance, consider the social network example again.  As we compute higher and higher powers of A, we want to check for an “all non zeroes” condition after each iteration; if the condition is found to hold, there is no point in doing any more iterations.  We can have our callback function check for this.

And there is more.  The (i,j) element in the rth power of A can be shown to be the number of ways to go from node i to node j in r steps. If that element is now positive but was 0 in power r-1 of A, we now know that the shortest path from i to j consists of r steps.  So, we can have our callback function watch for changes from zero to nonzero, and record them, and thus have the overall internode distance matrix in the end.

A few quick examples:

Here is a quick tour, including of the squaring option.  There we square the input matrix, then square the result and so on, thus requiring only a logarithmic number of steps to reach our desired power.  This is much faster if we don’t need the intermediate powers.

> m <- rbind(1:2,3:4)
> ev <- matpow(m,16)  # find m to 16th power
> ev$prod1  # here it is
[,1] [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i # last iteration was 15th
[1] 15
> ev <- matpow(m,16,squaring=TRUE)
> ev$prod1  # same result
[,1] [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i  # but with only 4 iterations
[1] 4

# test network connectivity
> m <-    rbind(c(1,0,0,1),c(1,0,1,1),c(0,1,0,0),c(0,0,1,1))
> ev <- matpow(m,callback=cgraph,mindist=T)
> ev$connected
[1] TRUE
> ev$dists
[,1] [,2] [,3] [,4]
[1,] 1 3 2 1
[2,] 1 1 1 1
[3,] 2 1 1 2
[4,] 3 2 1 1

 

How can we deal with different matrix classes/multiplication syntaxes?

As noted, one problem we needed to deal with in developing matpow was how to accommodate diverse matrix classes and multiplication syntaxes.  We solved that problem by using R’s eval() and parse() functions.  Consider the following toy example:

> x <- 28
> s <- "x <- 16"
> eval(parse(text=s))
> x
[1] 16

Of course, here it is just a very roundabout way to set x to 16.   But for matpow, it’s just what we need.  We want our multiplication command in string form to be something like prod <- m %*% m in the “matrix” case,  prod[,] <- m[,] %*% m[,] in the “big.matrix” (bigmemory) case, and so on.

The (built-in or user-supplied) argument genmulcmd (“generate multiplication command”) to matpow() handles this.  For instance, here is the built-in one for gputools:

 genmulcmd.gputools <- function (a, b, c) 
   paste(c, " <- gpuMatMult(", a, ",", b, ")")

We then can plug the string into eval() and parse().  In this manner, we can switch matrix types by changing not even a full line of code, just an argument to matpow().

How can we implement callbacks?

Recall our example matpow() call:

matpow(m,callback=cgraph,mindist=T)

Here we are specifying that there is a callback function (the default is NULL), and it is to be cgraph().  This happens to be built-in for the package, but it could be user-written.

The key issue is how data will be communicated between matpow() and the callback–in BOTH directions, i.e. we need the callback to write information back to matpow().   This is not easy in R, which as a functional language tries to avoid side effects.

Consider for example this:

> x <- sample(1:20,8)
> x
[1] 5 12 9 3 2 6 16 18
> sort(x)
[1] 2 3 5 6 9 12 16 18
> x
[1] 5 12 9 3 2 6 16 18

The point is that x didn’t change, i.e. we had no side effect to the argument x.  If we do want x to change, we must write

> x <- sort(x)

But environments are different.  The matpow() function maintains an environment ev, which it passes to the callback as an argument. Though ev is like an R list, and its components are accessed via the familiar $ operator, the difference is that the callback can change those components (a side effect), thus communicate information back to matpow().

For instance, look at this code from cgraph():

if (all(prd > 0)) { 
   ev$connected <- TRUE 
   ev$stop <- TRUE 
}

We’ve discovered that the network is connected, so we set the proper two components of ev.    Back in matpow(), the code will sense that ev$stop is TRUE, and cease the iteration process.  Since matpow() uses ev as its return value, the user then can see that the network is connected.

Other issues:

Again, keep in mind that matpow() will use whatever form of matrix multiplication you give it.  If you give it a parallel form of multiplication, then matpow()‘s computation will be parallel.  Or if your R is configured with a multicore-capable BLAS, such as OpenBLAS, you again will be computing in parallel.

Though we have gotten good performance in this context with gputools, it should be noted that gpuMatMult() returns the product to the R caller.  This causes unnecessary copying, which on GPUs can sap speed.  Code that maintains the product on the GPU from iteration to iteration would be best.

Where to get matpow():

We plan submission to CRAN, but for now download it from here. Unzip the package, then run R CMD INSTALL on the directory that is created as a result of unzipping.

New freqparcoord Example

In my JSM talk this morning, I spoke about work done by Yingkang Xie and myself, on a novel approach to the parallel coordinates method of visualization.  I’ve made several posts to this blog in the past on freqparcoord, our implemention of our method.

My talk this morning used some recently-available NYC taxi data.  You may find the discoveries made on this data by freqparcoord of interest.  See my slides from the talk.

Code Snippet: Extracting a Subsample from a Large File

Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.”  With a very large file, it is often much easier and faster–and really, just as good–to just work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply.  And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible.  In other words, functions like read.table() are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea.  (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, due to lack of knowledge of the total number of records, I’m also assuming that extracting every kth record is sufficiently “random” for you.

So, here is the code (downloadable from here):

subsamfile <- function(infile,outfile,k,header=T) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum = 0
   numout = 0
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) { # end of file?
         close(co) 
         return(numout)
      }
   recnum <- recnum + 1
   if (recnum %% k == 0) {
      numout <- numout + 1
      writeLines(inrec,co)
   }
  }
}

Very straightforward code.  We use file() to open the input and output files, and read in the input file one line at a time, by specifying the argument n = 1 in the first call to file().  Each inputted record is a character string.  To sense the end-of-file condition on the input file, we test whether the input record has length 0.  (Any record, even an empty one, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting n = 1.)

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()).  This may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like cullrecs <- sample(1:ntotrecs,m,replace=FALSE) where m is the desired number of records to extract, and then whenever recnum matches the next element of cullrecs, we’d write that record to outfile.

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.

A Handy Trick for Remote Graphics

I often create plots that require quite a bit of computation.  Ideally I would run this on what I’ll call Machine A, which is a very fast machine, but I am often far away, on Machine B.  So, I’d like to run my computation on B but display it on A.

For the platforms I use (Linux, Mac), I can simply use X11 forwarding, by typing ssh -Y at B to log into A.  You may not have that capability, so I’ll give you a simple method here to accomplish the same thing without X11.  And as a bonus, if you’ve never used R sockets, you can learn something about them along the way.  (You will need the ability to run a server at B.  But just try the commands below to check whether you can do this.)  Here’s what to do:

1.  In an R session at A, run


> con <- socketConnection(port=11000,server=T,
blocking=T,open="a+b",timeout=600)

This starts the server, but the above command will hang until we run a corresponding command at B:


> con <- socketConnection(host=quoted_address_of_A,port=11000,
server=F,blocking=T,open="a+b",timeout=600)

At this point you will have your R prompts at both A and B, so you’re all set.

Now run your computation at A. For instance, I might use my freqparcoord package (see several of my earlier blog postings), which returns the graph in a ggplot2 object. I feed that object into R’s serialize() function, which encodes it and then sends it through my socket con:


> serialize(freqparcoord(tx50,10,c(6,9,11:14)),con)

Then at B, I receive it:


> unserialize(con)

That reads the message from con, decodes it, and prints the result. The latter action results in the screen display.

You can keep doing this as long as you like. Note that I did set a 10-minute timeout period, so if you will have long thoughts, frequent phone interruptions etc., you may wish to set a longer period. You can close the connection by typing


> close(con)

Follow

Get every new post delivered to your Inbox.

Join 77 other followers