Code Snippet: Extracting a Subsample from a Large File

Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.” With a very large file, it is often much easier and faster (and really, just as good) to work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply.  And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible.  In other words, functions like read.table() are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea.  (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, since the total number of records is unknown, I’m also assuming that extracting every kth record is sufficiently “random” for you.

So, here is the code (downloadable from here):

subsamfile <- function(infile,outfile,k,header=TRUE) {
   ci <- file(infile,"r")   # input connection
   co <- file(outfile,"w")  # output connection
   if (header) {
      # copy the header line through unchanged
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum <- 0  # number of records read so far
   numout <- 0  # number of records written so far
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) { # end of file?
         close(ci)
         close(co)
         return(numout)
      }
      recnum <- recnum + 1
      # keep every kth record
      if (recnum %% k == 0) {
         numout <- numout + 1
         writeLines(inrec,co)
      }
   }
}

Very straightforward code. We use file() to open the input and output files, and read in the input file one line at a time, by specifying the argument n = 1 in each call to readLines(). Each record read in is a character string. To sense the end-of-file condition on the input file, we test whether the input record has length 0. (Any actual record, even an empty line, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting n = 1. Only at end of file does readLines() return a vector of length 0.)
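
To see the end-of-file test in isolation, here is a minimal sketch. The temporary file and the variable names tmp, rec and lens are just for illustration and are not part of subsamfile(). It records the length of what readLines() returns on each call, including for an empty line.

```r
# Write a tiny three-line file whose middle line is empty.
tmp <- tempfile()
writeLines(c("first", "", "third"), tmp)

ci <- file(tmp, "r")
lens <- integer(0)
repeat {
   rec <- readLines(ci, n = 1)
   lens <- c(lens, length(rec))   # 1 for a real record, 0 at end of file
   if (length(rec) == 0) break
}
close(ci)
unlink(tmp)
print(lens)
```

The empty middle line still comes back as a 1-element vector (containing the string ""), so only the final call yields length 0.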

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()). Note that this count includes the header line, if there is one. This may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like cullrecs <- sort(sample(1:ntotrecs,m,replace=FALSE)) where m is the desired number of records to extract. The sort() is needed because we read the file in order: whenever recnum matches the next element of cullrecs, we’d write that record to outfile and move on to the following element.
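
As a rough sketch of that alternative, assuming the record count is already known and excludes any header line: the function name subsamfile_random and the arguments m and ntotrecs are my own illustrative choices here, and I use sample.int(), which draws without replacement by default, in place of sample(1:ntotrecs, ...).

```r
# Hypothetical variant of subsamfile(): write a simple random sample of
# m records, given the total record count ntotrecs (not counting the header).
subsamfile_random <- function(infile, outfile, m, ntotrecs, header = TRUE) {
   ci <- file(infile, "r")
   co <- file(outfile, "w")
   on.exit({ close(ci); close(co) })   # close both files however we exit
   if (header) writeLines(readLines(ci, n = 1), co)
   # sort the chosen record numbers so they can be matched in file order
   cullrecs <- sort(sample.int(ntotrecs, m))
   nextidx <- 1   # position in cullrecs of the next record to keep
   recnum <- 0
   while (nextidx <= m) {
      inrec <- readLines(ci, n = 1)
      if (length(inrec) == 0) break   # file was shorter than ntotrecs
      recnum <- recnum + 1
      if (recnum == cullrecs[nextidx]) {
         writeLines(inrec, co)
         nextidx <- nextidx + 1
      }
   }
   nextidx - 1   # number of records written
}
```

A side benefit of this version is that it can stop reading as soon as the last selected record has been written, rather than scanning to the end of the file.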

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.


5 thoughts on “Code Snippet: Extracting a Subsample from a Large File”

  1. If you want to down-sample to 10% of original size, just do the R equiv of “if (rand(1.0) < .1)” instead of checking mod k. Gets around any periodicity in the data, too, right?

      1. Oops, above written in haste en route to SFO. It’s the same issue, of course.

        My main goal in posting this code snippet was to introduce file access operations to R users who don’t have much programming background.

  2. Reservoir sampling (http://en.wikipedia.org/wiki/Reservoir_sampling) is also a slightly-more-complex version of what you suggested. It actually does produce a simple random sample of k lines out of n, without ever having more than k lines in memory at once. (Yours is something like a systematic sample, and would be exactly that if you choose one of the first k lines at random and every k-th one thereafter.)

    I didn’t even know that R had functions to handle files at this level, so thank you for sharing this.

    1. Our freqparcoord package does have a random-subsample option, but we believe the various “typical lines” options are more useful.

      Glad to hear you found the material on R file functions useful. I’m sure a lot of R users aren’t aware of them, which is the main reason for my post.
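
For readers curious about the reservoir sampling mentioned in the comments above, here is a minimal sketch using the same file functions. It follows the basic algorithm described on the linked Wikipedia page; the function name reservoir_sample and its arguments are hypothetical, and it returns the sampled lines rather than writing them to a file.

```r
# Hypothetical helper: keep a simple random sample of k lines from a
# file of unknown length, holding at most k lines in memory.
reservoir_sample <- function(infile, k, header = TRUE) {
   ci <- file(infile, "r")
   on.exit(close(ci))
   if (header) hdr <- readLines(ci, n = 1)   # read past the header line
   reservoir <- character(0)
   recnum <- 0
   repeat {
      inrec <- readLines(ci, n = 1)
      if (length(inrec) == 0) break          # end of file
      recnum <- recnum + 1
      if (recnum <= k) {
         reservoir[recnum] <- inrec          # fill the reservoir first
      } else {
         j <- sample.int(recnum, 1)          # uniform on 1..recnum
         if (j <= k) reservoir[j] <- inrec   # replace with probability k/recnum
      }
   }
   reservoir
}
```

Note that the sampled lines come back in reservoir order, not file order, and that a file with fewer than k records is simply returned in full.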
