Code Snippet: Extracting a Subsample from a Large File

Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.”  With a very large file, it is often much easier and faster, and really just as good, to work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply.  And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible.  In other words, functions like read.table() are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea.  (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, since the total number of records is unknown, I’m also assuming that extracting every kth record is sufficiently “random” for you.

So, here is the code:

subsamfile <- function(infile,outfile,k,header=TRUE) {
   # open connections for line-by-line reading and writing
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {  # copy the header line, if there is one
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum <- 0  # number of records read so far
   numout <- 0  # number of records written so far
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) { # end of file?
         close(ci)
         close(co)
         return(numout)
      }
      recnum <- recnum + 1
      if (recnum %% k == 0) { # keep every kth record
         numout <- numout + 1
         writeLines(inrec,co)
      }
   }
}

Very straightforward code.  We use file() to open the input and output files, and read in the input file one line at a time, by specifying the argument n = 1 in each call to readLines().  Each record read in is a character string.  To sense the end-of-file condition on the input file, we test whether the input record has length 0.  (Any actual record, even an empty one, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting n = 1.)  Note that both connections are closed before the function returns.
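As a quick, hypothetical illustration (the file names here are placeholders, not from any real data set):

nkept <- subsamfile("big.csv","sample.csv",k=1000) # keep every 1000th record
nkept  # number of records written, excluding the header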

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()). This may take a long time, but if we are willing to incur that cost, the above code can be changed to extract truly random records. We’d do something like cullrecs <- sample(1:ntotrecs,m,replace=FALSE), where m is the desired number of records to extract; then whenever recnum matched the next element of cullrecs, we’d write that record to outfile. A sketch of this variant follows.
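Here is a minimal sketch of that variant, under the assumptions above; the function name samrecsfile is mine, and ntotrecs should count data records only (subtract 1 from the wc -l figure if the file has a header):

samrecsfile <- function(infile,outfile,m,ntotrecs,header=TRUE) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) writeLines(readLines(ci,n=1),co)
   # sorted record numbers to keep, drawn without replacement
   cullrecs <- sort(sample(1:ntotrecs,m,replace=FALSE))
   recnum <- 0
   numout <- 0
   while (numout < m) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) break # fewer records than expected
      recnum <- recnum + 1
      if (recnum == cullrecs[numout+1]) { # next record to keep?
         numout <- numout + 1
         writeLines(inrec,co)
      }
   }
   close(ci)
   close(co)
   numout
}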

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.

5 thoughts on “Code Snippet: Extracting a Subsample from a Large File”

  1. If you want to down-sample to 10% of original size, just do the R equiv of “if (rand(1.0) < .1)” instead of checking mod k. Gets around any periodicity in the data, too, right?

      1. Oops, the above was written in haste en route to SFO. It’s the same issue, of course.

        My main goal in posting this code snippet was to introduce file access operations to R users who don’t have much programming background.
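        For readers who’d like to try the runif()-based idea suggested above, here is a minimal sketch; the function name bernoullisample is illustrative, and each record is kept independently with probability p:

        bernoullisample <- function(infile,outfile,p,header=TRUE) {
           ci <- file(infile,"r")
           co <- file(outfile,"w")
           if (header) writeLines(readLines(ci,n=1),co)
           repeat {
              inrec <- readLines(ci,n=1)
              if (length(inrec) == 0) break # end of file
              if (runif(1) < p) writeLines(inrec,co) # coin flip per record
           }
           close(ci)
           close(co)
        }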

  2. Reservoir sampling (http://en.wikipedia.org/wiki/Reservoir_sampling) is also a slightly more complex version of what you suggested. It actually does produce a simple random sample of k lines out of n, without ever having more than k lines in memory at once. (Yours is something like a systematic sample, and would be exactly that if you chose one of the first k lines at random and every k-th one thereafter.)

    I didn’t even know that R had functions to handle files at this level, so thank you for sharing this.

    1. Our freqparcoord package does have a random-subsample option, but we believe the various “typical lines” options are more useful.

      Glad to hear you found the material on R file functions useful. I’m sure a lot of R users aren’t aware of them, which is the main reason for my post.
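      For the curious, here is a minimal sketch of the reservoir sampling mentioned above (Algorithm R); the function name reservoir is mine, not from the thread. It returns a simple random sample of m lines using only O(m) memory:

      reservoir <- function(infile,m) {
         ci <- file(infile,"r")
         res <- character(m) # the "reservoir" of kept lines
         recnum <- 0
         repeat {
            inrec <- readLines(ci,n=1)
            if (length(inrec) == 0) break # end of file
            recnum <- recnum + 1
            if (recnum <= m) {
               res[recnum] <- inrec # fill the reservoir first
            } else if (runif(1) < m/recnum) {
               res[sample(m,1)] <- inrec # replace a random slot
            }
         }
         close(ci)
         res[seq_len(min(recnum,m))]
      }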
