R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is “more CS-ish than Python,” just to provoke my fellow computer scientists. 🙂

But one aspect that I think is huge but probably gets lost when I cite it is R’s row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive. My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:

> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5"  "12" "13" "9"  "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
 5 12 13  9  6
 1  2  3  4  5

So, from now on, to do the lookup, I just use as subscript the character form of the original ID, e.g.

> xLookup['12']
12
 2

Of course, I did all this within program code, where the lookup table was named ul. So to change a column of IDs to the new ones, I wrote

ul[as.character(testSet[,1])]

Lots of other ways to do this, of course, but it shows how handy the names can be.

14 thoughts on “R > Python: a Concrete Example”

Love your work as always, but if this is a comparison, what might be a good Python example of the same problem?

matloff says:

November 21, 2018 at 6:04 pm

Another reader just posted a comment with a very complex (though clever) Python version.

One could use Python dictionaries, e.g.

>>> d = {'5':1,'12':2,'13':3,'9':4,'6':5}
>>> d['12']
2

That’s fine to do “by hand” in this little example, but to do it in general programmatically would be more involved. One might try the .keys() built-in method, etc.
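For what it's worth, building such a dictionary programmatically can be fairly short. The following is only a sketch, assuming the IDs arrive as a plain Python list; the names ids, unique_ids and lookup are illustrative. It uses dict.fromkeys() to keep the distinct IDs in first-seen order, much as unique() does in R.

```python
# Hypothetical input: the ID data from the post.
ids = [5, 12, 13, 9, 12, 5, 6]

# dict.fromkeys keeps one entry per distinct ID, in order of first appearance
# (dict insertion order is guaranteed from Python 3.7 on).
unique_ids = list(dict.fromkeys(ids))

# Map the character form of each distinct ID to a new 1-based ID.
lookup = {str(old): new for new, old in enumerate(unique_ids, start=1)}

print(lookup)        # {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}
print(lookup['12'])  # 2
```

The string keys are kept only to mirror the R version; integer keys would work just as well in a dictionary.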

1. Jari says:
  
  November 21, 2018 at 11:45 pm
  
  My comment does exactly that though in 2 lines of python…
  
  1. matloff says:
    
    November 22, 2018 at 8:59 am
    
    Again, succinctness is in the eye of the beholder. I would claim, though, that the average data scientist would be much more likely to come up with my solution than yours.
    
    Moreover, row/column/element names in R often give one needed information “for free,” 0 lines. E.g. in that same training/test set example, one might want to know, for a given row in the test set, its original row number in the full set:
    
    > d
      x  y
    1 4  5
    2 5 12
    3 6 13
    > d23
      x  y
    2 5 12
    3 6 13
    
    Ah, (5,12) had been row 2 in the original full data.
    
    1. matloff says:
      
      November 22, 2018 at 9:14 am
      
      Looks like some characters didn’t survive WordPress’ code format. 🙂 But I think the meaning is clear. I created a data frame d of 3 rows, then extracted the second and third rows into a new data frame d23. The row numbers of d23 were the ones those rows had back in d.
      
      1. jarisafi says:
        
        November 25, 2018 at 9:21 am
        
        Oh, I see. This makes sense. Since Python is general purpose, I see little need for its built-in data structures to provide a (relatively) niche feature like this. That said, what you’re describing seems to be exactly how the pandas library does things.
        
        I would argue it’s more “CS-ish” that way, but to each their own :). The same applies to the data type conversions, though I recognize that they may be nice in a specific context for R users.
        
        
        matloff says:
        
        November 25, 2018 at 10:15 am
        
        Right. Pandas seems to be inspired by R in many aspects.
        
        But that’s not what I meant by “CS-ish.” Instead, I was referring to things like R being a functional language, having lots of metaprogramming features, a choice of several class structures and so on. (Though Python people may strenuously object to the latter.)
        
2. Andre Bieler says:
  
  November 22, 2018 at 2:36 pm
  
  Interesting. As a Python user, the solution of Jari seems as straightforward as it can be. Why would you make the xLookup keys strings? xLookup[12] seems easier than xLookup['12']. Is this by choice or a side effect of your implementation?
  
  I’ll throw in another way to do it in Python that might be more similar to your approach:
  
  ids = [5,12,13,9,12,5,6]
  unique_ids = set(ids)
  index = range(len(unique_ids))
  lookup = dict(zip(unique_ids, index))
  
  but really Jari’s solution with dict comprehensions seems nicer:
  
  lookup = {value: index for index, value in enumerate(unique_ids)}
  
  as the generation of the index list is unnecessary.
  
  Anyway let me know what you think.
  
  1. matloff says:
    
    November 22, 2018 at 3:09 pm
    
    Both you and Jari seem to consider the switch to character form artificial. But it is quite common in R, and is considered a virtue. As I mentioned in my second reply to Jari, one gets these names “for free,” with 0 extra effort.
    
    In my original post, for instance, the result of xLookup['12'] is not only 2 but 2 with the automatic name '12'. So, in subsequent code, if I ever need to know the original ID for this ID 2, it’s right there, 12.
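For contrast, a plain Python dictionary maps only one way, so recovering the original ID takes an explicit reverse table. A minimal sketch, assuming the new IDs are distinct; reverse is a hypothetical name, and lookup is the dictionary from the small example earlier in the thread:

```python
# The lookup table from the post, as a Python dictionary.
lookup = {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}

# Invert it: new ID -> original ID (valid only because the values are distinct).
reverse = {new: old for old, new in lookup.items()}

print(lookup['12'])  # 2
print(reverse[2])    # 12
```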
    
    In an R list, somewhat like a Python dictionary, one has the option of using either a key or numeric index. If l is set to list(x = 4:6, y = c(5,12,13)), say, then we can access that second element as either l[['y']] or l[[2]]. This kind of stuff comes in really handy.
    
    These kinds of things are really useful in graphics as well.
    
    I think that you and Jari are viewing things from a pure algorithm/data structures point of view. R people look at things from a DATA point of view.
    

I landed here randomly, was hoping for a Python implementation for comparison, and since that wasn’t there I figured I’d try to make one myself. The lookup table generation in Python is pretty straightforward and succinct:

ids = [5,12,13,9,12,5,6]
lookup = {id1: ii
          for ii, id1 in enumerate(list(set(ids)))}
lookup[12]

I’m unfamiliar with R so I can’t quite tell what the line

ul[as.character(testSet[,1])]

is supposed to be doing. I don’t see the lookup showing up here either.

matloff says:

November 21, 2018 at 5:56 pm

I guess succinctness is in the eye of the beholder. 🙂

In the usage example you ask about here, say there is a 12 in column 1 of testSet. The code changes that to '12', and evaluates ul['12'], where ul is the lookup table.
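As a rough Python analogue of that usage, here is a sketch that assumes the test set is a plain list of rows and that ul is a dictionary keyed by the character form of the IDs. The name test_set and its values are made up for illustration.

```python
# Lookup table keyed by the character form of the original IDs.
ul = {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}

# Hypothetical test set: column 0 holds the original IDs.
test_set = [[12, 0.3], [5, 1.7], [6, 2.2]]

# Convert each ID to a string and look it up, one row at a time.
new_ids = [ul[str(row[0])] for row in test_set]
print(new_ids)  # [2, 1, 5]

# Write the new IDs back into column 0.
for row, new_id in zip(test_set, new_ids):
    row[0] = new_id
```

The R expression does the same thing in one vectorized step, since subscripting by a character vector looks up every element at once.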

1. Jari says:
  
  November 21, 2018 at 11:49 pm
  
  Ah, the lookup variable in my code does the same thing (except I don’t convert the numbers to strings as I don’t need to).
  
  We can be more scientific about succinctness:
  R code: 99 characters
  Python code (with variable names changed to match the R code): 77 characters.
  
  I have no intention of claiming one is better, or even more readable (the python code is infinitely more readable for me but then again I dream in python), but one is definitely smaller in multiple ways (fewer statements, fewer characters, fewer type conversions, etc).
  


Mad (Data) Scientist


Musings, useful code etc. on R and data science
