R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is “more CS-ish than Python,” just to provoke my fellow computer scientists. 🙂

One aspect that I think is huge, but that probably gets lost when I cite it, is R’s row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive.  My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:

 

> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5"  "12" "13" "9"  "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
 5 12 13  9  6
 1  2  3  4  5

So, from now on, to do the lookup, I just use the character form of the original ID as the subscript, e.g.

> xLookup['12']
12
 2

Of course, I did all this within program code. So to change a column of IDs to the new ones, I wrote

 ul[as.character(testSet[,1])])

Lots of other ways to do this, of course, but it shows how handy the names can be.

14 thoughts on “R > Python: a Concrete Example”

    1. Another reader just posted a comment with a very complex (though clever) Python version.

      One could use Python dictionaries, e.g.


      >>> d = {'5':1,'12':2,'13':3,'9':4,'6':5}
      >>> d['12']
      2

      That’s fine to do “by hand” in this little example, but to do it in general programmatically would be more involved. One might try the .keys() built-in method, etc.
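For comparison, here is a minimal Python sketch of building such a dictionary programmatically rather than by hand. The function name `build_lookup` is hypothetical, and the sketch assumes Python 3.7+, where dictionaries preserve insertion order, so that `dict.fromkeys` keeps the IDs in first-appearance order much as R's `unique()` does.

```python
def build_lookup(ids):
    """Map each distinct ID (as a string key) to a new consecutive ID starting at 1."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    distinct = dict.fromkeys(str(i) for i in ids)
    return {key: new_id for new_id, key in enumerate(distinct, start=1)}


x = [5, 12, 13, 9, 12, 5, 6]
d = build_lookup(x)
print(d)        # {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}
print(d['12'])  # 2
```

The string keys are kept only to mirror the R names; integer keys would work just as well in Python.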

        1. Again, succinctness is in the eye of the beholder. I would claim, though, that the average data scientist would be much more likely to come up with my solution than yours.

          Moreover, row/column/element names in R often give one needed information “for free,” 0 lines. E.g. in that same training/test set example, one might want to know, for a given row in the test set, its original row number in the full set:


          > d <- data.frame(x = 4:6, y = c(5,12,13))
          > d
            x  y
          1 4  5
          2 5 12
          3 6 13
          > d23 <- d[2:3,]
          > d23
            x  y
          2 5 12
          3 6 13

          Ah, (5,12) had been row 2 in the original full data.

          1. To spell the example out: I created a data frame d of 3 rows, then extracted the second and third rows into a new data frame d23. The row numbers of d23 were the ones those rows had back in d.
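In plain Python, a rough analogue is to carry the row labels along explicitly, for example as dictionary keys. This is only a sketch: the names `d` and `d23` follow the example above, and using a dict of tuples to stand in for a data frame is an assumption made for illustration.

```python
# Rows keyed by their original row number (standing in for R's row names)
d = {1: (4, 5), 2: (5, 12), 3: (6, 13)}

# Extract rows 2 and 3; the keys travel with the rows
d23 = {row: d[row] for row in (2, 3)}

for row, (x, y) in d23.items():
    print(row, x, y)
# 2 5 12
# 3 6 13
```

The difference from R is that the labels have to be set up and maintained by hand here, whereas a data frame supplies them automatically.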

            1. Oh, I see. This makes sense. Since Python is general purpose, I see little need for its built-in data structures to provide a (relatively) niche feature like this. That said, what you’re describing seems to be exactly how the pandas library does things.

              I would argue it’s more “CS-ish” that way, but to each their own :). The same applies to the data type conversions, though I recognize that they may be nice in a specific context for R users.

              1. Right. Pandas seems to be inspired by R in many aspects.

                But that’s not what I meant by “CS-ish.” Instead, I was referring to things like R being a functional language, having lots of metaprogramming features, a choice of several class structures and so on. (Though Python people may strenuously object to the latter.)

      1. Interesting. As a Python user, Jari’s solution seems as straightforward as it can be. Why make the xLookup keys strings? xLookup[12] seems easier than xLookup['12']. Is this by choice or a side effect of your implementation?

        I’ll throw in another way to do it in Python that might be more similar to your approach:

        ids = [5,12,13,9,12,5,6]
        unique_ids = set(ids)
        index = range(len(unique_ids))
        lookup = dict(zip(unique_ids, index))

        but really Jari’s solution with a dict comprehension seems nicer, as generating the index list is unnecessary:

        lookup = {value: index for index, value in enumerate(unique_ids)}

        Anyway let me know what you think.

        1. Both you and Jari seem to consider the switch to character form artificial. But it is quite common in R, and is considered a virtue. As I mentioned in my second reply to Jari, one gets these names “for free,” with 0 extra effort.

          In my original post, for instance, the result of xLookup['12'] is not only 2 but 2 with the automatic name '12'. So, in subsequent code, if I ever need to know the original ID for this ID 2, it’s right there, 12.
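In Python, a plain dict lookup returns only the value, so recovering the original ID needs an explicit inverse mapping. A minimal sketch, assuming a lookup dictionary keyed by the character form of the IDs as in the R version; the name `original_id` is hypothetical.

```python
# New-ID lookup keyed by the original ID (in character form, as in the R version)
lookup = {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}

# Invert the mapping: new ID -> original ID
original_id = {new: old for old, new in lookup.items()}

new_id = lookup['12']
print(new_id)               # 2
print(original_id[new_id])  # 12
```

The inversion is safe here only because the new IDs are distinct by construction.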

          In an R list, somewhat like a Python dictionary, one has the option of using either a key or numeric index. If l is set to list(x = 4:6, y = c(5,12,13)), say, then we can access that second element as either l[['y']] or l[[2]]. This kind of stuff comes in really handy.
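A Python dict does not offer both access styles directly. Here is a rough sketch of the closest equivalent, assuming Python 3.7+ insertion-ordered dicts. The variable is named `rec` here in place of `l`, and since Python positions count from 0, R's second element is position 1.

```python
rec = {'x': [4, 5, 6], 'y': [5, 12, 13]}

by_key = rec['y']
# A dict has no positional indexing; materialize the values first
by_position = list(rec.values())[1]

print(by_key)       # [5, 12, 13]
print(by_position)  # [5, 12, 13]
```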

          These kinds of things are really useful in graphics as well.

          I think that you and Jari are viewing things from a pure algorithm/data structures point of view. R people look at things from a DATA point of view.

  1. I landed here randomly, hoping for a Python implementation for comparison, and since that wasn’t there I figured I’d try to make one myself. The lookup table generation in Python is pretty straightforward and succinct:

    ids = [5,12,13,9,12,5,6]
    lookup = {id1: ii
              for ii, id1 in enumerate(list(set(ids)))}
    lookup[12]

    I’m unfamiliar with R so I can’t quite tell what the line

    ul[as.character(testSet[,1])])

    is supposed to be doing. I don’t see the lookup showing up here either.

    1. I guess succinctness is in the eye of the beholder. 🙂

      In the usage example you ask about here, say there is a 12 in column 1 of testSet. The code changes that to '12', and evaluates ul['12'], where ul is the lookup table.
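The same column remapping in Python might look like the following sketch. The contents of `test_set` are made up for illustration (the original testSet is not shown), and `ul` is assumed to be a dict keyed by the character form of the IDs. R's column 1 corresponds to position 0 here.

```python
# Hypothetical lookup table and test set; position 0 of each row holds the original ID
ul = {'5': 1, '12': 2, '13': 3, '9': 4, '6': 5}
test_set = [[12, 0.5], [5, 1.2], [6, 0.7]]

# Convert each ID to its character form, then look it up
new_ids = [ul[str(row[0])] for row in test_set]
print(new_ids)  # [2, 1, 5]
```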

      1. Ah, the lookup variable in my code does the same thing (except I don’t convert the numbers to strings as I don’t need to).

        We can be more scientific about succinctness:
        R code: 99 characters
        Python code (with variable names changed to match the R code): 77 characters.

        I have no intention of claiming one is better, or even more readable (the Python code is infinitely more readable for me, but then again I dream in Python), but one is definitely smaller in multiple ways (fewer statements, fewer characters, fewer type conversions, etc.).
