R illuminated

Earlier this month I wrote about wrapping routines in C for calling from Ruby. A month later I’m writing about having C code called from the R statistical programming language.

I didn’t expect to be here.

On an R project, I needed to solve a bin packing problem. It goes like this. We have grades given by judges. The grades are discrete, not continuous. They come on the interval from zero to ten in half point increments. 8.5 is a grade. 8.25 is not. Nor is 8.52. Nor 8.6.

To evaluate distributions, I decided that this is discrete, not continuous data. It’s ordinal, in fact, because the grades can be averaged together.

We can count the number of times a judge gives a grade. Now we have counts treating the grades as if they are categories. With categorical data a Chi-Squared test can tell us whether the actual counts for each grade (category) match expected counts. In this case, the expected counts come from a theory that the counts are nomally distributed. Counts for the lowest and highest grades will be infrequent. Counts for the mean grade most frequent. We want to test that.

To do the Chi-Squared test requires a minimum count of five in each “category.” We can combine a count of two grades at 5.5 plus three grades at 6.0 and have five grades in the range enclosing 5.5 and 6.0. What we can’t do is combine two grades at 5.5 with three grades at 7.5, ignoring the grades between.

The “normal” treatment of categories allows any combination, because the categories don’t have a natural order. Think of how many “yellow”, “blue”, “red”, and “silver” cars are in the parking lot. Mixing counts of “yellow” and “red” or “yellow” and “blue” are both valid.

Actual grade distributions can look like this:

  1  |  0  |  2  |  6  |  8  |  9  |  8  |  3    counts
 6.0 | 6.5 | 7.0 | 7.5 | 8.0 | 8.5 | 9.0 | 9.5   grades

  1  |  1  |  5  |  2  |  4  |  9  |  6  |  6
 6.0 | 6.5 | 7.0 | 7.5 | 8.0 | 8.5 | 9.0 | 9.5

To group them, I needed code that would produce:

           9           |  8  |  9  |    11       counts
 6.0   6.5   7.0   7.5 | 8.0 | 8.5 | 9.0   9.5   grade groups

        7        |     6     |  9  |  6  |  6
 6.0   6.5   7.0 | 7.5   8.0 | 8.5 | 9.0 | 9.5

by joining neighboring groups having counts less than five.

Going to C

Going to Sea

It’s probably possible to solve that problem using R. It requires trying all of the combinations guided by some heuristic and bounded in some way not to search combinations that are useless. If you read my post about Davenport’s algorithm this will be a familiar strategy.

I felt that tackling the problem with C would be more straightforward. It doesn’t necessarily play to the strengths or R, or my limited facility with the language.

So, like with Davenport’s algorithm, I went to C, reproducing the strategies that worked so well there: building a makefile with Autoconf and Automake, testing first using Cutter. It went swimmingly. The heuristic and bound are a magnitude less complex than with Davenport’s algorithm. In a few days, I had it.

What threw me for a couple of days more was getting R to embrace it and make use of it.

Writing an R extension

There are a handful of resources for binding R to C that turn out to be helpful:

  • R’s C interface from Hadley Wickham, author of Advanced R and testthat.
  • In Search of C/C++ & FORTRAN Routines by Duncan Temple Lang in “R News”, Volume 1/3, September 2001.
  • Writing R Extensions, the official documentation.
  • R Internals, implementation notes for R.
  • The source code for the interface header, which is most readily accessed through R itself:
    rinternals <- file.path(R.home("include"), "Rinternals.h")
    file.show(rinternals)
    

The other thing you run into is Rcpp. Rcpp is the new bread and butter for putting C++ and R together. It’s wildly popular. Even Hadley tells you to use it.

For me, by the time I figured-out that Rcpp exists, I already had the basics of the glue code written, even working. I would be going to Rcpp just to solve a library binding problem that I knew for sure didn’t require it. And more, it’s one C function that I’ve written. All it does is unwrap R data structures, call my code, and wrap-up the result into R data structures.

Why would I resort to a library with 450 C++ source files, 150 R source files, 4,500 commits, and 100 issues just to get my one little function to work? As you see, I feel defensive about it.

My main issue was that R would not recognize my pre_chi_cluster_neighbors C function without manually loading the compiled library. Installing my package didn’t make it available. The rest was almost cake.

As it turned out, the problem was that I had a make file. Nothing I could do, nothing I tried would let that make file produce the library product that the R packaging system wanted to consume. R packaging creates its own make file the way it wants it. It’s no good fighting.

To get where I needed to go, I had to start from scratch, going with the flow, following the R extensions documentation (about thirty pages of tight, detailed prose. Read it three times.) Pretty quickly I had a package that compiled and installed, whose function could be called. A “hello world” type thing. The details will follow.

Now I needed to shoehorn my C development setup into the working R extension setup. Foremost, there could be no make file in the root nor the source directory of the project. In order to make and run the C tests, I put a make file in the test directory. Having an R package install, compile and run the test suite written in C is perhaps an open problem, an opportunity.

Having done that, the log jam was cleared and the logs flowed sweetly down the river. Ah bliss.

Components of an R extension

Watch works

Everybody says, “Read the documentation”. Some people say, look at any of the couple thousand existing R extensions to see how it’s done. The problem with the first is that the documentation is extensive and detailed. A goal-oriented reading will easily miss something important. It’s quite glib to say, RTFM. That’s quite a lot to swallow.

The problem with the second is that there are multiple ways to do it, some of them outdated, and aside from the friction involved– download a tar file, open it up, poke around in the source tree –there’s no real sense that the code you’re looking at is state of the art. To make matters worse, the state of the art, these days, seems to be, “Use Rcpp”.

I digress. Here’s what you have to do:

  • Put a C source file in the src directory of a package you have created from the R interpreter command line. Section 1.1 of the extensions manual tells you all about the structure and content of R packages, but don’t get all fancy just yet. Use
    > package.skeleton("my_simple_package")
    

    from within R, to generate the skeleton on which to add your muscle. If you already have a working R package, the following will still help you. The code fragments are from src/r-prechi.c

  • Have the C source include the R interface declarations
    #include <R.h>
    #include <Rinternals.h>
    
  • Write a function having a signature like this one:
    SEXP pre_chi_cluster_neighbors(
      SEXP grade_values, SEXP grade_counts, SEXP min_size)
    {
    
  • Convert the sexy SEXP parameters into native C using the macros declared in the R interface, e.g.
    const double *grades = REAL_RO(grade_values);
    
  • Do your thing in C
    Prechi *prechi = prechi_create(grades, counts, n);
    prechi_solve(prechi, minimum_count);
    
  • Convert the result into the tree of SEXP values needed for the return of the function. Usually it’s an R list.
    // Set-up the returned list
    SEXP rv = PROTECT(allocVector(VECSXP, 7)); ++prct;
    SEXP names = PROTECT(allocVector(STRSXP, 7)); ++prct;
    SET_STRING_ELT(names, 0, mkChar("count"));
    //...
    SET_STRING_ELT(names, 6, mkChar("solution_variance"));
    setAttrib(rv, R_NamesSymbol, names);
    
    // Set count on returned list
    SEXP count = PROTECT(allocVector(INTSXP, 1)); ++prct;
    INTEGER(count)[0] = part_ct;
    SET_VECTOR_ELT(rv, 0, count);
    
  • Clean-up, free memory, and return
      prechi_destroy(prechi);
      UNPROTECT(prct);
      return rv;
    }
    
  • At the bottom, define a data structure that tells R about your function, and then register it when the package loads:
    static const R_CallMethodDef PreChiEntries[] = {
      {"pre_chi_cluster_neighbors", (DL_FUNC)&pre_chi_cluster_neighbors, 3},
      {NULL, NULL, 0}
    };
    
    void R_init_prechi(DllInfo *dll) {
      R_registerRoutines(dll, NULL, PreChiEntries, NULL, NULL);
      R_useDynamicSymbols(dll, FALSE);
      R_forceSymbols(dll, TRUE); // .Call(pre_chi_cluster_neighbors, ...)
    }
    
  • in the R directory, add an R source file that will act as diplomat. It checks paramaters and otherwise ensures the paperwork is straight before calling the C function. Doing this is a very helpful recommendation from the article by Hadley Wickham. It will look something like this extract of the source, R/preChi.R
    prechi.cluster_neighbors <- function(grades, counts, minimum_count = 5) {
      n <- as.integer(minimum_count)
      ln <- length(n)
      if (1 < ln) {
        warning("Prechi: ", ln,
          " elements given for minimum count, using the first")
        n <- n[1]
      }
    
      # ... more parameter checking, followed by the call into C:
    
      clustered <- .Call(pre_chi_cluster_neighbors, g, c, n)
      if (clustered$count < 3) {
        stop("Prechi: there is no solution with three or more parts")
      }
      clustered
    }
    
  • Check your package, from the command line at the root of your package, with
    $ R CMD check .
    
  • Build it with
    $ R CMD build .
    
  • Check is actually happier when you’re checking the built result
    $ R CMD check prechi_0.0.1.tar.gz
    
  • You can install it with
    $ R CMD INSTALL prechi_0.0.1.tar.gz
    
  • Or from an R environment started in the package root directory, with devtools,
    > devtools::load_all()
    

Primarily, start simple and build incrementally. Start with a C function that simply writes out the input parameters using Rprintf(... and returns one of them. From there, work in parallel with the R wrapper to .Call(... and the C interface. Have the R wrapper show you what’s returned using print and str or your favorite way of displaying R data items.

Work one small step at a time. When everything is working, clean-up all of the print statements.

That’s pretty much it. It all seems so simple, now. However beware. You’ll still have to read the manual.