Numeric stability in Clojure's statistics package, Incanter

See below for some updates; and I don’t follow Clojure’s development, so the whole post might be out of date by the time you’re reading it.

I’ve been casually interested in Lispy languages for a while; i.e. I’d like to learn one and am not going to let the fact that I only know bits of elisp after 15 years of using Emacs deter me. Clojure seems hot and I really like the talks I’ve seen by Rich Hickey, its creator. Simple Made Easy is especially good. Plus, Clojure even has a well-regarded statistics library, Incanter, so awesome.

Anyway, procrastinating tonight, I decided to check out Incanter’s source code on Github. I have a really simple method for evaluating open source statistics packages: find the linear regression function, and look for variables named xpxi or xtxi and, if they exist, basically avoid the package (for some reason, these variable names are ubiquitous.). Inverting the X’X matrix is a pretty bad idea–it is a numerically unstable way of calculating the regression coefficients that (in problems I’ve worked on) sometimes leads to a non-idempotent projection matrix X(X’X)⁻¹X’ (or, using less terminology, X’X(X’X)⁻¹ may not equal the identity matrix). Needless to say this results in pretty bad estimates of the OLS coefficients. Douglas Bates talks about performance issues in this R-news article too, but I’m much more concerned about numeric instability. I don’t necessarily have the most informed opinion about the best way to get the OLS estimates, but I’ve gotten good results from the QR decomposition.

As of today, you can probably guess, Incanter fails this test. The source code and documentation are pretty unconcerned with the actual implementation of OLS, and I can’t figure out exactly what algorithm solve uses (I’m unpersuaded by the claim that it is “equivalent to R’s solve function” and can’t really track it down any further than that part of the code).

These details are important! I mean, I appreciate the effort and the good intentions that goes into developing open source packages like this. But if you’re developing statistical software for other people to use, you really need to understand the numeric properties of the routines you’re writing and you need to transparently communicate that understanding to other people who might use your code. So I guess I’ll stick with R for a while longer.

Update on July 7, 2013

This post drew a response from an Incanter user (who wants to stay anonymous, or I’d just quote the email). In short, the email pointed out that, if you know how Clojure handles dependencies and libraries, it’s not hard to verify that Incanter’s solve uses LAPACK’s DGESV from JBLAS to invert the X’X matrix using an LU factorization, which is the exact same algorithm as R’s solve, so my suspicion there was misplaced. Great!

Obviously my first reaction to the email was astonishment that anyone’s read my blog. But I think my original point still stands. Looking for a variable named xtxi used to estimate OLS is a quick and dirty way to evaluate a statistics package, because inverting the X’X matrix is numerically unsound compared to other methods of estimating OLS. R, for example, does not use solve for OLS, it uses the QR decomposition.

Here’s some R code where the difference matters (I don’t know Clojure, but this uses the same algorithms). This isn’t quite linear regression, it’s a comparison of different methods for constructing projection matrices, P = X (X’X)⁻¹ X’ (so it’s basically identical to linear regression). Here are three different methods:

projection.LU1 <- function(x) x %*% solve(crossprod(x)) %*% t(x)

projection.LU2 <- function(x) crossprod(t(x), solve(crossprod(x), t(x)))

projection.QR <- function(x) {
  QR <- qr(x)
  tcrossprod(qr.Q(QR)[, QR$pivot[seq_len(QR$rank)], drop = FALSE])
}

The first inverts X’X using the same algorithm in Incanter; the second uses a slightly better version but is basically the same, and the third uses the QR decomposition, just like R.

From the mathematical definition, we can see that P= PP, a property called idempotence, which is an easy property to verify numerically. Here’s a set of 51 observations for 11 regressors (each column is z raised to the pth power for p = 0, 1, 2,…,10 and z between zero and one).

X <- outer(seq(0, 1, 0.02), 0:10, "^")

And now we can “verify” idempotence (up to numerical tolerance)

> all.equal(projection.LU1(X), projection.LU1(X) %*% projection.LU1(X))
[1] "Mean relative difference: 0.0002737938"

> all.equal(projection.LU2(X), projection.LU2(X) %*% projection.LU2(X))
[1] "Mean relative difference: 0.0001990939"

> all.equal(projection.QR(X), projection.QR(X) %*% projection.QR(X))
[1] TRUE

All of the code is available for download here: https://gist.github.com/grayclhn/5717763

You can see (and verify it for yourself by downloading the code) that the first two methods of calculating P, which invert X’X using the LU factorization just like Incanter, are not idempotent. The third method, which uses the QR decomposition just like R, is idempotent. So in this example, the QR decomposition works and the LU factorization doesn’t.

This example is obviously contrived, but it’s not isolated. Chapter 11 of Seber and Lee’s (2003) Linear Regression Analysis shows the same thing: that if the regressors are “badly” distributed, the QR decomposition is more reliable. (In the interest of full disclosure, I should admit, embarrassing as it is, that chapter 11 of Seber and Lee is essentially all I know about these issues, so I’m not claiming a lot of expertise.)

As one last point, let me preempt anyone who might respond, “yes, these issues matter in those particular examples, but that will never come up in real research.” Try to guess why I look to see how a stats package calculates the linear regression coefficients, and why I have this particular criterion that I care about instead of any other, and why….

It’s obvious, right? This is something I’ve personally screwed up before. An early version of my job-market paper had fantastic empirical results that turned out to be entirely an artifact of using (X’X)⁻¹ to calculate the F-test statistic instead of using the QR decomposition and the “projection.QR” function in the code example is copied directly from that project (a later version). I was lucky and paranoid enough to catch it before circulating the paper but the event definitely left an emotional impression.

— Gray Calhoun, 10 Apr 2013

Copyright (c) 2014–2015 Gray Calhoun. This document is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License and any source code listed in this document is also licensed under the MIT License.

Home