**See below for some updates; and I don’t follow Clojure’s
development, so the whole post might be out of date by the time you’re
reading it.**

I’ve been casually interested in Lispy languages for a while; i.e. I’d like to learn one and am not going to let the fact that I only know bits of elisp after 15 years of using Emacs deter me. Clojure seems hot and I really like the talks I’ve seen by Rich Hickey, its creator. Simple Made Easy is especially good. Plus, Clojure even has a well-regarded statistics library, Incanter, so awesome.

Anyway, procrastinating tonight, I decided to check out
Incanter’s source code on Github.
I have a really simple method for evaluating open source statistics
packages: find the linear regression function, and look for variables
named xpxi or xtxi and, if they exist, basically avoid the package
(for some reason, these variable names are ubiquitous.). Inverting
the *X’X* matrix is a pretty bad idea–it is a numerically unstable way
of calculating the regression coefficients that (in problems I’ve
worked on) sometimes leads to a non-idempotent projection matrix
*X(X’X)⁻¹X’* (or, using less terminology, *X’X(X’X)⁻¹* may not equal
the identity matrix). Needless to say this results in pretty bad
estimates of the OLS coefficients.
Douglas Bates talks about performance issues in this R-news
article
too, but I’m much more concerned about numeric instability. I don’t
necessarily have the most informed opinion about the best way to get
the OLS estimates, but I’ve gotten good results from the QR
decomposition.

As of today, you can probably guess, Incanter fails this test. The source code and documentation are pretty unconcerned with the actual implementation of OLS, and I can’t figure out exactly what algorithm solve uses (I’m unpersuaded by the claim that it is “equivalent to R’s solve function” and can’t really track it down any further than that part of the code).

These details are important! I mean, I appreciate the effort and the good intentions that goes into developing open source packages like this. But if you’re developing statistical software for other people to use, you really need to understand the numeric properties of the routines you’re writing and you need to transparently communicate that understanding to other people who might use your code. So I guess I’ll stick with R for a while longer.

This post drew a response from an
Incanter user (who wants to stay anonymous, or I’d just quote the
email). In short, the email pointed out that, if you know how Clojure
handles dependencies and libraries, it’s not hard to verify that
Incanter’s solve uses LAPACK’s DGESV from JBLAS to invert the
*X’X* matrix using an LU factorization, which is the exact same
algorithm as R’s solve, so my suspicion there was misplaced.
Great!

Obviously my first reaction to the email was astonishment that
anyone’s read my blog. But I think my original point still stands.
Looking for a variable named `xtxi`

used to estimate OLS is a quick
and dirty way to evaluate a statistics package, because inverting the
*X’X* matrix is numerically unsound compared to other methods of
estimating OLS. R, for example, does not use `solve`

for OLS, it uses
the QR decomposition.

Here’s some R code where the difference matters (I don’t know Clojure,
but this uses the same algorithms). This isn’t quite linear
regression, it’s a comparison of different methods for constructing
projection matrices, *P = X (X’X)⁻¹ X’* (so it’s basically identical
to linear regression). Here are three different methods:

The first inverts *X’X* using the same algorithm in Incanter; the
second uses a slightly better version but is basically the same, and
the third uses the QR decomposition, just like R.

From the mathematical definition, we can see that *P= PP*, a property
called idempotence, which is an easy property to verify numerically.
Here’s a set of 51 observations for 11 regressors (each column is *z*
raised to the *p*th power for *p* = 0, 1, 2,…,10 and *z* between
zero and one).

And now we can “verify” idempotence (up to numerical tolerance)

All of the code is available for download here: https://gist.github.com/grayclhn/5717763

You can see (and verify it for yourself by downloading the code) that
the first two methods of calculating *P*, which invert *X’X* using the
LU factorization **just like Incanter**, are not idempotent. The third
method, which uses the QR decomposition **just like R**, is idempotent.
So in this example, the QR decomposition works and the LU
factorization doesn’t.

This example is obviously contrived, but it’s not isolated. Chapter
11 of Seber and Lee’s (2003) *Linear Regression Analysis* shows
the same thing: that if the regressors are “badly” distributed, the QR
decomposition is more reliable. (In the interest of full disclosure, I
should admit, embarrassing as it is, that chapter 11 of Seber and Lee
is essentially all I know about these issues, so I’m not claiming a
lot of expertise.)

As one last point, let me preempt anyone who might respond, “yes, these issues matter in those particular examples, but that will never come up in real research.” Try to guess why I look to see how a stats package calculates the linear regression coefficients, and why I have this particular criterion that I care about instead of any other, and why….

It’s obvious, right? This is something I’ve personally screwed up
before. An early version of my job-market paper had fantastic
empirical results that turned out to be entirely an artifact of using
*(X’X)⁻¹* to calculate the F-test statistic instead of using the QR
decomposition and the “projection.QR” function in the code example is
copied directly from that project (a later version). I was lucky and
paranoid enough to catch it before circulating the paper but the event
definitely left an emotional impression.

— Gray Calhoun, 10 Apr 2013

Copyright (c) 2014–2015 Gray Calhoun. This document is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License and any source code listed in this document is also licensed under the MIT License.