I have had a miserable couple of days, here at the keyboard, working through the effects of the Great LII Outage of 2010. I spent a lot of time on repairs, and on measures that sharply decrease the chances of another. But this is the Internet, after all, and a highly complex system, and we know that sooner or later this will happen again. We had a good run. The last unintended outage we had was about six years ago. We experience slowdowns two to four times a year, usually the result of some perfect storm of network traffic that confuses our clustering software, or of a fault in the database back end. But nothing like this last one, ever, and I am hoping that a decade will pass before there is another. It went on for a little over 48 hours.
Chances are that the next outage, like this one, will be self-inflicted.
We brought this on ourselves. We assumed that there was such a thing as an innocent change in a heavily-used system as big and complicated as ours. There isn't, and we should have anticipated that. We should have had an easier way to back the changes out once they were in place. We should have been more methodical in our diagnosis. What followed was the predictable result of hubris, confusion, and a really bizarre technical problem... but it's not my point to talk about that here. We'll fix the technical stuff and put all kinds of traps and wires in place to prevent a recurrence, and we'll change our deployment procedures. Next time, we'll say more to our users about what happened, and we'll say it sooner.
[ Geek note: for those interested in root causes, it turns out that Perl doesn't deal with tail recursion very well, especially inside mod_perl, and that a 750 ms change in the time it takes to generate a dynamic page can bring a site to its knees, even if it's running on a good-sized cluster. Also, if you change a lot of content, the reaction from crawlers is just indescribable. "Feeding frenzy" doesn't even come close.]
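For the curious, here is a rough back-of-the-envelope sketch of why a 750 ms slowdown hurts so much. It uses Little's law (average requests in flight = arrival rate times time per request). Only the 750 ms figure comes from the note above; the request rate and the baseline page time are made-up numbers chosen for illustration, not measurements from our cluster.

```python
def workers_needed(requests_per_second: float, seconds_per_page: float) -> float:
    """Little's law: average requests in flight = arrival rate x time per request."""
    return requests_per_second * seconds_per_page


# Hypothetical numbers for illustration; only the 750 ms increase is from the post.
RATE = 40.0       # dynamic page requests per second (assumed)
BASELINE = 0.25   # seconds to generate a page before the change (assumed)
SLOWDOWN = 0.75   # extra seconds per page after the change (from the post)

before = workers_needed(RATE, BASELINE)
after = workers_needed(RATE, BASELINE + SLOWDOWN)

print(f"busy workers before: {before:.0f}, after: {after:.0f}")
```

Under those assumed numbers the same traffic ties up four times as many server processes. If the pool of processes is capped below that, requests queue up and everything slows further.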
Again, the point of this post is not to review the usual lessons learned, but to point out some others. Mostly these are about people.
We have a remarkably loyal and patient group of users. I talked to, or e-mailed with, a number of them over the last few days (yes, it's often me who answers the phone; we keep telling you guys that there are only five of us here, and that number does not include a receptionist. I still owe some e-mail responses, and will for a few days yet). All were courteous; all told us how much they depend on us; all wanted us back online five minutes ago. And this is probably a good time to thank all of you from firms and libraries who tweeted or wrote us to say that WEXIS is no better, at least in your institutions.
Many who called or wrote were worried that higher education budget cuts had put us off the air for good. Nope, not so. I have to say that the relief these people expressed (often with an explosive "oh thank God") was probably the brightest spot of the last few days; we felt really appreciated. We get core support -- about two-thirds of our budget last year, hopefully less this year -- from the Cornell Law School. While we are hardly central to their mission of providing legal education, they have been, and continue to be, generous in their support. We are working to reduce our dependency on the School, but it will be a few years before we are fully self-sustaining.
But I think the most interesting contact came from someone in the far reaches of a large organization (I won't say where in order to protect the innocent, and some of the guilty, too -- we'll call him Fred). Fred was very worried about the outage, because some months ago he had recommended that we be made the standard go-to source for US statutes within his extended workgroup. Apparently Fred has taken a good bit of flak for that decision. The critics are, he says, much more vocal at the times of year when we run fundraising notices.
Fred just wanted to know what to expect, and to get some kind of a track record on our outages so he could answer his critics. I cannot imagine a more loyal advocate than this guy. I would guess there are a lot more like him out there; I sure hope there are. No doubt they will be hearing about this from their co-workers, too, and I'm sorry for that (repeat after me: First time in six years. Two to four slowdowns a year. Fewer once we have stuff in the cloud, slated for this summer). Fred, and all of you like him: my thanks for your belief in us, and your advocacy on our behalf.
Fred's co-workers, well... them I'm a little less happy with. We have 90,000 visitors each and every day of the week. I have no idea what the aggregate number has been over the last several years, but it's certainly a lot more than 90,000. We have 6,000 active donors. That is a lot of free riders. I think a fundraising solicitation that pops up no more than once during your visits in the months of December and June (assuming you click the thing that turns it off after the first time) is not a heavy price to pay. I don't think I've ever seen a sarcastic review of the LII that says it's worth every penny you pay for it, and I hope I never do; nor do I mean to suggest that those who don't pay for a service are barred from criticism. Far from it. I hope they'll write to us directly and tell us what it is they would like us to improve. Oh, and about the insufferable burdens of being asked to contribute, too.
We have deliberately chosen to avoid give-money-or-we-shoot-the-dog appeals of the kind used by many advocacy organizations, despite the fact that most fundraisers find them highly effective. I think they are unbearably shrill, and as much about manufacturing crisis as solving problems. That's why we won't be turning the servers off once a year to make a point, I guess. Besides, that would be childish.
But I have to say that it looked awfully tempting along about hour 17 of the outage, when Fred's e-mail came in and Dan Nagy and I were rewriting code and juggling servers on our noses. Picture Tom, with a little devil perched on his shoulder, whispering: Pssst... you know... we could turn the lights out for 24 hours every year, predictably and with advance notice. Maybe on Bentham's Birthday.... It's rumored that Paul Ginsparg pulled a stunt like that with the physics arXiv when he was still at Los Alamos. Tempting, so tempting, especially after the Tim Stanley diet-cola-consumption limit is only a distant memory and you've lapsed into twitchy irritability.
All of this to say that the psychological dimensions of something like this outweigh the technical ones, at least for us. There is, of course, the usual set of platitudes about doing things better -- all of them severely devalued in this year of our Lord 2010 by having them pushed into our faces in prime time by Domino's ("our pizza sucks but we're fixing it") and Toyota ("you've always trusted us and naturally we're fixing your cars so you don't hit the guardrail at 75 MPH"). Well, our pizza doesn't suck. And we are fixing the brakes. And we are very, very grateful to those of you who have borne with us through this. It'll happen again, but with luck and (mostly) skill, it won't happen soon.
Oh, and a final word: there is a very special place in Hell reserved for people who have put up web crawlers and have no idea how to operate them. The commercial indexers like Google, Yahoo and their ilk are actually quite respectful of robots.txt files, offer rate-limiting apparatus, and so on. The horde of people who have put up search appliances on college campuses and elsewhere without any idea of the effect they're having on the world are another matter. I wish them an eternity staked out under a heavy, random shower of red-hot air-gun pellets; that seems about right.
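For anyone running one of those appliances, here is a minimal sketch of what "respectful" can look like, using Python's standard `urllib.robotparser`. The robots.txt text, the `campus-search-appliance` agent name, and the example.org URLs are all hypothetical stand-ins, and a real crawler would fetch the site's actual robots.txt and sleep between requests rather than just computing a plan.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_plan(parser, agent, urls, default_delay=5.0):
    """Return the URLs the agent may fetch and the delay (seconds) to wait between them."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        # No Crawl-delay given: fall back to an assumed conservative pause.
        delay = default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, float(delay)


rules = load_rules(ROBOTS_TXT)
allowed, delay = polite_plan(
    rules,
    "campus-search-appliance",  # hypothetical user-agent name
    ["https://example.org/uscode/", "https://example.org/cgi-bin/search"],
)
print(allowed, delay)
```

The point is just that the two checks (am I allowed here, and how long should I wait) are a few lines of standard-library code, so there is little excuse for skipping them.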