In the fall of 2015, we wrote about a traffic spike that had occurred during one of the Republican primary debates. Traffic on the night of Sept. 16, 2015 peaked between 9 and 10pm, with a total of 47,025 page views during that hour. For the day, traffic totaled 204,905 sessions and 469,680 page views. At the time, that traffic level seemed like a big deal – our server had run out of resources to handle the load, and some of the people who had come to the site had to wait to find the content they were looking for – which, that night, was the 14th Amendment to the Constitution.

A year later, we found traffic topping those levels on most weekdays. But by that time, we barely noticed. Nic Ceynowa, who runs our systems, had, over the course of the prior year, systematically identified and addressed unnecessary performance drains across the website. He replaced legacy redirection software with new, more efficient server redirects. He cached dynamic pages that we knew to be serving static data (because we know, for instance, that retired Supreme Court justices issue no new opinions). He throttled access to the most resource-intensive pages (web crawlers had to slow down a bit so that real people doing research could proceed as usual). As a result, he could allow more worker processes to field page requests, and we could continue to focus on feature development rather than server load.

Then came the inauguration on January 20th. Presidential memoranda and executive orders inspired many, many members of the general public to read the law for themselves. Traffic hovered around 220,000 sessions per day for the first week. And then the President issued the executive order on immigration. By Sunday January 29th, we had 259,945 sessions – more than we expect on a busy weekday. On January 30th, traffic jumped to 347,393. And then on January 31st traffic peaked at 435,549 sessions – and over 900,000 page views.

The servers were still quiet. Throughout, we were able to continue running some fairly resource-hungry updating processes to keep the CFR current. We’ll admit to having devoted a certain amount of attention to checking in on the real-time analytics to see what people were looking at, but for the most part it was business as usual.

Now, the level of traffic we were talking about was still small compared to the traffic we once fielded when Bush v. Gore was handed down in 2000 (that day we had steady traffic of about 4,000 requests per minute for 24 hours). And Nic is still planning to add clustering to our bag of tricks. But the painstaking work of the last year has given us a lot of breathing room – even when one of our fans gives us a really big internet hug. In the meantime, we’ve settled into the new normal and continue the slow, steady work of making the website go faster when people need it the most.

In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.
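
To make that concrete, here’s a minimal sketch of the kind of query that comes for free when you still have a full Title document in hand. The element and attribute names below (a Part as a DIV5, a Section as a DIV8, each carrying an N attribute) are assumptions for illustration rather than a description of the actual schema; the point is simply that XPath hands you both ancestry and shared-ancestor queries.

```python
# Minimal sketch (Python + lxml); element and attribute names are assumed for illustration.
from lxml import etree

tree = etree.parse("title-47.xml")  # a hypothetical full-Title document

# Ancestry of one Section: walk up from the Section node.
section = tree.xpath('//DIV8[@N="1.1"]')[0]
print([(a.tag, a.get("N")) for a in section.iterancestors()])
# e.g. [('DIV5', '1'), ('DIV3', 'I'), ('DIV1', '47')]

# All Sections sharing an arbitrary ancestor: a single XPath expression.
print(tree.xpath('//DIV5[@N="1"]//DIV8/@N'))  # every Section number in Part 1
```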

Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendant Sections.

The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it inefficient to keep opening all of the documents to see whether they had the ancestry in question.

So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.
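
For the curious, here’s a rough sketch of the difference between the two approaches. The file layout, the breadcrumb element, and the shape of the master table of contents are hypothetical stand-ins; the point is that the breadcrumb approach opens every section document, while the master table of contents answers the same question from a single index.

```python
# Rough sketch (Python + lxml); file names and element shapes are hypothetical.
from glob import glob
from lxml import etree

# Approach 1: a breadcrumb embedded in every section document.
# Finding the sections under one ancestor means opening every file.
def sections_by_breadcrumb(ancestor):
    matches = []
    for path in glob("sections/*.xml"):
        crumbs = etree.parse(path).findtext("breadcrumb", default="")
        if ancestor in crumbs:
            matches.append(path)
    return matches

# Approach 2: a master table of contents mapping ancestors to sections.
# One document read, then a direct lookup.
def sections_by_toc(ancestor):
    toc = etree.parse("master-toc.xml")
    return toc.xpath(f'//node[@label="{ancestor}"]//section/@file')
```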

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. “What’s the big deal about going from one XML source to another?” asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we’d hoped.

The good news about the new corpus was that it was much (much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we’re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like “table” elements rather than proprietary typesetting-code-derived markup). The structure of the files is far more accurate as well, which is to say that the boundaries of the structural elements as marked up match the actual structure of the documents.

The bad news was that the software we’d written to extract the CFR structure from the book XML couldn’t be simply ported over; it had to be rebuilt — both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.

The further bad news — or perhaps we should say self-inflicted wound — was that in the process of rebuilding, we decided to change the way we were processing the text.

Here’s what we changed. In the past, we’d divided book-volume XML into CFR Parts; this time we decided instead to divide into Sections (the smallest units which occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we’re capturing all of the sections and other granules that belong to a Part or Chapter. That is to say, when we’re working with a Section, we now need another way to know which Part it belongs to. And when we’re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we’ve captured all of the text that Part contains.
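
As a rough illustration of why that trade is worth it (the worker function, the file layout, and the definitions table below are hypothetical stand-ins, and the string replacement is deliberately naive): once each Section is its own document, a pool of workers can enrich sections independently instead of waiting on one another.

```python
# Rough sketch (Python); files, definitions, and the linking logic are
# hypothetical stand-ins for the real enrichment pipeline.
from glob import glob
from multiprocessing import Pool

DEFINITIONS = {"broadcast station": "https://example.org/defs/broadcast-station"}

def link_definitions(path):
    # Naive illustration: wrap each defined term in a link to its definition.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for term, url in DEFINITIONS.items():
        text = text.replace(term, f'<a href="{url}">{term}</a>')
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

if __name__ == "__main__":
    # Sections are independent documents, so Section 1.2 never waits on Section 1.1.
    with Pool() as pool:
        done = pool.map(link_definitions, sorted(glob("sections/*.xml")))
    print(f"enriched {len(done)} sections")
```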

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it — with a new structure. Suddenly this looked like a much steeper hill to climb.

Next up: Climbing the hill.

Oh, look, the LII left the garage door open. What a clunker; looks like the wheels are about to fall off. Hey, is this the hood release?  I probably shouldn’t… oh, why not. <click /> Let’s see what’s under here…

Hi!  Welcome to our new technical blog, LII: Under the Hood. We’re starting this blog to show you how the features you see on the web site actually work, to give you a peek at our development process, and to let you get to know some of our awesome software engineers in the process.

In future posts we’ll be showing you the things that most give us a sad, sharing details of our technical challenges, previewing new features we’re working on, inviting you to send us feedback, and letting you figure out where all that scary-looking smoke is coming from.

So, welcome – enjoy this peek under the hood.  Make sure to leave us a note when you’re done looking!

Maybe I shouldn’t turn this thing on.  But it’s just this little chromium switch here…