Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. “What’s the big deal about going from one XML source to another?” asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we’d hoped.

The good news about the new corpus was that it was much (, much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we’re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like “table” elements rather than proprietary typesetting-code-derived markup). The structure of the files is far more accurate as well, which is to say that the boundaries of the structural elements as marked-up match the actual structure of the documents.

The bad news was that the software we’d written to extract the CFR structure from the book XML couldn’t be simply ported over; it had to be rebuilt — both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.

The further bad news — or perhaps we should say self-inflicted wound — was that in the process of rebuilding, we decided to change the way we were processing the text.

Here’s what we changed. In the past, we’d divided book-volume XML into CFR Parts; this time we decided instead to divide into Sections (the smallest units which occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we’re capturing all of the sections and other granules that belong to a Part or Chapter. That is to say, when we’re working with a Section, we now need another way to know which Part it belongs to. And when we’re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we’ve captured all of the text that Part contains.
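To make the trade-off concrete, here is a minimal sketch of the idea, not our actual pipeline. The `Section` record, the `enrich` step, and the sample text are all hypothetical; the only point is that each Section carries its own Part identifier, so sections can be processed independently and regrouped afterward.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    # Hypothetical record: each Section carries its Title and Part,
    # because it no longer lives inside a Part-sized file.
    title: int
    part: str
    number: str
    text: str


def enrich(section):
    # Stand-in for a real enrichment step such as linking defined terms.
    marked = section.text.replace("widget", "[widget]")
    return Section(section.title, section.part, section.number, marked)


def enrich_all(sections):
    # Sections are independent, so Section 1.2 need not wait for 1.1.
    # Executor.map() returns results in input order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(enrich, sections))


def group_by_part(sections):
    # The housekeeping step: rebuild the Part -> Sections index from metadata.
    index = defaultdict(list)
    for s in sections:
        index[(s.title, s.part)].append(s.number)
    return dict(index)


sections = [
    Section(9, "1", "1.1", "A widget is defined here."),
    Section(9, "1", "1.2", "Each widget must be labeled."),
    Section(9, "2", "2.1", "No terms here."),
]
enriched = enrich_all(sections)
parts = group_by_part(enriched)
```

The index built by `group_by_part` is what lets a Part-scoped definition find every Section it should apply to; if a Section is missing its Part metadata, it silently drops out of that index, which is exactly the kind of housekeeping problem described above.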

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it — with a new structure. Suddenly this looked like a much steeper hill to climb.

Next up: Climbing the hill.

“Lexicographer: A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson

One of our ongoing projects here at LII is to include definitions in our electronic texts. We’ve had them in the UCC since the mid-1990s; we put them in the CFR in its earlier version a few years ago. Now, as part of the process of rolling out the eCFR and migrating all our old features to it, we’re working on adding definitions to it as well. And just to make life more complicated, we decided at the same time to ask a team of intrepid M.Eng. students (more about them in a subsequent post) to work on the next generation of enhancements to the feature.

Adding definitions to text is a tricky process, so we’re going to milk it for several blog posts, er, take a deliberate approach to spelling out how this works. For starters, let’s look today at why this is a difficult task: why it was hard before, and why it’s hard now.

What we aim for is simple: we want every section of the CFR to have all the key legal terms highlighted. Users can click on each defined term and see how the CFR itself defines it. (As you probably know, regulators are famous for defining terms in what we might generously call idiosyncratic ways.) But since the CFR is about a gazillion pages long, there’s no way we can go through and mark all of those terms by hand. The process needed to be automated.

Which instantly plunged us into the icy waters of natural language processing (or sort-of-natural language processing, a similar if somewhat understudied field).

Definitions in the CFR are not always conveniently labeled; they do not always say something like “we define the term ‘brontosaurus’ to mean ‘a small plunger to be used on miniature toilets.’” They are tricky to pick out: perhaps not tricky for a human reader (although they can be that, too) but certainly tricky for a Turing Machine, even one with its best tuxedo on and its shoes shined.

Consider the following passage:

“Inspected and passed” or “U.S. Inspected and Passed” or “U.S. Inspected and Passed by Department of Agriculture” (or any authorized abbreviation thereof). This term means that the product so identified has been inspected and passed under the regulations in this subchapter, and at the time it was inspected, passed, and identified, it was found to be not adulterated.

We need Tuxedoed Turing to figure out that the three terms in quotations are all equivalent, and all have the same definition. We need it to figure out if an “authorized abbreviation” is being used. We need it to figure out which subchapter this applies to, and not accidentally apply it too narrowly or too widely. And so forth.

And we need it to figure out all sorts of similar problems that other definitions might have: what term precisely is being defined, which words are the definition, what text the definition applies to.
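For a sense of what the easy part looks like, here is a toy sketch that handles only the passage quoted above. It is not the extractor we actually built: the marker phrase, the scope pattern, and the function name are assumptions chosen for this one example, and real definitions vary far more than this.

```python
import re

# Curly quotes, as they appear in the passage quoted above.
QUOTED = re.compile(r"“([^”]+)”")
# Assumption: this particular definition is introduced by this phrase.
MARKER = "This term means"
# Assumption: a scope hint looks like "this subchapter", "this part", etc.
SCOPE = re.compile(r"\bthis (section|subpart|part|subchapter|chapter|title)\b")


def split_definition(paragraph):
    """Return (terms, definition, scope), or None if the marker is absent."""
    head, sep, tail = paragraph.partition(MARKER)
    if not sep:
        return None
    terms = QUOTED.findall(head)   # every quoted phrase is an equivalent term
    definition = tail.strip()      # everything after the marker phrase
    hint = SCOPE.search(definition)
    scope = hint.group(1) if hint else None
    return terms, definition, scope


PASSAGE = (
    "“Inspected and passed” or “U.S. Inspected and Passed” or "
    "“U.S. Inspected and Passed by Department of Agriculture” "
    "(or any authorized abbreviation thereof). This term means that the "
    "product so identified has been inspected and passed under the "
    "regulations in this subchapter, and at the time it was inspected, "
    "passed, and identified, it was found to be not adulterated."
)

terms, definition, scope = split_definition(PASSAGE)
```

Even here the sketch dodges the hard questions: it ignores the “authorized abbreviation” clause entirely, and the scope it finds is only a hint drawn from the wording of the definition, not a reliable statement of where the definition applies.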

And we made a lot of strides in that direction. The definitions ain’t perfect, but they’re there, in the CFR. Yay, us!

Except, uh, then we had to start all over again. For the eCFR. This time, with shiny new bumpers and a fresh paint job.

Tune in next week for the next part of the story.