
sara

In 2009, the US House of Representatives agreed to a resolution designating March 14 as “Pi Day”.

So I thought I’d see whether I could rustle up a nice example of pi in the Code of Federal Regulations. I did manage to find the π I was looking for, but it turns out that pi also gives us some examples of why we need to disambiguate.

In the process I also found:

So the story of “pi” in the CFR is a story of disambiguation. When we read (or our software parses) the original text of a particular regulation, it’s reasonably straightforward to tell which of the many “pi”s it’s discussing. We have capitalization, punctuation, formatting, and structural and contextual clues. Once the search engines work their case-folding, acronym normalization, and other magic, we end up with an awful lot of “pi”s and very little context.
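To make the collapse concrete, here is a minimal sketch of what a search-style normalizer does to a handful of "pi" variants. The token list and the `normalize` helper are made up for illustration; they are not taken from any real search engine or from the CFR text.

```python
import re


def normalize(token: str) -> str:
    """Mimic a search index (hypothetical): drop acronym periods, then case-fold."""
    return re.sub(r"\.", "", token).casefold()


# Illustrative tokens only: each looks distinct to a human reader.
tokens = ["pi", "Pi", "PI", "P.I.", "π"]

collapsed = {token: normalize(token) for token in tokens}
distinct_after = set(collapsed.values())

for token, norm in collapsed.items():
    print(f"{token!r} -> {norm!r}")
```

Four visually distinct tokens end up as the same string, and only the Greek letter stays separate. Capitalization and punctuation were the clues, and they are the first things to go.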

Postscript. The real pi was a bit of a letdown: “P = pi (3.14)” – well, close enough.

In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.
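As a rough sketch of what "discover a Section's ancestry" looks like in practice, the snippet below uses Python's standard-library ElementTree. The element and attribute names (`title`, `chapter`, `part`, `section`, `n`) are hypothetical and heavily simplified; the real source schema differs. ElementTree supports only a subset of XPath and has no `ancestor::` axis, so this version builds a parent map instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a full Title document.
TITLE_XML = """
<title n="1">
  <chapter n="I">
    <part n="1">
      <section n="1.1"/>
      <section n="1.2"/>
    </part>
    <part n="2">
      <section n="2.1"/>
    </part>
  </chapter>
</title>
"""

root = ET.fromstring(TITLE_XML)

# ElementTree elements do not know their parents, so map child -> parent once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(elem):
    """Return (tag, n) pairs from the immediate parent up to the root."""
    chain = []
    while elem in parent_of:
        elem = parent_of[elem]
        chain.append((elem.tag, elem.get("n")))
    return chain


section = root.find(".//section[@n='1.2']")
print(ancestors(section))

# All Sections that share an ancestor, here Part 1.
siblings = [s.get("n") for s in root.findall(".//part[@n='1']/section")]
print(siblings)
```

Both lookups depend on having the whole Title tree in memory, which is exactly what is lost once each Section lives in its own document.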

Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendant Sections.

The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it to be inefficient to keep opening all of the documents to see whether they had the ancestry in question.

So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.
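The idea of a master table of contents can be sketched as an inverted index: read each section's breadcrumb once, then answer "which sections belong to this ancestor?" from a single lookup. The identifiers and data layout below are assumptions for illustration, not the format Sylvia actually used.

```python
from collections import defaultdict

# Hypothetical breadcrumbs: section id -> ancestor ids, outermost first.
BREADCRUMBS = {
    "1.1": ["title-1", "chapter-I", "part-1"],
    "1.2": ["title-1", "chapter-I", "part-1"],
    "2.1": ["title-1", "chapter-I", "part-2"],
}


def build_toc(breadcrumbs):
    """Invert section -> ancestors into ancestor -> sections."""
    toc = defaultdict(list)
    for section_id, ancestry in breadcrumbs.items():
        for ancestor in ancestry:
            toc[ancestor].append(section_id)
    return dict(toc)


toc = build_toc(BREADCRUMBS)
print(toc["part-1"])     # sections under Part 1
print(toc["chapter-I"])  # sections under Chapter I
```

The same ancestry is now stored twice, once per section and once in the index, which is the duplication the purists object to.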

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. “What’s the big deal about going from one XML source to another?” asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we’d hoped.

The good news about the new corpus was that it was much (, much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we’re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like “table” elements rather than proprietary typesetting-code-derived markup). The structure of the files is far more accurate as well, which is to say that the boundaries of the structural elements as marked-up match the actual structure of the documents.

The bad news was that the software we’d written to extract the CFR structure from the book XML couldn’t be simply ported over; it had to be rebuilt — both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.

The further bad news — or perhaps we should say self-inflicted wound — was that in the process of rebuilding, we decided to change the way we were processing the text.

Here’s what we changed. In the past, we’d divided book-volume XML into CFR Parts; this time we decided instead to divide into Sections (the smallest units which occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we’re capturing all of the sections and other granules that belong to a Part or Chapter. That is to say, when we’re working with a Section, we now need another way to know which Part it belongs to. And when we’re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we’ve captured all of the text that Part contains.
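A minimal sketch of why section-level granularity helps: each section can be handed to a worker independently, so Section 1.2 does not wait on Section 1.1. The section texts, the term list, and the `[[...]]` marker syntax are placeholders, not our production pipeline. A thread pool keeps the example self-contained; CPU-heavy enrichment would more likely use separate processes.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Placeholder data: section id -> text, plus terms defined for the whole Part.
SECTIONS = {
    "1.1": "Each product must be labeled.",
    "1.2": "No label is required.",
}
DEFINED_TERMS = ["product"]

TERM_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DEFINED_TERMS) + r")\b"
)


def mark_terms(item):
    """Wrap each defined term in a section with a hypothetical [[...]] marker."""
    section_id, text = item
    return section_id, TERM_PATTERN.sub(r"[[\1]]", text)


# Every section is an independent task, so the order of completion does not matter.
with ThreadPoolExecutor() as pool:
    marked = dict(pool.map(mark_terms, SECTIONS.items()))

print(marked)
```

The bookkeeping cost shows up in `SECTIONS` itself: something has to guarantee that it really contains every section of the Part whose definitions are being applied.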

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it — with a new structure. Suddenly this looked like a much steeper hill to climb.

Next up: Climbing the hill.

“Lexicographer: A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson

One of our ongoing projects here at LII is to include definitions in our electronic texts. We’ve had them in the UCC since the mid-1990’s; we put them in the CFR in its earlier version a few years ago. Now, as part of the process of rolling out the eCFR and migrating all our old features to it, we’re working on adding definitions to it as well. And just to make life more complicated, we decided at the same time to ask a team of intrepid M.Eng. students (more about them in a subsequent post) to work on the next generation of enhancements to the feature.

Adding definitions to text is a tricky process, so we’re going to take a deliberate approach to spelling out how this works (that is, milk it for several blog posts). For starters, let’s look today at why this is a difficult task: why it was hard before, and why it’s hard now.

What we aim for is simple: we want every section of the CFR to have all the key legal terms highlighted. Users could click on each defined term, and they would see how the CFR itself defines those terms. (As you probably know, regulators are famous for defining terms in what we might generously call idiosyncratic ways.) But since the CFR is about a gazillion pages long, there was no way we could go through and do all those by hand. It needed to be automated.

Which instantly plunged us into the icy waters of natural language processing (or sort-of-natural language processing, a similar if somewhat understudied field).

Definitions in the CFR are not always conveniently labeled; they do not always say something like “we define the term ‘brontosaurus’ to mean ‘a small plunger to be used on miniature toilets’”. They are tricky to pick out: perhaps not tricky for a human reader (although they can be that, too) but certainly tricky for a Turing Machine, even one with its best tuxedo on and its shoes shined.

Consider the following passage:

“Inspected and passed” or “U.S. Inspected and Passed” or “U.S. Inspected and Passed by Department of Agriculture” (or any authorized abbreviation thereof). This term means that the product so identified has been inspected and passed under the regulations in this subchapter, and at the time it was inspected, passed, and identified, it was found to be not adulterated.

We need Tuxedoed Turing to figure out that the three terms in quotations are all equivalent, and all have the same definition. We need it to figure out if an “authorized abbreviation” is being used. We need it to figure out which subchapter this applies to, and not accidentally apply it too narrowly or too widely. And so forth.

And we need it to figure out all sorts of similar problems that other definitions might have: what term precisely is being defined, which words are the definition, what text the definition applies to.
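For the single passage quoted above, a naive extractor might look like the sketch below. It assumes the definition is introduced by the literal phrase "This term means" and that the terms appear in curly quotes before it. Both assumptions hold for this one passage and fail for many others, which is the whole difficulty. The function name and the output shape are invented for illustration.

```python
import re

PASSAGE = (
    "“Inspected and passed” or “U.S. Inspected and Passed” or "
    "“U.S. Inspected and Passed by Department of Agriculture” "
    "(or any authorized abbreviation thereof). This term means that the "
    "product so identified has been inspected and passed under the "
    "regulations in this subchapter, and at the time it was inspected, "
    "passed, and identified, it was found to be not adulterated."
)


def extract_definition(passage):
    """Split a passage into equivalent terms, definition text, and scope word."""
    head, marker, body = passage.partition("This term means")
    if not marker:
        return None  # no recognizable definition marker
    terms = re.findall(r"“([^”]+)”", head)
    scope = re.search(r"\bin this (\w+)", body)
    return {
        "terms": terms,
        "definition": body.strip().rstrip("."),
        "scope": scope.group(1) if scope else None,
    }


result = extract_definition(PASSAGE)
print(result["terms"])
print(result["scope"])
```

Note what the sketch does not do: it has no idea what counts as an "authorized abbreviation", and it knows the scope is "this subchapter" without knowing which subchapter that is.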

And we made a lot of strides in that direction. The definitions ain’t perfect, but they’re there, in the CFR. Yay, us!

Except, uh, then we had to start all over again. For the eCFR. This time, with shiny new bumpers and a fresh paint job.

Tune in next week for the next part of the story.

It will not have escaped the attention of our regular readers — both of them, as the old joke goes — that things have been quiet around here under the hood. It’s not that we haven’t been working; it’s that we’ve been too deep in elbow grease to type it up for you. We tried, but the third computer we ruined with dripping engine fluids caused a bit of a ruckus. So it’s been quiet.

But we just went at our hands with a brillo pad and fire hose, so we think it’s safe for the moment. And it’s New Year’s Eve, time for resolutions. So we are going to redouble our efforts to give the world a peek at what we’re doing, and what you’ve missed while we’ve been distracted by that dang fuel carburetor.

As a preview, and a public resolution to which we can point when our boss asks why we’re despoiling yet another keyboard, here are some of the things we’re going to be telling you about in the coming weeks.

  • Students’ work on definitions, entity linking, topic modeling, and a sub-site refresh
  • Citation resolution (turns out not every citation is to Marbury v Madison; who knew?)
  • Semantic web-based feature development

And that’s not all. For we remain hard at work, and new things are constantly in the works, and we’ll tell you about those as well. So watch this space!

But, uh, those are nice pants. You might want to slip on these coveralls first.

Conferences give us a chance to see others’ work, share ideas, shift perspective, and re-energize our own work. In the last session of the last day of the 2015 Law Via the Internet conference in Sydney, we were treated to an extraordinary panel entitled “Language, translation, and comparative law: East Asian experience”. You might think this is veering a bit out of our lane, but the project report on the translation work from legal and computer science scholars and builders Yoshiharu Matsuura and Amy Shee had a lot in common with problems we’re contending with in our own shop.

Their description of the problem they were addressing:

“There is an implicit assumption that four jurisdictions in East Asia (China, Japan, Korea and Taiwan) share key concepts of law for various reasons. Some people might believe that these jurisdictions share similar legal culture. However, the reality is that accurate knowledge of the legal systems of four jurisdictions is not widely shared even in East Asia.”

Looking at the combination of the linguistic alignment problems and, as Professor Shee described it, “the other translation problem” (the team in Japan works in Ruby; the team in Taiwan works in PHP), we saw a lot of parallels with our own concerns about terminology conformance and bridging the scripting / scaling divide we run into (M.Eng. students we work with prefer Java, which is not our default).

Over the next few weeks, we’ll be posting more from the conference – more soon!

Not exactly, but we’re headed to the 2015 Law Via the Internet Conference hosted by AustLII in Sydney, Australia and we’re taking a week off from the blog.

Catch you next week!

You were expecting only a couple of people.  And you have a couple bags of chips, some beer and a few soft drinks; that should be plenty.  Somewhere in the back there’s probably some pretzels and seltzer if things get low.  But then the door opens and people keep pouring in.  The living room fills, the kitchen, even the bedroom.  You squeeze through and look out the window, and the entire street is mobbed with people coming over.  No doubt about it: your supplies have crashed.

We know how you feel.

As our regular readers know, we’ve been working on building up the eCFR, so our systems mastermind, Nic, has been keeping a close eye on our load numbers.  But because we’re building things — hammers everywhere, and piles of plywood, and mind the cord please — when our numbers started to spike one evening a few weeks ago, Nic figured that it was some extra overhead from the feature deployment that was running.  Whatever it was, it was heavy: Nic said it was the only time he’d seen Nginx, the web server, run out of available worker processes.

But it wasn’t the deployment.  Nic looked more closely at the server logs, and it was just an enormous spike in traffic.  Not enough to actually crash the site, but enough that not everyone was able to get the information they were looking for.  Nic kept digging.  It was puzzling: the huge wave of hits came at a very specific time, between 9:20 and 9:40 at night.  They were all from different locations, so it wasn’t a crawler run amok.  What was everyone looking for?  Did someone think we had early Star Wars tickets or something?

Well, the clue was in the specific information people were looking for.  Everyone, it seemed, was suddenly interested in the 14th Amendment.  We cross-checked that against a news Twitter stream, and there was the answer.  As it turns out, the Republican presidential debate was that night.  At some point shortly after 9:20, someone mentioned the 14th Amendment.  And the next thing you know, we’re out of chips and pretzels.

But of course we at the LII, being hospitable folk, don’t like to run out of munchies.  So we’ve got a plan to make sure it doesn’t happen again.  Nic is gearing up to implement demand-driven autoscaling.  This will set up an automated monitor which keeps track of our load numbers, and then, if it detects a spike, adds a new machine and balances the workload between them.  Once the traffic slows down, the extra machine can quietly switch back off.
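The decision at the heart of demand-driven autoscaling can be sketched as a small threshold rule. The thresholds, the machine limits, and the load readings below are invented for illustration; this is not Nic's configuration, and a real setup would lean on the hosting platform's own monitoring and scaling tools.

```python
def desired_machines(current, load, high=0.8, low=0.3, floor=1, ceiling=4):
    """Return how many machines to run given the latest load reading (0.0 to 1.0).

    All thresholds and limits are hypothetical defaults.
    """
    if load > high:
        return min(current + 1, ceiling)  # spike: add a machine, up to the cap
    if load < low:
        return max(current - 1, floor)    # quiet: switch an extra machine off
    return current                        # in between: leave things alone


# Made-up readings: a normal evening, a debate-night spike, then the cooldown.
machines = 1
history = []
for load in [0.4, 0.95, 0.5, 0.2]:
    machines = desired_machines(machines, load)
    history.append(machines)

print(history)
```

Keeping a gap between the high and low thresholds stops the monitor from adding and removing a machine on every small wobble in traffic.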

So next time the house fills up, we hope to have an extra tent ready to pitch itself automatically, and a whole tub of munchies in the garage.  Because rest assured, we’re here to welcome you, and no one likes a party without something to snack on.

… there’s another kind of detail that no shop manual goes into but that is common to all machines and can be given here. This is the detail of the Quality relationship, the gumption relationship, between the machine and the mechanic, which is just as intricate as the machine itself. Throughout the process of fixing the machine things always come up, low quality things, from a dusted knuckle to an accidentally ruined “irreplaceable” assembly. These drain off gumption, destroy enthusiasm and leave you so discouraged you want to forget the whole business. I call these things ‘gumption traps’.

– Robert Pirsig, Zen and the Art of Motorcycle Maintenance.

We’re in a gumption trap, and it’s slowing down our eCFR feature rollout.

We’ve met some unexpected challenges (unexpected, except insofar as you expect, in a general way, that all projects have challenges). In a group as small as ours, passing around a cold can be enough to stall a project for a week – or two. Right now, though, one of the three team members who has been working on the eCFR is leaving for a new job, and this will naturally slow things down for the next few months.

We’re still working on stuff: improving our indents, for instance. But some of the new features which we have beautifully, intricately constructed in our heads — our ideas of Quality —  are going to take a while longer to get out into the real world.  We’re down a mechanic.

All projects have difficulties, complications, setbacks, gumption traps. They’re not covered in the shop manual. We spend our days skinning our knuckles on coding details, saying “wait, this regex needed to be capturing and that one needed to be non-capturing!”. It’s nice, at the end, when you’re seeing the beautifully humming final product, to forget them. But they’re an essential part of the process.

In other words, “Motorcycle maintenance gets frustrating. Angering. Infuriating. That’s what makes it interesting.” Something to bear in mind when you start turning the first bolt.

The Code of Federal Regulations (CFR) compiles regulations promulgated by various regulatory agencies, which are part of the executive branch. But that regulatory power is grounded in legislative power. At some point, some Congress (drunk or sober) passed a law enabling the agency to make such a rule.

And sometimes, reading through the laws, you want to know more than what the rule currently is. You want to know where it came from. You want to know…

Who authorized this?!

So, as one of the basic features of the CFR we provide hyperlinks from each regulation to the point in the U.S. Code which provides the basis for its rule-making authority. This week we restored those links to the eCFR text.

Here’s an example. According to federal regulations, schools can’t share your grades with your parents once you’re a grown-up, which is how you managed to keep that D+ in History 101 from Daddy (thank goodness!). But who said they could do that? If you look at the hyperlinks for “authority”, you’ll get to the Family Educational Rights and Privacy Act of 1974 (with various later amendments). So before 1974, that D+ was fair game, which was why Granddaddy grounded Daddy for a month that one time.
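The linking step for an authority note can be sketched roughly as follows. The note text is an illustrative string in the usual "title U.S.C. section" form (FERPA is codified at 20 U.S.C. 1232g), and both the regular expression and the URL pattern are simplifying assumptions. Real authority notes include ranges, public laws, and other forms this sketch ignores.

```python
import re

# Assumed pattern: "<title> U.S.C. <section>", where the section may carry
# a letter suffix and an optional hyphenated part.
USC_CITE = re.compile(r"(\d+)\s+U\.S\.C\.\s+(\d+[a-z]*(?:-\d+)?)")

# Assumed URL layout for U.S. Code sections; treat it as a placeholder.
BASE_URL = "https://www.law.cornell.edu/uscode/text"


def link_authority(note, base=BASE_URL):
    """Return (citation text, target URL) pairs for each U.S.C. cite in a note."""
    return [
        (match.group(0), f"{base}/{match.group(1)}/{match.group(2)}")
        for match in USC_CITE.finditer(note)
    ]


# Illustrative authority note.
note = "20 U.S.C. 1232g, unless otherwise noted."
links = link_authority(note)
print(links)
```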

Of course there are a bunch of nitpicky details to take care of in order to mark up the authorities correctly. More on that soon.