In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.

Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendant Sections.
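To make the "ancestry" idea concrete, here is a rough sketch of reading a Section's ancestors off the nested markup. It uses Python's standard-library ElementTree, which has no XPath ancestor axis, so a child-to-parent map stands in for it (a fuller XPath implementation would let you ask for `ancestor::*` directly). The sample document is a hypothetical, trimmed-down fragment modeled on the eCFR markup, and the function names are made up for illustration; this is not our production code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the eCFR DIV convention,
# but the headings and numbering are invented for the example.
SAMPLE = """
<DIV1 N="1" TYPE="TITLE">
  <DIV3 N="I" TYPE="CHAPTER">
    <DIV5 N="1" TYPE="PART">
      <DIV8 N="1.1" TYPE="SECTION"><HEAD>Definitions.</HEAD></DIV8>
    </DIV5>
    <DIV5 N="2" TYPE="PART">
      <DIV8 N="2.1" TYPE="SECTION"><HEAD>Scope.</HEAD></DIV8>
      <DIV8 N="2.2" TYPE="SECTION"><HEAD>Purpose.</HEAD></DIV8>
    </DIV5>
  </DIV3>
</DIV1>
"""


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestry(element, parents):
    """Return (TYPE, N) pairs for the element's ancestors, outermost first."""
    chain = []
    node = parents.get(element)
    while node is not None:
        chain.append((node.get("TYPE"), node.get("N")))
        node = parents.get(node)
    return list(reversed(chain))


def sections_under(ancestor):
    """Return the N attribute of every Section nested inside an ancestor."""
    return [el.get("N") for el in ancestor.iter() if el.get("TYPE") == "SECTION"]


root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)

section = root.find(".//DIV8[@N='2.1']")
section_ancestry = ancestry(section, parents)

part_2 = root.find(".//DIV5[@N='2']")
siblings = sections_under(part_2)
```

Both lookups depend on having the whole Title in memory, which is exactly what we gave up when we split the text into Sections.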

The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it to be inefficient to keep opening all of the documents to see whether they had the ancestry in question.

So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.
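For the curious, the idea behind a master table of contents can be sketched as a lookup table keyed by ancestor. Everything here is an assumption made for illustration: the identifiers, the tuple-shaped ancestry paths, and the function name are hypothetical and are not the format Sylvia actually used.

```python
from collections import defaultdict

# Hypothetical ancestry paths, one per section, outermost ancestor first.
SECTION_ANCESTRY = {
    "1.1": ("title-1", "chapter-I", "part-1"),
    "2.1": ("title-1", "chapter-I", "part-2"),
    "2.2": ("title-1", "chapter-I", "part-2"),
    "51.1": ("title-1", "chapter-II", "part-51"),
}


def build_toc(section_ancestry):
    """Index every section under each of its ancestors (each path prefix)."""
    toc = defaultdict(list)
    for section_id, path in section_ancestry.items():
        for depth in range(1, len(path) + 1):
            toc[path[:depth]].append(section_id)
    return dict(toc)


toc = build_toc(SECTION_ANCESTRY)

# One dictionary lookup replaces opening every section document in turn.
part_2_sections = toc[("title-1", "chapter-I", "part-2")]
chapter_i_sections = toc[("title-1", "chapter-I")]
```

The trade-off is the one the purists point out: the same ancestry now lives both in the section documents and in the index, so the two have to be kept in sync.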

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. “What’s the big deal about going from one XML source to another?” asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we’d hoped.

The good news about the new corpus was that it was much (, much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we’re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like “table” elements rather than proprietary typesetting-code-derived markup). The structure of the files is far more accurate as well, which is to say that the boundaries of the structural elements as marked-up match the actual structure of the documents.

The bad news was that the software we’d written to extract the CFR structure from the book XML couldn’t be simply ported over; it had to be rebuilt — both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.

The further bad news — or perhaps we should say self-inflicted wound — was that in the process of rebuilding, we decided to change the way we were processing the text.

Here’s what we changed. In the past, we’d divided book-volume XML into CFR Parts; this time we decided instead to divide into Sections (the smallest units which occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we’re capturing all of the sections and other granules that belong to a Part or Chapter. That is to say, when we’re working with a Section, we now need another way to know which Part it belongs to. And when we’re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we’ve captured all of the text that Part contains.
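Here is a minimal sketch of why Section-sized units help. Each Section is enriched by a function that needs nothing but its own text, so the work can be handed to a pool of workers. The term list, anchor format, and function names are invented for the example, and a thread pool is used only to keep the sketch short; for CPU-heavy enrichment, `ProcessPoolExecutor` from the same module would be the more likely choice.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical defined terms and the anchors their definitions live at.
DEFINED_TERMS = {"agency": "#def-agency", "document": "#def-document"}
TERM_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DEFINED_TERMS) + r")\b",
    re.IGNORECASE,
)


def make_link(match):
    """Wrap one matched term in a link to its definition."""
    term = match.group(1)
    return '<a href="{}">{}</a>'.format(DEFINED_TERMS[term.lower()], term)


def link_terms(section):
    """Enrich a single (section id, text) pair; no other section is needed."""
    section_id, text = section
    return section_id, TERM_PATTERN.sub(make_link, text)


SECTIONS = [
    ("1.1", "Agency means each authority."),
    ("1.2", "A document is filed."),
]

# Sections are independent, so 1.2 does not wait for 1.1 to finish.
with ThreadPoolExecutor(max_workers=4) as pool:
    enriched = dict(pool.map(link_terms, SECTIONS))
```

Note what the sketch leaves out: it assumes every term applies everywhere. Scoping a term to a Part is where the metadata housekeeping described above comes in.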

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it — with a new structure. Suddenly this looked like a much steeper hill to climb.

Next up: Climbing the hill.

Great was the rejoicing in the south tower of Myron Taylor Hall, headquarters of the LII, when we got notice of the bulk release of the Electronic Code of Federal Regulations (eCFR) in XML format.

What was not to like? The data was as up-to-date as the CFR could get, the XML was much cleaner than the book version, it had a friendly user guide, etc., etc., etc.

It was also different enough from the book XML of the CFR that we could not simply run it through our existing data enrichment process and serve it to the public as is. So, we curbed our enthusiasm long enough to put together a measured plan to re-do our code.

We have heard from enough of you, our wonderful readers, to know that text indentation is one of the most valued features of our data presentation. Thus, it was the first feature we chose to implement.

All this verbosity is the set-up for a look at some of the messy sausage-making details of adding indentation to the eCFR.

***

If you’re not familiar with XML (eXtensible Markup Language), it’s simply a way of marking up data with a predefined, consistent set of descriptive tags that are easily readable by both humans and machines. So, when we get XML data from the GPO, it looks something like this…

Snippet 1: XML from Title 1 of the CFR
==============================================
<?xml version="1.0" encoding="UTF-8" ?>
<DLPSTEXTCLASS>
<HEADER>
<FILEDESC>
<TITLESTMT>
<TITLE>
Title 1: General Provisions</TITLE>
<AUTHOR TYPE="nameinv">
</AUTHOR>
</TITLESTMT>
<PUBLICATIONSTMT>
<PUBLISHER>
</PUBLISHER>
<PUBPLACE>
</PUBPLACE>
<IDNO TYPE="title">
1</IDNO>
<DATE></DATE>
</PUBLICATIONSTMT>
<SERIESSTMT>
<TITLE>
</TITLE>
</SERIESSTMT>
</FILEDESC>
<PROFILEDESC>
<TEXTCLASS>
<KEYWORDS>
</KEYWORDS>
</TEXTCLASS>
</PROFILEDESC>
</HEADER>
<TEXT>
<BODY>
<ECFRBRWS>
<AMDDATE>Jan. 30, 2015</AMDDATE>
<DIV1 N="1" NODE="1:1" TYPE="TITLE">
<HEAD>Title 1 – General Provisions–Volume 1</HEAD>
<CFRTOC>
<PTHD>Part </PTHD>
<CHAPTI>
<SUBJECT><E T="04">chapter i</E> – Administrative Committee of the Federal Register </SUBJECT>
<PG>1</PG></CHAPTI>
<CHAPTI>
<SUBJECT><E T="04">chapter ii</E> – Office of the Federal Register </SUBJECT>
<PG>51</PG></CHAPTI>
<CHAPTI>
<SUBJECT><E T="04">chapter iii</E> – Administrative Conference of the United States </SUBJECT>
<PG>301</PG></CHAPTI>
<CHAPTI>
<SUBJECT><E T="04">chapter iv</E> – Miscellaneous Agencies </SUBJECT>
<PG>425
</PG></CHAPTI></CFRTOC>
<DIV3 N="I" NODE="1:1.0.1" TYPE="CHAPTER">
<HEAD> CHAPTER I – ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER</HEAD>
<DIV4 N="A" NODE="1:1.0.1.1" TYPE="SUBCHAP">
<HEAD>SUBCHAPTER A – GENERAL
</HEAD>
<DIV5 N="1" NODE="1:1.0.1.1.1" TYPE="PART">
<HEAD>PART 1 – DEFINITIONS </HEAD>
<AUTH>
<HED>Authority:</HED><PSPACE>44 U.S.C. 1506; sec. 6, E.O. 10530, 19 FR 2709; 3 CFR, 1954-1958 Comp., p.189.
</PSPACE></AUTH>
<DIV8 N="§ 1.1" NODE="1:1.0.1.1.1.0.1.1" TYPE="SECTION">
<HEAD>§ 1.1 Definitions.</HEAD>
<P>As used in this chapter, unless the context requires otherwise – </P>
<P><I>Administrative Committee</I> means the Administrative Committee of the Federal Register established under section 1506 of title 44, United States Code; </P>
<P><I>Agency</I> means each authority, whether or not within or subject to review by another agency, of the United States, other than the Congress, the courts, the District of Columbia, the Commonwealth of Puerto Rico, and the territories and possessions of the United States; </P>
<P><I>Document</I> includes any Presidential proclamation or Executive order, and any rule, regulation, order, certificate, code of fair competition, license, notice, or similar instrument issued, prescribed, or promulgated by an agency; </P>
<P><I>Document having general applicability and legal effect</I> means any document issued under proper authority prescribing a penalty or course of conduct, conferring a right, privilege, authority, or immunity, or imposing an obligation, and relevant or applicable to the general public, members of a class, or persons in a locality, as distinguished from named individuals or organizations; and </P>
<P><I>Filing</I> means making a document available for public inspection at the Office of the Federal Register during official business hours. A document is filed only after it has been received, processed and assigned a publication date according to the schedule in part 17 of this chapter.</P>
<P><I>Regulation</I> and <I>rule</I> have the same meaning. </P>
<CITA TYPE="N">[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]
</CITA>
</DIV8>
</DIV5>
</DIV4>
</DIV3>
</DIV1>
</ECFRBRWS>
</BODY>
</TEXT>
</DLPSTEXTCLASS>

The text of the regulations is enclosed within tags that provide some context for what you’re looking at, have meaning for how it should be displayed, or provide additional metadata that may be useful to the enrichment process.
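To show what "machine readable" buys us, here is a small sketch that pulls a Section's identifier, heading, and paragraphs out of markup shaped like Snippet 1. The fragment is trimmed down, and its attribute values are quoted so that Python's standard-library parser accepts it; the dictionary keys are our own invention for the example, not part of the eCFR schema.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment in the shape of Snippet 1 (attribute values quoted).
FRAGMENT = """<DIV5 N="1" NODE="1:1.0.1.1.1" TYPE="PART">
<HEAD>PART 1 – DEFINITIONS </HEAD>
<DIV8 N="§ 1.1" NODE="1:1.0.1.1.1.0.1.1" TYPE="SECTION">
<HEAD>§ 1.1 Definitions.</HEAD>
<P>As used in this chapter, unless the context requires otherwise – </P>
<P><I>Regulation</I> and <I>rule</I> have the same meaning. </P>
<CITA TYPE="N">[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]
</CITA>
</DIV8>
</DIV5>"""

part = ET.fromstring(FRAGMENT)

sections = []
for div in part.iter("DIV8"):
    if div.get("TYPE") != "SECTION":
        continue
    sections.append({
        "node": div.get("NODE"),                   # position in the hierarchy
        "heading": div.findtext("HEAD").strip(),   # text of the HEAD element
        # itertext() flattens inline markup such as <I> into plain text
        "paragraphs": ["".join(p.itertext()).strip() for p in div.findall("P")],
    })
```

Notice that nothing in the markup says how deeply a paragraph should be indented, which is the problem the rest of this post is about.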

As a first step, we consulted the user guide to see if there was any information on how to indent the text. There was something! On page 13 was this snippet of XML (Figure 1) with the enumeration indicators highlighted. The next page had a suggestion for how that could be displayed (Figure 2).

Figure 1: 5 CFR 151.101 in XML format

Figure 2: Presentation suggested by the Government Publishing Office for 5 CFR 151.101

It was obvious to us, and the user guide itself indicated, that there was no way to achieve this display given just the information from the markup. A good place to look for extra information was within the CFR itself.

We found what we were looking for in Title 1, Section 21.11, which is about how the CFR enumerators are organized, or more accurately, are supposed to be organized. Of particular interest was the hierarchy of paragraphs given by subsection 21.11(h):

(h) Paragraphs, which are designated as follows:
level 1: (a), (b), (c), etc.
level 2: (1), (2), (3), etc.
level 3: (i), (ii), (iii), etc.
level 4: (A), (B), (C), etc.
level 5: (1), (2), (3), etc.
level 6: (i), (ii), (iii), etc.

In our first iteration of indentation, we added attributes to each paragraph defining a depth of indentation corresponding to the 6 levels above. Section 151.101 of Title 5, the example in the user guide pages above, looked lovely. But (you knew it would not be that simple, right?) this implementation worked fine for only about 60% of the random selection of sections we tested it on.
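To give a flavor of the problem, here is one way a depth could be assigned from the six levels above. This is a sketch under our own simplifying assumptions, not the algorithm we deployed: it sees one enumerator per paragraph, assumes each section starts at (a), and resolves ambiguous enumerators such as (i) or (1) by asking whether they open a new child level or continue an already open one. All names are hypothetical.

```python
# Numbering style expected at each of the six levels in 1 CFR 21.11(h).
STYLES = {1: "alpha", 2: "digit", 3: "roman", 4: "ALPHA", 5: "digit", 6: "roman"}
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}


def roman_to_int(text):
    """Convert a lowercase roman numeral to an int, or None if it is not one."""
    if not text or any(ch not in ROMAN for ch in text):
        return None
    values = [ROMAN[ch] for ch in text]
    total = 0
    for pos, val in enumerate(values):
        smaller_than_next = pos + 1 < len(values) and val < values[pos + 1]
        total += -val if smaller_than_next else val
    return total


def ordinal(enum, style):
    """Position of an enumerator within a style (a=1, b=2, ...), or None."""
    if style == "digit":
        return int(enum) if enum.isascii() and enum.isdigit() else None
    if style == "alpha":
        return ord(enum) - ord("a") + 1 if len(enum) == 1 and "a" <= enum <= "z" else None
    if style == "ALPHA":
        return ord(enum) - ord("A") + 1 if len(enum) == 1 and "A" <= enum <= "Z" else None
    return roman_to_int(enum)


def assign_depths(enumerators):
    """Assign a level (1-6) to each enumerator, or None if it cannot be placed."""
    last_seen = {}   # level -> ordinal of the most recent enumerator there
    depth = 0
    depths = []
    for enum in enumerators:
        chosen = None
        # Try the child level first, then each open level from deepest up.
        for level in range(min(depth + 1, 6), 0, -1):
            value = ordinal(enum, STYLES[level])
            if value is None:
                continue
            opens_child = level == depth + 1 and value == 1
            continues = level <= depth and value == last_seen.get(level, 0) + 1
            if opens_child or continues:
                chosen = level
                break
        depths.append(chosen)
        if chosen is not None:
            # Moving back up closes every deeper level.
            last_seen = {lvl: v for lvl, v in last_seen.items() if lvl < chosen}
            last_seen[chosen] = ordinal(enum, STYLES[chosen])
            depth = chosen
    return depths


example = assign_depths(["a", "1", "i", "ii", "A", "B", "2", "b"])
```

Even this toy version has to guess: after (h)(1), an (i) could be the first level-3 paragraph or the next level-1 paragraph, and the sketch simply prefers the child.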

Where the algorithm did not work, the main reason for failure was the presence of multiple enumerators within a single paragraph. In other words, each enumerator should have its own paragraph but not all paragraphs were marked as such.

Snippet 2: XML from 9 CFR 2.1
==============================================
<p>(a)(1) Any person operating or intending to operate as a dealer, exhibitor, or operator of an auction sale, except persons who are exempted from the licensing requirements under paragraph (a)(3) of this section, must have a valid license. A person must be 18 years of age or older to obtain a license. A person seeking a license shall apply on a form which will be furnished by the AC Regional Director in the State in which that person operates or intends to operate. The applicant shall provide the information requested on the application form, including a valid mailing address through which the licensee or applicant can be reached at all times, and a valid premises address where animals, animal facilities, equipment, and records may be inspected for compliance. The applicant shall file the completed application form with the AC Regional Director. </p>

In the snippet above, we have the case where there are 2 enumerators at the beginning of the paragraph. Since our algorithm assumed one enumerator per paragraph, it would only find (a) but not (1). We fixed that in the second iteration.
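A minimal sketch of the idea behind that fix: instead of reading a single enumerator, match the whole run of parenthesized enumerators at the start of the paragraph and then pick each one out of it. The pattern and the function name are assumptions for illustration, not our production patterns.

```python
import re

# A run of one or more "(x)" groups at the very start of the paragraph text.
LEADING_RUN = re.compile(r"^\s*((?:\([0-9A-Za-z]{1,5}\)\s*)+)")
# A single enumerator inside that run; the group captures what is in the parentheses.
ENUMERATOR = re.compile(r"\(([0-9A-Za-z]{1,5})\)")


def leading_enumerators(text):
    """Return every enumerator in the run that opens the paragraph."""
    run = LEADING_RUN.match(text)
    return ENUMERATOR.findall(run.group(1)) if run else []


paragraph = (
    "(a)(1) Any person operating or intending to operate as a dealer, "
    "except persons who are exempted under paragraph (a)(3) of this section, "
    "must have a valid license."
)
found = leading_enumerators(paragraph)
```

Anchoring the match to the start of the paragraph matters: the cross-reference "(a)(3)" in the middle of the sentence looks just like an enumerator but must be left alone.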

In our third iteration, we went after more embedded enumerators (see snippet 3 below) by creating a category for these previously untagged enumerators. We named them nested paragraphs and tagged them as such.

Snippet 3: XML from 8 CFR 103.3
==============================================
<p>(a) <i>Denials and appeals</i> – (1) <i>General</i> – (i) <i>Denial of application or petition.</i> When a Service officer denies an application or petition filed under § 103.2 of this part, the officer shall explain in writing the specific reasons for denial. If Form I-292 (a denial form including notification of the right of appeal) is used to notify the applicant or petitioner, the duplicate of Form I-292 constitutes the denial order.</p>

In the last snippet, the paragraph has 3 enumerators, (a), (1), and (i). We’ve developed a library of patterns that our algorithm uses to find them all. In title 26 alone, we find and tag 13,563 nested paragraphs!
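One pattern of the kind such a library might contain can be sketched as follows. It assumes the shape seen in Snippet 3, where an embedded enumerator follows a dash and is itself followed by an italic heading; the regular expressions and function names are hypothetical and cover only that one shape.

```python
import re

ENUM = r"\([0-9A-Za-z]{1,5}\)"
# The enumerator that opens the paragraph.
LEADING = re.compile(r"^\s*\(([0-9A-Za-z]{1,5})\)")
# An enumerator after an en or em dash, followed by an italic heading.
# (?=...) is a look-ahead: it checks for the <i> without consuming it.
EMBEDDED = re.compile(r"[\u2013\u2014]\s*\(([0-9A-Za-z]{1,5})\)(?=\s*<i>)")
# Split point: the dash that precedes an embedded enumerator.
SPLIT_POINT = re.compile(r"\s*[\u2013\u2014]\s*(?=" + ENUM + r"\s*<i>)")


def find_enumerators(paragraph):
    """Return the leading enumerator plus any embedded ones, in order."""
    lead = LEADING.match(paragraph)
    found = [lead.group(1)] if lead else []
    found.extend(EMBEDDED.findall(paragraph))
    return found


def split_nested(paragraph):
    """Break a paragraph into one piece per enumerator."""
    return [piece.strip() for piece in SPLIT_POINT.split(paragraph)]


text = (
    "(a) <i>Denials and appeals</i> – (1) <i>General</i> – "
    "(i) <i>Denial of application or petition.</i> If Form I-292 "
    "(a denial form) is used, the duplicate constitutes the denial order."
)
enumerators = find_enumerators(text)
pieces = split_nested(text)
```

The parenthetical "(a denial form)" is the kind of false positive the patterns have to avoid; here it is skipped because it is neither a bare enumerator nor followed by an italic heading.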

So, we now have a pretty nice indentation feature that, while not completely finished, is already an improvement over what we were able to do before for the CFR. See 8 CFR 103.3 (a)(1)(iii)(A) and its corresponding eCFR version for an example of this.

We’re putting it on the back burner for now but there is more to come for indentation. For instance, we know from extensive study of the markup that there are actually 8 levels of nesting to be had, not 6. And, we have to provide special handling for sections that do not follow the numbering scheme in 1 CFR 21.11.

We’re grateful for our beta testers and readers. If you come across places where our current indentation scheme does not work, please let us know. In the interim, we’ll be devoting some brain cycles to adding cross references and other links to the eCFR.

Quoting Tom, LII director, who was channeling Frank Wagner, the longest serving Reporter of Decisions for the US Supreme Court,

“The work of a legal publisher is an exercise in serial nitpicking.”

No quibbles with that. I’ve been indulging in some nitpicking. Until this morning, this is what a portion of 26 CFR 1.263A-3 looked like on the eCFR tab:

Enumerator and its enclosing braces are not properly bolded.

Note how the enumerators 1, i, ii, 2, and their enclosing braces are not properly bolded. Not good! Since I am the sort who deems suspicious anyone who posts a Craigslist advertisement with grammatical errors, I can see why someone would take issue with a publisher improperly rendering the bolding on a piece of text. So, it’s fixed.

Now, since I am also basking in the euphoria induced by fixing this presentation problem, I will not bore anyone with the details of negative look-ahead, greedy, non-greedy, capturing, and non-capturing pattern matching with regular expressions in Python, while carrying my laptop, uphill, both ways to and from my office, ….

… there’s another kind of detail that no shop manual goes into but that is common to all machines and can be given here. This is the detail of the Quality relationship, the gumption relationship, between the machine and the mechanic, which is just as intricate as the machine itself. Throughout the process of fixing the machine things always come up, low quality things, from a dusted knuckle to an accidentally ruined “irreplaceable” assembly. These drain off gumption, destroy enthusiasm and leave you so discouraged you want to forget the whole business. I call these things ‘gumption traps’.

– Robert Pirsig, Zen and the Art of Motorcycle Maintenance.

We’re in a gumption trap, and it’s slowing down our eCFR feature rollout.

We’ve met some unexpected challenges (unexpected, except insofar as you expect, in a general way, that all projects have challenges). In a group as small as ours, passing around a cold can be enough to stall a project for a week – or two. Right now, though, one of the three team members who has been working on the eCFR is leaving for a new job, and this will naturally slow things down for the next few months.

We’re still working on stuff: improving our indents, for instance. But some of the new features which we have beautifully, intricately constructed in our heads — our ideas of Quality —  are going to take a while longer to get out into the real world.  We’re down a mechanic.

All projects have difficulties, complications, setbacks, gumption traps. They’re not covered in the shop manual. We spend our days skinning our knuckles on coding details, saying “wait, this regex needed to be capturing and that one needed to be non-capturing!”. It’s nice, at the end, when you’re seeing the beautifully humming final product, to forget them. But they’re an essential part of the process.

In other words, “Motorcycle maintenance gets frustrating. Angering. Infuriating. That’s what makes it interesting.” Something to bear in mind when you start turning the first bolt.

The Code of Federal Regulations (CFR) compiles regulations promulgated by various regulatory agencies, which are part of the executive branch. But that regulatory power is grounded in legislative power. At some point, some Congress, drunk or sober, passed a law enabling the agency to make such a rule.

And sometimes, reading through the laws, you want to know more than what the rule currently is. You want to know where it came from. You want to know…

Who authorized this?!

So, as one of the basic features of the CFR we provide hyperlinks from each regulation to the point in the U.S. Code which provides the basis for its rule-making authority. This week we restored those links to the eCFR text.

Here’s an example. According to federal regulations, schools can’t share your grades with your parents once you’re a grown-up, which is how you managed to keep that D+ in History 101 from Daddy (thank goodness!). But who said they could do that? If you look at the hyperlinks for “authority”, you’ll get to the Family Educational Rights and Privacy Act of 1974 (with various later amendments). So before 1974, that D+ was fair game, which was why Grandaddy grounded Daddy for a month that one time.
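As a rough sketch of the simplest case, here is how a U.S. Code citation in an authority note could be turned into a hyperlink. The pattern handles only citations of the form "44 U.S.C. 1506", the URL layout is assumed for illustration, and the function name is hypothetical; the real markup has many more citation shapes than this.

```python
import re

# Matches citations such as "44 U.S.C. 1506" or "20 U.S.C. 1232g".
USC_CITATION = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(\d+[a-z]?)")

# Assumed URL layout for a U.S. Code section page.
BASE_URL = "https://www.law.cornell.edu/uscode/text"


def link_authorities(text, base_url=BASE_URL):
    """Wrap each recognized U.S. Code citation in a link to that section."""
    def to_link(match):
        title, section = match.group(1), match.group(2)
        return '<a href="{}/{}/{}">{}</a>'.format(
            base_url, title, section, match.group(0)
        )
    return USC_CITATION.sub(to_link, text)


authority = "44 U.S.C. 1506; sec. 6, E.O. 10530, 19 FR 2709; 3 CFR, 1954-1958 Comp., p.189."
linked = link_authorities(authority)
citations = USC_CITATION.findall("20 U.S.C. 1232g")
```

The rest of that authority note (the Executive Order, the Federal Register page, the CFR compilation) is left untouched here; each of those needs its own pattern and its own link target.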

Of course there are a bunch of nitpicky details to take care of in order to mark up the authorities correctly. More on that soon.

 

Sometimes the littlest features have the biggest impact.

Some small tweak can vastly enhance the user experience.

Consider, for example, a small but powerful technology:

the indent.

Why do we indent the eCFR?

Hey, the government doesn’t indent their version?

Why do we?

There are reasons.

In a very complex document, indents help you

read better;

see better;

understand better.

Indents

guide the eye

make visible

the shape of the ideas

and the structure

behind the text.

It’s like in school when they taught you to outline.

(Remember outlines?

(Remember school??))

They taught you to put things in an outline format:

I.

A.

1.

a.

etc.

To help you understand

the structure

of the ideas

you were outlining.

The eCFR has a hierarchical structure.

Its natural, inherent structure contains sections

and subsections.

By laying those out visually

we help guide not only

your eye

but also

your mind.

We indent… so you can read.

 

As our regular readers will know, one of the LII’s most-used collections is the Code of Federal Regulations, which is an online version of the official compilation of the regulations published in the Federal Register.  Our edition has lots of useful features, but we’ve regularly gotten one big complaint: it’s out of date. (Our online text is based on the published book, which can be up to a full year behind.)

Well, those seeking up-to-the-minute regulations (well, up-to-the-last-few-days, anyway) are in luck.  The Office of the Federal Register and the GPO have made available, in bulk,  a machine-readable (XML) text of the eCFR, the Electronic Code of Federal Regulations.  The eCFR is unofficial, but it is very much up to date.

So, in order to satisfy  those of you who have been eager to get the latest in great tasting, less-filling, fresh-from-the-oven regulations, we are going to be rolling out our new edition of the CFR, based on the up-to-date eCFR XML,  as we go.  We have put the bare-bones text up as quickly as possible.  Many of you have told us how much you like the value we add to our editions — rest assured that we  will be adding in all the features you value just as soon as we can adapt them to the new data format. While we’re renovating, “CFR Classic” will continue to be available.

Stay tuned on this blog for details about the process of designing, developing, and implementing those features, as well as announcements when they are up and running.  And let us know what you’d like to see added – we always love to hear from you!