In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.
Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendent Sections.
The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it to be inefficient to keep opening all of the documents to see whether they had the ancestry in question.
So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.