{"id":92,"date":"2016-01-22T12:47:05","date_gmt":"2016-01-22T17:47:05","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/tech\/?p=92"},"modified":"2016-02-26T14:36:13","modified_gmt":"2016-02-26T19:36:13","slug":"re-definition-part-2-some-hills-look-steeper-once-you-get-close-up","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/tech\/2016\/01\/22\/re-definition-part-2-some-hills-look-steeper-once-you-get-close-up\/","title":{"rendered":"Re-definition, part 2: Some hills look steeper once you get close up"},"content":{"rendered":"

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. \u201cWhat\u2019s the big deal about going from one XML source to another?\u201d asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we\u2019d hoped.<\/p>\n

The good news about the new corpus was that it was much (, much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we\u2019re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like \u201ctable\u201d elements rather than proprietary typesetting-code-derived markup). The structure of the files is also far more accurate: the boundaries of the structural elements, as marked up, match the actual structure of the documents.<\/p>\n

The bad news was that the software we\u2019d written to extract the CFR structure from the book XML couldn\u2019t be simply ported over; it had to be rebuilt \u2014 both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.<\/p>\n

The further bad news \u2014 or perhaps we should say self-inflicted wound \u2014 was that in the process of rebuilding, we decided to change the way we were processing the text.<\/p>\n

Here\u2019s what we changed. In the past, we\u2019d divided book-volume XML into CFR Parts; this time we decided instead to divide it into Sections (the smallest units that occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we\u2019re capturing all of the sections and other granules that belong to a Part or Chapter. When we\u2019re working with a Section, we now need another way to know which Part it belongs to. And when we\u2019re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we\u2019ve captured all of the text that Part contains.<\/p>\n
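To make the trade-off concrete, here is a minimal sketch in Python. It is not our production code: the Section record, the function names, and the sample text are all made up for illustration. It shows Sections carrying their own parent metadata, an index from each Part to its Sections, and a parallel marking pass that checks it covered every Section in the Part.<\/p>\n

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    # Hypothetical granule: a Section plus the metadata needed to find its parents.
    title: int
    part: int
    number: str
    text: str


def mark_term(section, term):
    # Stand-in for a text-enrichment step: wrap each occurrence of the term.
    marked = section.text.replace(term, '[' + term + ']')
    return Section(section.title, section.part, section.number, marked)


def index_by_part(sections):
    # Metadata housekeeping: record which Sections belong to each (title, part).
    index = defaultdict(list)
    for section in sections:
        index[(section.title, section.part)].append(section.number)
    return dict(index)


def mark_part(sections, title, part, term):
    # Enrich only the Sections inside the scope of one Part, in parallel.
    in_scope = [s for s in sections if (s.title, s.part) == (title, part)]
    with ThreadPoolExecutor() as pool:
        marked = list(pool.map(lambda s: mark_term(s, term), in_scope))
    # Completeness check: every Section indexed under the Part was processed.
    expected = index_by_part(sections).get((title, part), [])
    assert [s.number for s in marked] == expected
    return marked


# Made-up sample data, not real CFR text.
sections = [
    Section(1, 1, '1.1', 'A widget is a device.'),
    Section(1, 1, '1.2', 'Each widget must be labeled.'),
    Section(1, 2, '2.1', 'A widget under this part is exempt.'),
]
marked = mark_part(sections, 1, 1, 'widget')
```

In this sketch no Section waits on another, which is the parallelism gain; the price is the index, which has to be built and checked separately because a Section no longer sits inside its Part in the file.<\/p>\n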

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.<\/p>\n

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it \u2014 with a new structure. Suddenly this looked like a much steeper hill to climb.<\/p>\n

Next up: Climbing the hill.<\/p>\n","protected":false},"excerpt":{"rendered":"

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. \u201cWhat\u2019s the big deal about going from one XML source to another?\u201d asked, among other people, our director. As it turns out, it was a bit […]<\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[322,6],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/92"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/comments?post=92"}],"version-history":[{"count":5,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/92\/revisions"}],"predecessor-version":[{"id":98,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/92\/revisions\/98"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/media?parent=92"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/categories?post=92"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/tags?post=92"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}