{"id":89,"date":"2016-01-15T15:05:52","date_gmt":"2016-01-15T20:05:52","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/tech\/?p=89"},"modified":"2016-01-22T10:25:05","modified_gmt":"2016-01-22T15:25:05","slug":"re-definition-part-1-2","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/tech\/2016\/01\/15\/re-definition-part-1-2\/","title":{"rendered":"Re-definition, part 1"},"content":{"rendered":"
“Lexicographer:<\/b> A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson<\/p><\/blockquote>\n
One of our ongoing projects here at LII is to include definitions in our electronic texts. We\u2019ve had them in the UCC<\/a> since the mid-1990s; we added them to our earlier version of the CFR a few years ago. Now, as part of the process of rolling out the eCFR and migrating all our old features to it, we\u2019re working on adding definitions there as well. And just to make life more complicated, we decided at the same time to ask a team of intrepid M.Eng. students (more about them in a subsequent post) to work on the next generation of enhancements to the feature.<\/p>\n
Adding definitions to text is a tricky process, so we\u2019re going to
milk it for several blog posts<\/del> take a deliberate approach to spelling out how this works. For starters, let\u2019s look today at why this is a difficult task \u2014 why it was hard before, and why it\u2019s hard now.<\/p>\nWhat we aim for is simple: we want every section of the CFR to have all the key legal terms highlighted. Users can click on each defined term and see how the CFR itself defines it. (As you probably know, regulators are famous for defining terms in what we might generously call idiosyncratic ways.) But since the CFR is about a gazillion pages long, there\u2019s no way we could go through and mark up all those terms by hand. It had to be automated.<\/p>\n
Which instantly plunged us into the icy waters of natural language processing (or sort-of-natural language processing, a similar if somewhat understudied field).<\/p>\n
Definitions in the CFR are not always conveniently labeled; they do not always say something like \u201cwe define the term \u2018brontosaurus\u2019 to mean \u2018a small plunger to be used on miniature toilets.\u2019\u201d They are tricky to pick out \u2014 perhaps not tricky for a human reader (although they can be that, too) but certainly tricky for a Turing Machine, even one with its best tuxedo on and its shoes shined.<\/p>\n
Consider the following passage:<\/p>\n
\u201cInspected and passed\u201d or \u201cU.S. Inspected and Passed\u201d or \u201cU.S. Inspected and Passed by Department of Agriculture\u201d (or any authorized abbreviation thereof). This term means that the product so identified has been inspected and passed under the regulations in this subchapter, and at the time it was inspected, passed, and identified, it was found to be not adulterated.<\/p><\/blockquote>\n
We need Tuxedoed Turing to figure out that the three terms in quotations are all equivalent, and all have the same definition. We need it to figure out whether an \u201cauthorized abbreviation\u201d is being used. We need it to figure out which subchapter this applies to, and not accidentally apply it too narrowly or too widely. And so forth.<\/p>\n
And we need it to figure out all sorts of similar problems that other definitions might have: what term precisely is being defined, which words are the definition, what text the definition applies to.<\/p>\n
And we made a lot of strides in that direction. The definitions ain\u2019t perfect, but they\u2019re there, in the CFR. Yay, us!<\/p>\n
Except, uh, then we had to start all over again. For the eCFR. This time, with shiny new bumpers and a fresh paint job.<\/p>\n
Tune in next week for the next part of the story.<\/p>\n","protected":false},"excerpt":{"rendered":"
“Lexicographer: A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson One of our ongoing projects here at LII is to include definitions in our electronic texts. We\u2019ve had them in the UCC since the mid-1990\u2019s; we put them in the CFR […]<\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/89"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/comments?post=89"}],"version-history":[{"count":2,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/89\/revisions"}],"predecessor-version":[{"id":91,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/posts\/89\/revisions\/91"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/media?parent=89"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/categories?post=89"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/tech\/wp-json\/wp\/v2\/tags?post=89"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}