
In the fall of 2015, we wrote about a traffic spike that had occurred during one of the Republican primary debates. Traffic on the night of Sept. 16, 2015, peaked between 9 and 10 p.m., with 47,025 page views during that hour. For the day, traffic totaled 204,905 sessions and 469,680 page views. At the time, that level of traffic seemed like a big deal – our server had run out of resources to handle it, and some of the people who had come to the site had to wait to find the content they were looking for – in that case, the 14th Amendment to the Constitution.

A year later, we found traffic topping those levels on most weekdays. But by that time, we barely noticed. Nic Ceynowa, who runs our systems, had, over the course of the prior year, systematically identified and addressed unnecessary performance drains across the website. He replaced legacy redirection software with new, more efficient server redirects. He cached dynamic pages that we knew to be serving static data (because we know, for instance, that retired Supreme Court justices issue no new opinions). He throttled access to the most resource-intensive pages (web crawlers had to slow down a bit so that real people doing research could proceed as usual). As a result, he could allow more worker processes to field page requests, and we could continue to focus on feature development rather than server load.

Then came the inauguration on January 20th. Presidential memoranda and executive orders inspired many, many members of the general public to read the law for themselves. Traffic hovered around 220,000 sessions per day for the first week. And then the President issued the executive order on immigration. By Sunday, January 29th, we had 259,945 sessions – more than we expect on a busy weekday. On January 30th, traffic jumped to 347,393. And then, on January 31st, traffic peaked at 435,549 sessions – and over 900,000 page views.

The servers were still quiet. Throughout, we were able to continue running some fairly resource-hungry updating processes to keep the CFR current. We’ll admit to having devoted a certain amount of attention to checking in on the real-time analytics to see what people were looking at, but for the most part it was business as usual.

Now, the level of traffic we were talking about was still small compared to the traffic we once fielded when Bush v. Gore was handed down in 2000 (that day we had steady traffic of about 4,000 requests per minute for 24 hours). And Nic is still planning to add clustering to our bag of tricks. But the painstaking work of the last year has given us a lot of breathing room – even when one of our fans gives us a really big internet hug. In the meantime, we’ve settled into the new normal and continue the slow, steady work of making the website go faster when people need it the most.

A bit over a year ago, we released the first iteration of our new version of the eCFR, the Office of the Federal Register’s unofficial compilation of the current text of the Code of Federal Regulations. At the time, we’d been using the text of the official print version of the CFR to generate our electronic version – it was based on GPO’s authenticated copy of the text, but it was woefully out of date because titles are prepared for printing only once a year. During the past year, while retrofitting and improving features like indentation, cross references, and definitions, we maintained the print-CFR in parallel so that readers could, if they chose, refer to the outdated-but-based-on-official-print version.

This week we’re discontinuing the print-CFR. The reason? Updates. As agencies engage in rulemaking activity, they amend, revise, add, and remove sections, subparts, parts, and appendices. During the past year, the Office of the Federal Register has published thousands of such changes to the eCFR. These changes will eventually make their way into the annual print edition of the CFR, but most of the time, the newest rules making the headlines are, at best, many months away from reaching print.

What’s new? Well, among those thousands of changes were a number of places where agencies were adding rules reflecting new electronic workflows. And these additions provide us with an occasion for checking every facet of our own electronic workflows. When the Citizenship and Immigration Service added the Electronic Visa Update System, they collected the existing sections in Part 215 of Title 8 of the CFR into a new Subpart A and added sections 215.21-215.24 under a new Subpart B. So, after adding the new sections, the software had to refresh the table of contents for Part 215 and create the tables of contents for the new 8 CFR Part 215 Subparts A and B.

What you’ll see doesn’t look a whole lot different from what’s been there for the past year, but it will be a lot easier to find new CFR sections, the pages will load faster, and we will be able to release new CFR features more quickly.

On December 5th, LII engineers Nic Ceynowa and Sylvia Kwakye, Ph.D., looked on with pride as the Cornell University Master of Engineering students they’d supervised presented a trio of fall projects to LII and Cornell Law Library staff.

Entity Linking

Mutahir Kazmi and Shraddha Vartak pulled together, enhanced, and scaled a group of applications that link entities in the Code of Federal Regulations. Entity linking is a set of techniques that detect references to things in the world (such as people, places, animals, or pharmaceuticals) and link them to data sources that provide more information about them. The team analyzed the entities and the corpus in order to determine which entities required disambiguation, distinguished entities to mark before and after defined-term markup, and used Apache Spark to speed up the overall application by 60%.
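To give a flavor of the idea (though not of the students’ Spark pipeline), here is a minimal dictionary-based sketch; the entity names and URIs are invented for illustration:

```python
# A minimal dictionary-based sketch of entity linking -- not the students'
# Spark pipeline. The entity names and URIs below are made up for illustration.
import re

ENTITIES = {
    "Food and Drug Administration": "https://example.org/agency/fda",
    "acetaminophen": "https://example.org/substance/acetaminophen",
}

def link_entities(text: str) -> str:
    """Wrap each known entity mention in a link to a data source describing it."""
    for name, uri in ENTITIES.items():
        text = re.sub(rf"\b{re.escape(name)}\b",
                      rf'<a href="{uri}">\g<0></a>', text)
    return text

print(link_entities("The Food and Drug Administration sets limits for acetaminophen."))
```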

[Screenshot: linked entities highlighted in the CFR]


US Code Definition Improvement

Khaleel Khaleel, Pracheth Javali, Ria Mirchandani, and Yashaswini Papanna took on the task of adapting our CFR definition extraction and markup software to meet the unique requirements of the US Code. In addition to learning the hierarchical structure and identifier schemes within the US Code corpus, the project involved discovering and extracting definition patterns that had not previously been identified; parsing multiple defined terms, word roots, and abbreviations from individual definitions; and correctly detecting the boundaries of the definitions.

Before:

[Screenshot: US Code definition markup before the improvements]

And after:

[Screenshot: US Code definition markup after the improvements]

Search Prototype

Anusha Morappanavar, Deekshith Belchapadda, and Monisha Pavagada Chandrashekar built a prototype of the semantic search application using ElasticSearch and Flask. In addition to learning how to work with ElasticSearch, they had to learn the hierarchical structure of the US Code and CFR, understand how cross-references work within legal corpora, and make use of additional metadata such as the definitions and linked entities the other groups had been working on. Their work will support a search application that distinguishes matches in which the search term is defined, appears in the full text, or appears in a definition of a term that appears within the full text of a document.
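As a rough illustration of the kind of query such an application might run (a sketch only; the index name and field names here are assumptions, not the students’ schema):

```python
# Minimal sketch of a boosted Elasticsearch query: matches on a defined-term
# field score above full-text matches, which score above matches inside a
# definition. Index name and field names are assumptions for illustration.
import requests

term = "navigable waters"
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"defined_terms": {"query": term, "boost": 3}}},  # term is defined here
                {"match": {"text":          {"query": term, "boost": 2}}},  # term in the full text
                {"match": {"definitions":   {"query": term, "boost": 1}}},  # term inside a definition
            ]
        }
    }
}
resp = requests.post("http://localhost:9200/cfr_sections/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("citation"))
```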

[Screenshot: the search prototype]

We’ll be rolling out the features supported by this semester’s M.Eng. projects starting with entity linking in January.

“There is nothing like looking, if you want to find something.”

– J.R.R. Tolkien

There…

This summer, at close to the very last minute, I set out for Cambridge, Massachusetts to pursue a peculiar quest for open access to law. Steering clear of the dragon on its pile of gold, I found some very interesting people in a library doing something in some ways parallel, and in many ways complementary, to what we do at LII.

At the Harvard Law School Library, there’s a group called the Library Innovation Lab (LIL), which uses technology to improve preservation and public access to library materials, including digitizing large corpora of legal documents. Its work complements what we do at the LII, and I went there to develop some tools that would be of help to us both and to others.

The LIL summer fellowship program that made this possible brought together a group with wide-ranging interests, spanning substantive areas, such as Neel Agrawal’s website on the history of African drumming laws and Muira McCammon’s research on the Guantanamo detainee library; crowdsourced documentation and preservation projects, such as Tiffany Tseng’s Spin and Pix devices, Alexander Nwala’s Local Memory, and Ilya Kreymer’s Webrecorder; and infrastructure projects, such as Jay Edwards’s Caselaw Access Project API.

My project involved work on a data model to help developers make connections between siloed collections of text and metadata, with the aim of helping future developers automate the process of connecting concepts in online legal corpora (both the LIL’s and ours at the LII) to enriching data and context from multiple sources.

The work involved exploring a somewhat larger-than-usual number of ontologies, structured vocabularies, and topic models. Each, in turn, came with one or more sets of subjects. Some (like Eurovoc and the topic models) came with sizable amounts of machine-readable text; others (like Linked Data For Libraries) came with very little machine-accessible text. As I came to understand the manageable, as well as the insurmountable, challenges associated with each one, I developed a far greater appreciation for the intuition that had led me to the project in the first place: there is a lot of useful information locked in these resources; each has a role to play.

In the process, I drew enormous inspiration from the dedication and creativity of the LIL group, from Paul Deschner’s Haystacks project, which provides a set of filters to create a manageable list of books on any subject or search term, to Brett Johnson’s work supporting the H2O open textbook platform, to Matt Phillips’s exploration of private talking spaces, to the Caselaw Access Project visualizations such as Anastasia Aisman’s topic mapping and Jack Cushman’s word clouds (supported by operational, programming, and metadata work from Kerri Fleming, Andy Silva, Ben Steinberg, and Steve Chapman). (All of this is thanks to the Harvard Law Library leadership of Jonathan Zittrain, LIL founder Kim Dulin, managing director Adam Ziegler, and library director Jocelyn Kennedy.)

And back again…

Returning home to the LII, I’m grateful to have the rejuvenating energy that arises from talking to new people, observing how other high-performing groups do their work, and having had the dedicated time to bring a complicated idea to fruition. All in all, it was a marvelous summer with marvelous people. But they did keep looking at me as if to ask why I’d brought along thirteen dwarves, and how I managed to vanish any time I put that gold ring on my finger.

I just got back from the 2016 CALI conference at the Georgia State University College of Law in Atlanta, Georgia. This report of my time there is by no means an exhaustive or even chronological record of the conference. It's more of a highlight reel.

[Banner: CALI 2016 – The Year of Learning Dangerously]

This was my second time attending and it still holds the title as my favorite conference. The food was great, the talks were excellent and there was a lot of time between sessions to have interesting conversations with many of the diverse and smart attendees who came from all over North America. Kudos to the organizers.

The conference officially started on Thursday, June 16th, when Indiana Jones, aka John Mayer, executive director of CALI, found the golden plaque of CALI after a harrowing traversal of the conference room, dodging obstacles. He gave a brief but warm welcome address and introduced the keynote speaker, Hugh McGuire, founder of PressBooks and LibriVox.org. With anecdotes from his own story, Mr. McGuire encouraged us to be proactive in solving big problems.

We had another keynote speaker on Friday: Michael Feldstein of Mindwires Consulting, co-producer of e-Literate TV. He confessed to being something of a provocateur, and he succeeded in raising a few hackles by asking, "Do law schools exist?" and "To what extent is your institution a school, versus a filtering mechanism tied to a self-study center?", among other questions.

He then challenged us to do better at teaching students with different learning styles and skill sets.

My two favorite presentations out of many excellent sessions were "The WeCite Project" by Pablo Arredondo from Casetext and "So you've digitized U.S. caselaw, now what?" by Adam Ziegler and Jack Cushman from the Harvard Library Innovation Lab.

Pablo described teaching students to be their own legal shepherds by gamifying the creation and categorization of citator entries. The result of this effort is a database of every outgoing citation from the last 20 years of majority opinions of the Supreme Court and the federal appellate courts, each unambiguously labelled as a positive, referencing, distinguishing, or negative citation. This data will be hosted by us (LII) and made freely available without restriction. In addition to the valuable data, he also shared how to engage students, librarians, and research instructors as partners in the free law movement.

After a brief presentation of some of the ways they are beginning to use data from all the digitized case law, Adam and Jack invited us to imagine what we could do with the data. I can see possibilities for topic modeling, discovery of multi-faceted relationships between cases, mapping of changes in contract conditions, and more. Many more features, tools, and use cases were suggested by the other attendees. We welcome you to send us your personal wish list for features to make this information useful to you.

I also participated in a panel discussion on software management of large digital archives, moderated by Wilhelmina Randtke (Florida Academic Library Services Cooperative), along with Jack Cushman and Wei Fang (Assistant Dean for Information Technology and Head of Digital Services, Rutgers Law Library).

There was so much interest in the Oyez Project's move to the LII that Craig's presentation on LII's use of web analytics was replaced by a discussion, hosted by Craig and Tim Stanley (Justia), on the transition. The rather lively discussion was made all the more entertaining by an impromptu costume change by Craig. The prevailing sentiment after the discussion was that the Oyez Project was in the best possible hands and 'safe'.

An unexpected bonus was the number of LII users who made it a point to compliment the LII and express how useful they find our services. One particularly enthusiastic fan was DeAnna Swearington, Director of Operations at Quimbee.com (learning tools for law students). I also met Wilson Tsu, CEO of LearnLeo and a Cornell alum, who had fond memories of when the LII first started. There were also several former law students who told me how invaluable the LII collections had been to them in school and continue to be in their current occupations.

All in all, a successful and enlightening conference. A big thank you to the organizers. They did an excellent job. I am already looking forward to next year!

One of the great things that happens at the LII is working with the amazing students who come to study at Cornell — and finding out about the projects they’ve been cooking up while we weren’t distracting them by dangling shiny pieces of law before their eyes. This spring, Karthik Venkataramaiah, Vishal Kumkar, Shivananda Pujeri, and Mihir Shah — who previously worked with us on regulatory definition extraction and entity linking — invited us to attend a presentation they were giving at a conference of the American Society for Engineering Education: they had developed an app to assist dementia patients in interacting with their families.

The Remember Me app does a number of useful things — reminds patients to prepare for appointments, take medications, and so forth. But the remarkable idea is the way it would help dementia patients interact with people in their lives.

Here’s how it works: the app is installed on the phones of both the dementia sufferer and their loved ones and caregivers. When someone the patient knows comes into proximity with the patient, the app automatically reminds the patient who the person is and how they know them, flashing up pictures designed to place the person in a familiar context and remind the patient of their connection. Given the way that memory is keyed to specific contexts, this helps patients stay grounded in relating to the people they love, even when their disease makes those people hard to recall.

One notable feature of the app is that it was designed not for a class in app development but for one in cloud computing, which means that the app can be used by a large number of people. The nature of the app also presented additional requirements: the team noted that “as our project is related to health domain, we need to be more careful with respect to cloud data security.” Further, although the students were software engineers who were tasked with developing a scalable application, their app reflects a thoughtful approach to developing a user experience that can benefit people with memory and other cognitive impairments. Associate Director Sara Frug says, “Among the many teams of talented M.Eng. students with whom we have worked over the years, Karthik, Vishal, Shivananda, and Mihir have shown a rare combination of skill and sophistication in software engineering, product design, and project management. Their app is a remarkable achievement, and we are proud to have seen its earliest stages of development.”

The Remember Me app has been developed as a prototype, with its first launch scheduled for August.


In 2009, the US House of Representatives agreed to a resolution designating March 14 as “Pi Day”.

So I thought I’d see whether I could rustle up a nice example of pi in the Code of Federal Regulations. I did manage to find the π I was looking for, but it turns out that pi also gives us some examples of why we need to disambiguate.

In the process, I also found a number of other “pi”s in the CFR that have nothing to do with the mathematical constant.

So the story of “pi” in the CFR is a story of disambiguation. When we read (or our software parses) the original text of a particular regulation, it’s reasonably straightforward to tell which of the many “pi”s it’s discussing. We have capitalization, punctuation, formatting, and structural and contextual clues. Once the search engines work their case-folding, acronym normalization, and other magic, we end up with an awful lot of “pi”s and very little context.
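A toy example of that flattening (not our search stack, just ordinary string handling):

```python
# Once case-folding and punctuation stripping are applied, very different
# "pi"s collapse into the same token, and the disambiguating context is gone.
mentions = ["pi", "Pi", "PI", "P.I."]              # as they might appear in regulation text
normalized = {m.casefold().replace(".", "") for m in mentions}
print(normalized)                                   # {'pi'}
```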

Postscript. The real pi was a bit of a letdown: “P = pi (3.14)” – well, close enough.

In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.
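For instance, a minimal sketch with lxml; the element names (DIV5 for a Part, DIV8 for a Section) and the N attribute are assumptions loosely modeled on the eCFR bulk-data XML, not necessarily our exact schema:

```python
# Minimal sketch with lxml, not LII's production code. Element names
# (DIV5 = Part, DIV8 = Section) and the N attribute are assumptions.
from lxml import etree

tree = etree.parse("ECFR-title8.xml")                     # full-Title document (hypothetical filename)
section = tree.xpath('//DIV8[@N="215.21"]')[0]            # one Section, found by identifier

# Walking up the tree recovers the Section's ancestry (Subpart, Part, Chapter, Title, ...)
ancestry = [(el.tag, el.get("N")) for el in section.iterancestors()]

# ...and every Section sharing an arbitrary ancestor is one more XPath away.
part = section.xpath("ancestor::DIV5[1]")[0]              # nearest enclosing Part
section_ids = part.xpath(".//DIV8/@N")                    # all Section identifiers in that Part
print(ancestry, section_ids[:5])
```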

Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendant Sections.

The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it inefficient to keep opening all of the documents to see whether they had the ancestry in question.

So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.
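A sketch of that lookup, under an assumed JSON layout (the file name and structure are illustrative, not our actual format):

```python
# The master table of contents ("GPS") maps each ancestor to the Sections it
# contains, so enrichment jobs never open a Section file just to test ancestry.
import json
from pathlib import Path

toc = json.loads(Path("cfr_master_toc.json").read_text())
# e.g. {"8 CFR Part 215": ["215.1", "215.2", "215.21", "215.22", ...], ...}

def sections_under(ancestor: str) -> list[str]:
    """Return every Section identifier filed under a Chapter, Part, or Subpart."""
    return toc.get(ancestor, [])

for sec_id in sections_under("8 CFR Part 215"):
    print("would enrich", f"sections/{sec_id}.xml")
```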

Back in September when we started LII: Under the Hood, we waited until we had a bare-bones release of the eCFR text before we told you much about it. “What’s the big deal about going from one XML source to another?” asked, among other people, our director. As it turns out, it was a bit of a bigger deal than we’d hoped.

The good news about the new corpus was that it was much (much, much, much) cleaner and easier to work with than the original book XML. Instead of being an XML representation of a printed volume (with all of the attendant printed-volume artifacts, such as information about the seal of the National Archives and Records Administration, OMB control numbers, the ISBN prefix, and the contact information for the U.S. Government Publishing Office), the eCFR XML contains just the marked-up text of the CFR and retains only minimal artifacts of its printed-volume origins (we’re looking at you, 26 CFR Part 1). The eCFR XML schema is simpler and far more usable (for example, it uses familiar HTML markup like “table” elements rather than proprietary typesetting-code-derived markup). The structure of the files is far more accurate as well, which is to say that the boundaries of the structural elements as marked up match the actual structure of the documents.

The bad news was that the software we’d written to extract the CFR structure from the book XML couldn’t be simply ported over; it had to be rebuilt — both to handle the elements in the new schema and to deal with the remaining print-volume artifacts.

The further bad news — or perhaps we should say self-inflicted wound — was that in the process of rebuilding, we decided to change the way we were processing the text.

Here’s what we changed. In the past, we’d divided book-volume XML into CFR Parts; this time we decided instead to divide into Sections (the smallest units which occur throughout the CFR). The advantage, short-term and long-term, is that it makes it far easier to run text-enrichment processes in parallel (many sections can be marked at the same time). That is to say, if our software is linking each term to its definition, we no longer have to wait for all of the links to be added to Section 1.1 before we add links to Section 1.2. The disadvantage is that we have more metadata housekeeping to do to make sure that we’re capturing all of the sections and other granules that belong to a Part or Chapter. That is to say, when we’re working with a Section, we now need another way to know which Part it belongs to. And when we’re marking all instances of a term that has been defined with the scope of a Part, we need a way to be sure that we’ve captured all of the text that Part contains.
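The parallelism payoff is easy to sketch (a toy version only; the file layout and the body of enrich() are stand-ins, not our pipeline):

```python
# Toy sketch of the advantage: once the corpus is split into per-Section files,
# an enrichment pass (definition links, entity links, ...) can run over many
# Sections at once. Paths and the enrich() body are hypothetical.
from multiprocessing import Pool
from pathlib import Path

def enrich(path: Path) -> str:
    text = path.read_text()
    # ... mark up definitions, cross references, entities ...
    return f"{path.name}: {len(text)} characters enriched"

if __name__ == "__main__":
    sections = sorted(Path("ecfr/title-8/sections").glob("*.xml"))
    with Pool() as pool:
        for line in pool.imap_unordered(enrich, sections):
            print(line)
```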

And as we learned from our students, metadata housekeeping entails a bit more of a learning curve than XML parsing.

So instead of porting software, we were (by which I mean, of course, Sylvia was) rebuilding it — with a new structure. Suddenly this looked like a much steeper hill to climb.

Next up: Climbing the hill.

Lexicographer: “A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson

One of our ongoing projects here at LII is to include definitions in our electronic texts. We’ve had them in the UCC since the mid-1990s; we put them in the earlier version of the CFR a few years ago. Now, as part of the process of rolling out the eCFR and migrating all our old features to it, we’re working on adding definitions to it as well. And just to make life more complicated, we decided at the same time to ask a team of intrepid M.Eng. students (more about them in a subsequent post) to work on the next generation of enhancements to the feature.

Adding definitions to text is a tricky process, so we’re going to take a deliberate approach to spelling out how this works (read: milk it for several blog posts). For starters, let’s look today at why this is a difficult task — why it was hard before, and why it’s hard now.

What we aim for is simple: we want every section of the CFR to have all the key legal terms highlighted. Users could click on each defined term and see how the CFR itself defines it. (As you probably know, regulators are famous for defining terms in what we might generously call idiosyncratic ways.) But since the CFR is about a gazillion pages long, there’s no way we could go through and do all of that by hand. It needed to be automated.

Which instantly plunged us into the icy waters of natural language processing (or sort-of-natural language processing, a similar if somewhat understudied field).

Definitions in the CFR are not always conveniently labeled; they do not always say something like “we define the term ‘brontosaurus’ to mean ‘a small plunger to be used on miniature toilets.’” They are tricky to pick out — perhaps not tricky for a human reader (although they can be that, too) but certainly tricky for a Turing Machine, even one with its best tuxedo on and its shoes shined.

Consider the following passage:

“Inspected and passed” or “U.S. Inspected and Passed” or “U.S. Inspected and Passed by Department of Agriculture” (or any authorized abbreviation thereof). This term means that the product so identified has been inspected and passed under the regulations in this subchapter, and at the time it was inspected, passed, and identified, it was found to be not adulterated.

We need Tuxedoed Turing to figure out that the three terms in quotations are all equivalent, and all have the same definition. We need it to figure out if an “authorized abbreviation” is being used. We need it to figure out which subchapter this applies to, and not accidentally apply it too narrowly or too widely. And so forth.

And we need it to figure out all sorts of similar problems that other definitions might have: what term precisely is being defined, which words are the definition, what text the definition applies to.
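To make that concrete, here is a toy pattern for just the passage above (a sketch, not our extractor, which has to cope with many more shapes of definition than this one):

```python
# Toy extraction for the "Inspected and passed" passage: several quoted terms
# joined by "or", followed by "This term means ...". Real CFR definitions take
# many more forms than this single pattern covers.
import re

passage = (
    '"Inspected and passed" or "U.S. Inspected and Passed" or "U.S. Inspected '
    'and Passed by Department of Agriculture" (or any authorized abbreviation '
    'thereof). This term means that the product so identified has been '
    'inspected and passed under the regulations in this subchapter, and at the '
    'time it was inspected, passed, and identified, it was found to be not '
    'adulterated.'
)

terms_part, definition = passage.split(". This term means ", 1)
terms = re.findall(r'"([^"]+)"', terms_part)   # the three equivalent defined terms
print(terms)
print(definition)                               # the shared definition text
```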

And we made a lot of strides in that direction. The definitions ain’t perfect, but they’re there, in the CFR. Yay, us!

Except, uh, then we had to start all over again. For the eCFR. This time, with shiny new bumpers and a fresh paint job.

Tune in next week for the next part of the story.