
In June, we received an email from Nigerian human rights lawyer Jake Effoduh, who was starting a free access to law project, #Law2Go, while on a summer fellowship at the Harvard Library Innovation Lab. In his concept note, he said:

“#Law2Go seeks to leverage on the extraordinary growth in the use of smart phones in Nigeria. By the end of 2017, there will be 18 million smart phone users in Nigeria with 38 million smartphones projected to be in used in Nigeria by 2018 – a growth like no other on the continent. This platform can be utilised to address one of the most crucial problems in Nigeria’s justice sector which is access.”

Two months later, Effoduh sent us links to the Law2Go website and its Android app. What sets Law2Go apart among open access to law websites is its combination of translation and audio. Effoduh hosts a popular radio show in Nigeria and has not only translated the Nigerian Constitution into local languages, but also provided a simple English interpretation and paired each text with an audio recording.

Effoduh has used social media to develop an FAQ with questions ranging from the most general (e.g., “what are human rights?”) to the very specific (e.g., “my land containing minerals, oils, natural gas and the government wants to take it away; do they have a right to?”). The site has also published a number of resources for those seeking legal services, and it provides a contact form for those seeking legal advice.

We’re back from the 2017 Legislative Data and Transparency Conference in Washington, DC, where technologists from the federal government and transparency organizations presented their latest open data work.

In the past year, several government websites have completed initiatives that make their data more accessible and more reusable: from mobile-friendly redesigns of Congress.gov, Govinfo.gov, and GPO.gov, to new repositories of bulk data for download, to initiatives that will support original drafting in formats suitable for publication. We were particularly excited to see LII’s work on the Legislative Data Model being adopted in government information systems, as well as the publication of FDsys metadata in RDF.

I spoke on a panel about data integration, along with my co-panelist, GovTrack founder Josh Tauberer, and our moderator, GPO’s Lisa LaPlant. Each of us is finding new ways to pick up a legal text, learn what we can about it, and connect it to other legal texts and, particularly in LII’s case, real-world objects.

This presentation was the latest installment in our ongoing work to aggregate different data sources and connect them to one another, helping people navigate from what they know to what they don’t know and making it easier for everyone to find and understand the law that affects them.

On May 12, LII engineers Sylvia Kwakye, Ph.D., and Nic Ceynowa hosted a presentation in which the 14 Cornell University Master of Engineering students they’d supervised this spring walked LII and Cornell Law Library staff through their project work on the Docket Wrench application.

LII adopted the Docket Wrench application from the Sunlight Foundation when it closed its software development operation last fall. Originally developed by software engineer Andrew Pendleton in 2012, Docket Wrench is designed to help users explore public participation in the rulemaking process.  It supports exploration by rulemaking docket, agency, commenting company or organization, and the language of the comments themselves. It is a sprawling application with many moving parts, and when LII adopted it, it had not been running for two years.

On the infrastructure team, Mahak Garg served as project manager and, along with Mutahir Kazmi, focused on updating and supporting infrastructure for the application. They updated the software and created a portable version of the application for other teams to use for development.

The search team, Gaurav Keswani, Soorya Pillai, Ayswarya Ravichandran, Sheethal Shreedhara, and Vinayaka Suryanarayana, ensured that data made its way into, and could be correctly retrieved from, the search engine. This work included setting up and maintaining automated testing to ensure that the software would continue to function correctly after each enhancement was made.

The entities team, Shweta Shrivastava, Vikas Nelamangala, and Saarthak Chandra, ensured that the software could detect and extract the names of corporations and organizations submitting comments in the rulemaking process. Because the data on which Docket Wrench originally relied was no longer available, they researched, found a new data source, and altered the software to make use of it. (Special thanks to Jacob Hileman at the Center for Responsive Politics for his help with the Open Secrets API.)
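The post doesn’t spell out how the matching works, but the core step, reconciling the messy organization names found in comments with a canonical list, can be sketched roughly as follows. The organization list, suffix handling, and threshold below are illustrative stand-ins, not the team’s actual data source or code.

```python
from difflib import SequenceMatcher

# Hypothetical reference list of organizations; in practice this would be
# loaded from an external data source rather than hard-coded.
KNOWN_ORGS = ["American Petroleum Institute", "Sierra Club", "National Mining Association"]

def normalize(name: str) -> str:
    """Lowercase and strip common corporate suffixes before comparison."""
    name = name.lower().strip()
    for suffix in (", inc.", " inc.", " llc", " corp.", " corporation"):
        name = name.removesuffix(suffix)
    return name

def match_org(candidate: str, threshold: float = 0.85):
    """Return the best-matching known organization, or None if nothing is close."""
    best_name, best_score = None, 0.0
    for org in KNOWN_ORGS:
        score = SequenceMatcher(None, normalize(candidate), normalize(org)).ratio()
        if score > best_score:
            best_name, best_score = org, score
    return best_name if best_score >= threshold else None

print(match_org("American Petroleum Institute, Inc."))  # American Petroleum Institute
```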

Deekshith Belchappada, Monisha Chandrashekar, and Anusha Morappanavar evaluated alternate techniques for computing document similarity, which enables users to find clusters of similar comments and see which language from a particular comment is unique. And Khaleel R prototyped the use of Apache Spark to detect and mark legal citations and legislation names within the documents.
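For readers curious what “computing document similarity” looks like in practice, here is a minimal sketch of one standard approach, TF-IDF vectors compared by cosine similarity, using scikit-learn. It illustrates the general technique rather than the students’ implementation, and the sample comments and threshold are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy comments standing in for documents pulled from a rulemaking docket.
comments = [
    "I support the proposed rule because it protects consumers.",
    "I support the proposed rule because it protects consumers and small businesses.",
    "This rule would impose unreasonable costs on manufacturers.",
]

# Vectorize and compute pairwise cosine similarity.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(comments)
similarity = cosine_similarity(matrix)

# Treat pairs above a (tunable) threshold as near-duplicate "form letter" comments.
THRESHOLD = 0.8
for i in range(len(comments)):
    for j in range(i + 1, len(comments)):
        if similarity[i, j] >= THRESHOLD:
            print(f"comments {i} and {j} look like near-duplicates "
                  f"(similarity {similarity[i, j]:.2f})")
```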

So, where is it?

The good news is that after a semester of extremely hard work, “Team Docket” has Docket Wrench up and running again. But we need to ingest a great deal more data and test to make sure that the application can run once we’ve done so. This will take a while. As soon as the students have completed their final project submission, though, we’ll be starting a private beta in which our collaborators can nominate dockets, explore the service, and propose features. Please join us!

In the fall of 2015, we wrote about a traffic spike that had occurred during one of the Republican primary debates. Traffic on the night of Sept. 16, 2015 peaked between 9 and 10pm, with a total of 47,025 page views during that interval. For that day, traffic totaled 204,905 sessions and 469,680 page views. At the time, the traffic level seemed like a big deal – our server had run out of resources to handle the traffic, and some of the people who had come to the site had to wait to find the content they were looking for – in that instance, the 14th Amendment to the Constitution.

A year later, we found traffic topping those levels on most weekdays. But by that time, we barely noticed. Nic Ceynowa, who runs our systems, had, over the course of the prior year, systematically identified and addressed unnecessary performance-drains across the website. He replaced legacy redirection software with new, more efficient server redirects. He cached dynamic pages that we knew to be serving static data (because we know, for instance, that retired Supreme Court justices issue no new opinions). He throttled access to the most resource-intensive pages (web crawlers had to slow down a bit so that real people doing research could proceed as usual). As a result, he could allow more worker processes to field page requests and we could continue to focus on feature development rather than server load.

Then came the inauguration on January 20th. Presidential memoranda and executive orders inspired many, many members of the general public to read the law for themselves. Traffic hovered around 220,000 sessions per day for the first week. And then the President issued the executive order on immigration. By Sunday January 29th, we had 259,945 sessions – more than we expect on a busy weekday. On January 30th, traffic jumped to 347,393. And then on January 31st traffic peaked at 435,549 sessions – and over 900,000 page views.

The servers were still quiet. Throughout, we were able to continue running some fairly resource-hungry updating processes to keep the CFR current. We’ll admit to having devoted a certain amount of attention to checking in on the real-time analytics to see what people were looking at, but for the most part it was business as usual.

Now, the level of traffic we were talking about was still small compared to the traffic we once fielded when Bush v. Gore  was handed down in 2000 (that day we had steady traffic of about 4000 requests per minute for 24 hours). And Nic is still planning to add clustering to our bag of tricks. But the painstaking work of the last year has given us a lot of breathing room – even when one of our fans gives us a really big internet hug. In the meantime, we’ve settled into the new normal and continue the slow, steady work of making the website go faster when people need it the most.

A bit over a year ago, we released the first iteration of our new version of the eCFR, the Office of the Federal Register’s unofficial compilation of the current text of the Code of Federal Regulations. At the time, we’d been using the text of the official print version of the CFR to generate our electronic version – it was based on GPO’s authenticated copy of the text, but it was woefully out of date because titles are prepared for printing only once a year. During the past year, while retrofitting and improving features like indentation, cross references, and definitions, we maintained the print-CFR in parallel so that readers could, if they chose, refer to the outdated-but-based-on-official-print version.

This week we’re discontinuing the print-CFR. The reason? Updates. As agencies engage in rulemaking activity, they amend, revise, add, and remove sections, subparts, parts, and appendices. During the past year, the Office of the Federal Register has published thousands of such changes to the eCFR. These changes will eventually make their way into the annual print edition of the CFR, but most of the time, the newest rules making the headlines are, at best, many months away from reaching print.

What’s new? Well, among those thousands of changes were a number of places where agencies were adding rules reflecting new electronic workflows. And these additions provide us with an occasion for checking every facet of our own electronic workflows. When the Citizenship and Immigration Service added the Electronic Visa Update System, they collected the existing sections in Part 215 of Title 8 of the CFR into a new Subpart A and added sections 215.21-215.24 under the new Subpart B. So, after adding the new sections, the software had to refresh the table of contents for Part 215 and create the table of contents for 8 CFR Part 215 Subpart A.
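To make that bookkeeping concrete, here is a small sketch of what regenerating a part-level table of contents might look like. The XML element and attribute names are simplified placeholders, not the actual source schema.

```python
from lxml import etree

# A minimal stand-in for the Part 215 source XML; the element and attribute
# names here are illustrative, not the actual eCFR schema.
part_xml = """<PART N="215">
  <SUBPART N="A"><SECTION N="215.1"/><SECTION N="215.2"/></SUBPART>
  <SUBPART N="B"><SECTION N="215.21"/><SECTION N="215.22"/></SUBPART>
</PART>"""

def build_toc(part):
    """Rebuild a part-level table of contents from its subparts and sections."""
    return {
        subpart.get("N"): [s.get("N") for s in subpart.findall("SECTION")]
        for subpart in part.findall("SUBPART")
    }

part = etree.fromstring(part_xml)
print(build_toc(part))
# {'A': ['215.1', '215.2'], 'B': ['215.21', '215.22']}
```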

What you’ll see doesn’t look a whole lot different from what’s been there for the past year, but it will be a lot easier to find new CFR sections, the pages will load more quickly, and we will be able to release new CFR features more quickly.

On December 5th, LII engineers Nic Ceynowa and Sylvia Kwakye, Ph.D., looked on with pride as the Cornell University Master of Engineering students they’d supervised presented a trio of fall projects to LII and Cornell Law Library staff.

Entity Linking

Mutahir Kazmi and Shraddha Vartak pulled together, enhanced, and scaled a group of applications that link entities in the Code of Federal Regulations. Entity linking is a set of techniques that detect references to things in the world (such as people, places, animals, and pharmaceuticals) and link them to data sources that provide more information about them. The team analyzed the entities and the corpus to determine which entities required disambiguation, distinguished entities to mark before and after defined-term markup, and used Apache Spark to speed up the overall application by 60%.
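At its simplest, the detection step amounts to dictionary lookup against the text, with each hit linked out to an external identifier. The lexicon and URLs below are placeholders, and the sketch leaves out the disambiguation and markup-ordering work described above.

```python
import re

# Placeholder lexicon mapping surface forms to external identifiers;
# LII's real sources and URLs differ.
LEXICON = {
    "grizzly bear": "https://example.org/species/ursus-arctos-horribilis",
    "ibuprofen": "https://example.org/drug/ibuprofen",
}

def link_entities(text):
    """Return (surface form, character offset, link) for each lexicon hit."""
    hits = []
    for term, url in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append((match.group(0), match.start(), url))
    return sorted(hits, key=lambda hit: hit[1])

print(link_entities("The grizzly bear is listed as threatened under this part."))
```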

[Screenshot: linked entities in the CFR]

US Code Definition Improvement

Khaleel Khaleel, Pracheth Javali, Ria Mirchandani, and Yashaswini Papanna took on the task of adapting our CFR definition extraction and markup software to meet the unique requirements of the US Code. In addition to learning the hierarchical structure and identifier schemes within the US Code corpus, the project involved discovering and extracting definition patterns that had not previously been identified; parsing multiple defined terms, word roots, and abbreviations from individual definitions; and correctly detecting the boundaries of the definitions.

Before:

[Screenshot: US Code definition display, before]

And after:

[Screenshot: US Code definition display, after]
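As a rough illustration of the pattern-extraction half of that work, here is a toy sketch that pulls defined terms out of two common phrasings. The real software handles many more patterns, along with word roots, abbreviations, and boundary detection, none of which appears here.

```python
import re

# Two common definition phrasings; the US Code uses many more variants,
# so these regexes are illustrative rather than exhaustive.
PATTERNS = [
    re.compile(r'[Tt]he terms? [“"](?P<term>[^”"]+)[”"] (?:means|includes)'),
    re.compile(r'[“"](?P<term>[^”"]+)[”"] means'),
]

def extract_defined_terms(text):
    """Return the distinct defined terms found in a block of statutory text."""
    terms = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            term = match.group("term")
            if term not in terms:
                terms.append(term)
    return terms

sample = 'The term “motor vehicle” means a vehicle driven by mechanical power.'
print(extract_defined_terms(sample))  # ['motor vehicle']
```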

Search Prototype

Anusha Morappanavar, Deekshith Belchapadda, and Monisha Pavagada Chandrashekar built a prototype of the semantic search application using Elasticsearch and Flask. In addition to learning how to work with Elasticsearch, they had to learn the hierarchical structure of the US Code and CFR, understand how cross-references work within legal corpora, and make use of additional metadata such as the definitions and linked entities the other groups had been working on. Their work will support a search application that distinguishes matches in which the search term is defined, appears in the full text, or appears in a definition of a term that appears within the full text of a document.

[Screenshot: semantic search prototype]
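One way to support that three-way distinction in Elasticsearch is to index definitions and defined terms as separate fields and boost them differently at query time. The field names and weights below are assumptions for illustration; the prototype’s actual mapping and scoring may differ.

```python
import json

def build_query(term):
    """Rank documents that define the term above documents that merely use it,
    and those above documents where it only appears in an associated definition."""
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": [
                    "defined_terms^3",  # the document defines this term
                    "full_text^2",      # the term appears in the document's text
                    "definitions",      # the term appears in a definition the document relies on
                ],
            }
        }
    }

# The body would be handed to the Elasticsearch search API, for example:
# es.search(index="uscode", body=build_query("navigable waters"))
print(json.dumps(build_query("navigable waters"), indent=2))
```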

We’ll be rolling out the features supported by this semester’s M.Eng. projects starting with entity linking in January.

“There is nothing like looking, if you want to find something.”

– J.R.R. Tolkien

There…

This summer, at close to the very last minute, I set out for Cambridge, Massachusetts to pursue a peculiar quest for open access to law. Steering clear of the dragon on its pile of gold, I found some very interesting people in a library doing something in some ways parallel, and in many ways complementary, to what we do at LII.

At the Harvard Law School Library, there’s a group called the Library Innovation Lab, which uses technology to improve preservation and public access to library materials, including digitizing large corpora of legal documents. Its work complements what we do at the LII, and I went there to develop some tools that would help both of us, and others as well.

The LIL summer fellowship program that made this possible brought together a group with wide-ranging interests: substantive projects, from Neel Agrawal’s website on the history of African drumming laws to Muira McCammon’s research on the Guantanamo detainee library; crowdsourced documentation and preservation projects, such as Tiffany Tseng’s Spin and Pix devices, Alexander Nwala’s Local Memory, and Ilya Kreymer’s Webrecorder; and infrastructure projects such as Jay Edwards’s Caselaw Access Project API.

My project involved work on a data model to help developers make connections between siloed collections of text and metadata, one that will, we hope, help future developers automate the process of connecting concepts in online legal corpora (both LIL’s and ours at LII) to enriching data and context from many different sources.

The work involved exploring a somewhat larger-than-usual number of ontologies, structured vocabularies, and topic models. Each, in turn, came with one or more sets of subjects. Some (like EuroVoc and the topic models) came with sizable amounts of machine-readable text; others (like Linked Data for Libraries) came with very little machine-accessible text. As my understanding of both the manageable and the insurmountable challenges associated with each one grew, I developed a far greater appreciation for the intuition that had led me to the project in the first place: there is a lot of useful information locked in these resources, and each has a role to play.
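As one concrete example of what “machine-readable text” means here: a SKOS-encoded vocabulary (EuroVoc, for instance, is distributed in SKOS/RDF) exposes labels that can be harvested with a few lines of rdflib. The file name below is a placeholder for whichever vocabulary dump is at hand.

```python
from rdflib import Graph
from rdflib.namespace import SKOS

# "vocabulary.rdf" stands in for any SKOS-encoded vocabulary dump.
g = Graph()
g.parse("vocabulary.rdf")

# Collect English preferred labels, the machine-readable text we can
# compare against terms found in legal corpora.
labels = {}
for concept, label in g.subject_objects(SKOS.prefLabel):
    if getattr(label, "language", None) in (None, "en"):
        labels.setdefault(str(concept), []).append(str(label))

print(f"{len(labels)} concepts with English labels")
```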

In the process, I drew enormous inspiration from the dedication and creativity of the LIL group, from Paul Deschner’s Haystacks project, which provides a set of filters to create a manageable list of books on any subject or search term, to Brett Johnson’s work supporting the H2O open textbook platform, to Matt Phillips’s exploration of private talking spaces, to the Caselaw Access Project visualizations such as Anastasia Aisman’s topic mapping and  Jack Cushman’s word clouds (supported by operational, programming, and metadata work from Kerri Fleming, Andy Silva, Ben Steinberg, and Steve Chapman). (All of this is thanks to the Harvard Law Library leadership of Jonathan Zittrain, LIL founder Kim Dulin, managing director Adam Ziegler, and library director Jocelyn Kennedy.)

And back again…

Returning home to LII, I’m grateful for the rejuvenating energy that comes from talking to new people, observing how other high-performing groups do their work, and having had dedicated time to bring a complicated idea to fruition. All in all, it was a marvelous summer with marvelous people. But they did keep looking at me as if to ask why I’d brought along thirteen dwarves, and how I managed to vanish any time I put that gold ring on my finger.

One of the great things that happens at the LII is working with the amazing students who come to study at Cornell — and finding out about the projects they’ve been cooking up while we weren’t distracting them by dangling shiny pieces of law before their eyes. This spring, Karthik Venkataramaiah, Vishal Kumkar, Shivananda Pujeri, and Mihir Shah — who previously worked with us on regulatory definition extraction and entity linking — invited us to attend a presentation they were giving at a conference of the American Society for Engineering Education: they had developed an app to assist dementia patients in interacting with their families.

The Remember Me app does a number of useful things — reminds patients to prepare for appointments, take medications, and so forth. But the remarkable idea is the way it would help dementia patients interact with people in their lives.

Here’s how it works: the app is installed on the phones of both the dementia sufferer and their loved ones and caregivers. When someone the patient knows comes into proximity, the app automatically reminds the patient who the person is and how they know them, flashing up pictures designed to place the person in a familiar context and remind the patient of their connection. Because memory is always keyed to specific contexts, this helps patients stay grounded in relating to people they love but whom their disease may prevent them from recalling.

One notable feature of the app is that it was designed not for a class in app development but for one in cloud computing, which means it was built to serve a large number of users. The nature of the app also presented additional requirements: the team noted that “as our project is related to health domain, we need to be more careful with respect to cloud data security.” Further, although the students were software engineers tasked with developing a scalable application, their app reflects a thoughtful approach to developing a user experience that can benefit people with memory and other cognitive impairments. Associate Director Sara Frug says, “among the many teams of talented M.Eng. students with whom we have worked over the years, Karthik, Vishal, Shivananda, and Mihir have shown a rare combination of skill and sophistication in software engineering, product design, and project management. Their app is a remarkable achievement, and we are proud to have seen its earliest stages of development.”

The Remember Me app has been developed as a prototype, with its first launch scheduled for August.

In 2009, the US House of Representatives agreed to a resolution designating March 14 as “Pi Day”.

So I thought I’d see whether I could rustle up a nice example of pi in the Code of Federal Regulations. I did manage to find the π I was looking for, but it turns out that pi also gives us some examples of why we need to disambiguate.

In the process, I also found a number of other “pi”s in the CFR – abbreviations and identifiers that have nothing to do with the mathematical constant.

So the story of “pi” in the CFR is a story of disambiguation. When we read (or our software parses) the original text of a particular regulation, it’s reasonably straightforward to tell which of the many “pi”s it’s discussing. We have capitalization, punctuation, formatting, and structural and contextual clues. Once the search engines work their case-folding, acronym normalization, and other magic, we end up with an awful lot of “pi”s and very little context.
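Those clues are easy enough to act on while the original text is still in hand. Here is a toy sketch of the idea, with patterns invented purely for illustration:

```python
import re

def classify_pi(sentence):
    """Use surface clues (formulas, capitalization) to tell the constant
    apart from other 'pi's before case-folding erases the difference."""
    if re.search(r"\bP\s*=\s*pi\b|3\.14", sentence):
        return "mathematical constant"
    if re.search(r"\bP\.?I\.?\b", sentence):  # capitalized abbreviation
        return "abbreviation"
    if re.search(r"\bpi\b", sentence, re.IGNORECASE):
        return "ambiguous without more context"
    return "no pi here"

print(classify_pi("P = pi (3.14)"))                 # mathematical constant
print(classify_pi("The PI must sign the report."))  # abbreviation
```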

Postscript. The real pi was a bit of a letdown: “P = pi (3.14)” – well, close enough.

In our last post, we mentioned that we gained a lot of efficiency by breaking up the CFR text into Sections and processing each section individually. The cost was losing direct access to the structural information in the full Title documents, which made us lose our bearings a bit. In the source data, each Section is nested within its containing structures or “ancestors” (usually Parts and Chapters). Standard XML tools (modern libraries all support XPath) make it trivial to discover a Section’s ancestry or find all of the other Sections that share an arbitrary ancestor.
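For readers who haven’t leaned on XPath for this, a short lxml sketch shows both operations. The element names are simplified stand-ins for the real source schema.

```python
from lxml import etree

# Toy stand-in for a full Title document; element names are illustrative,
# not the actual source schema.
title_xml = """<TITLE N="40">
  <CHAPTER N="I">
    <PART N="60">
      <SECTION N="60.1"/>
      <SECTION N="60.2"/>
    </PART>
  </CHAPTER>
</TITLE>"""
doc = etree.fromstring(title_xml)

# Discover a section's ancestry...
section = doc.xpath('//SECTION[@N="60.2"]')[0]
ancestry = [f'{el.tag} {el.get("N")}' for el in section.iterancestors()]
print(list(reversed(ancestry)))  # ['TITLE 40', 'CHAPTER I', 'PART 60']

# ...or find every section that shares a given ancestor.
print(doc.xpath('//PART[@N="60"]//SECTION/@N'))  # ['60.1', '60.2']
```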

Once we’d broken up the Titles into Sections, we needed to make sure the software could still accurately identify a containing structure and its descendent Sections.

The first idea was to put a compact notation for the section’s ancestry into each section document. Sylvia added a compact identifier as well as a supplementary “breadcrumb” element to each section. In theory, it would be possible to pull all sections with a particular ancestor and process only those. As it turned out, however, the students found it inefficient to keep opening all of the documents to check whether they had the ancestry in question.

So Sylvia constructed a master table of contents (call it the GPS?). The students’ software could now, using a single additional document, pull all sections belonging to any given ancestor. The purists in the audience will, of course, object that we’re caching the same metadata in multiple locations. They’re right. We sacrificed some elegance in the interest of expedience (we were able to deploy the definitions feature on 47 of 49 CFR titles after a semester); we’ll be reworking the software again this semester and will have an opportunity to consolidate if it makes sense.
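A sketch of the idea, with made-up identifiers: invert each section’s breadcrumb into a single table that maps each ancestor to its sections, then consult that one document instead of opening every section file.

```python
import json
from collections import defaultdict

# Hypothetical per-section records carrying the "breadcrumb" ancestry
# described above; real identifiers differ.
sections = [
    {"id": "40-60.1", "ancestors": ["title-40", "chapter-I", "part-60"]},
    {"id": "40-60.2", "ancestors": ["title-40", "chapter-I", "part-60"]},
    {"id": "40-61.1", "ancestors": ["title-40", "chapter-I", "part-61"]},
]

# Invert the breadcrumbs into a single master table of contents.
toc = defaultdict(list)
for section in sections:
    for ancestor in section["ancestors"]:
        toc[ancestor].append(section["id"])

# One extra document to read, instead of opening every section file.
with open("master_toc.json", "w") as out:
    json.dump(toc, out, indent=2)

print(toc["part-60"])  # ['40-60.1', '40-60.2']
```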