
In June, we received an email from Nigerian human rights lawyer Jake Effoduh, who was starting a free access to law project, #Law2Go, while on a summer fellowship at the Harvard Library Innovation Lab. In his concept note, he said:

“#Law2Go seeks to leverage on the extraordinary growth in the use of smart phones in Nigeria. By the end of 2017, there will be 18 million smart phone users in Nigeria with 38 million smartphones projected to be in use in Nigeria by 2018 – a growth like no other on the continent. This platform can be utilised to address one of the most crucial problems in Nigeria’s justice sector which is access.”

Two months later, Effoduh sent us a link to the Law2Go website, with a link to the Android app. Distinctive among open access to law websites is the innovative combination of translation and audio. Effoduh hosts a popular radio show in Nigeria and has not only translated the Nigerian Constitution into local languages, but also provided a simple English interpretation and paired each text with an audio recording.

Effoduh has used social media to develop an FAQ with questions ranging from the most general (e.g., “what are human rights?”) to the very specific (e.g., “my land containing minerals, oils, natural gas and the government wants to take it away; do they have a right to?”). The site has also already published a number of resources for those seeking legal services, and provides a contact form for those seeking legal advice.

We’re back from the 2017 Legislative Data and Transparency Conference in Washington, DC, where technologists from the federal government and transparency organizations presented their latest open data work.

In the past year, several government websites have completed initiatives that make their data more accessible and more re-usable: from mobile-friendly redesigns, to new repositories of bulk data for download, to initiatives that will support original drafting in formats suitable for publication. We were particularly excited to see LII’s work on the Legislative Data Model being adopted in government information systems, as well as FDsys metadata in RDF.

I spoke on a panel about data integration, along with my co-panelist, GovTrack founder Josh Tauberer, and our moderator, GPO’s Lisa LaPlant. Each of us is finding new ways to pick up a legal text, learn what we can about it, and connect it to other legal texts and, particularly in LII’s case, real-world objects.

This presentation was the latest installment in the ongoing work we’ve been doing to aggregate different data sources and connect them to one another, thus helping people navigate from what they know to what they don’t know and therefore making it easier for everyone to find and understand the law that affects them.

On May 12, LII engineers Sylvia Kwakye, Ph.D., and Nic Ceynowa hosted a presentation by the 14 Cornell University Masters of Engineering students they’d supervised this spring as they presented their project work on the Docket Wrench application to LII and Cornell Law Library staff.

LII adopted the Docket Wrench application from the Sunlight Foundation when it closed its software development operation last fall. Originally developed by software engineer Andrew Pendleton in 2012, Docket Wrench is designed to help users explore public participation in the rulemaking process. It supports exploration by rulemaking docket, agency, commenting company or organization, and the language of the comments themselves. It is a sprawling application with many moving parts, and when LII adopted it, it had not been running for two years.

On the infrastructure team, Mahak Garg served as project manager and, along with Mutahir Kazmi, focused on updating and supporting infrastructure for the application. They worked on updating the software and creating a portable version of the application for other teams to use for development.

The search team, Gaurav Keswani, Soorya Pillai, Ayswarya Ravichandran, Sheethal Shreedhara, and Vinayaka Suryanarayana, ensured that data made its way into, and could be correctly retrieved from, the search engine. This work included setting up and maintaining automated testing to ensure that the software would continue to function correctly after each enhancement was made.

The entities team, Shweta Shrivastava, Vikas Nelamangala, and Saarthak Chandra, ensured that the software could detect and extract the names of corporations and organizations submitting comments in the rulemaking process. Because the data on which Docket Wrench originally relied was no longer available, they researched, found a new data source, and altered the software to make use of it. (Special thanks to Jacob Hileman at the Center for Responsive Politics for his help with the Open Secrets API.)

Deekshith Belchappada, Monisha Chandrashekar, and Anusha Morappanavar evaluated alternate techniques for computing document similarity, which enables users to find clusters of similar comments and see which language from a particular comment is unique. And Khaleel R prototyped the use of Apache Spark to detect and mark legal citations and legislation names from within the documents.
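The post does not say which similarity techniques the team evaluated. As an illustration only, here is a minimal sketch of one common baseline, Jaccard similarity over overlapping word windows ("shingles"). The function names and sample comments are hypothetical, not taken from Docket Wrench.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def unique_shingles(comment, others, k=3):
    """Shingles that appear in comment but in none of the other comments."""
    seen = set()
    for other in others:
        seen |= shingles(other, k)
    return shingles(comment, k) - seen


# Invented sample comments, for illustration only.
form_letter = "I urge the agency to withdraw the proposed rule immediately"
variant = "I urge the agency to withdraw the proposed rule because it hurts my farm"
unrelated = "Please extend the comment period by thirty days"

print(jaccard(form_letter, variant))    # high: mostly shared language
print(jaccard(form_letter, unrelated))  # low: no shared three-word phrases
print(unique_shingles(variant, [form_letter]))
```

Comments whose pairwise score exceeds some threshold could be grouped into a cluster, and the shingles left over after subtracting the rest of the cluster point at the language that is unique to one comment.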

So, where is it?

The good news is that after a semester of extremely hard work, “Team Docket” has Docket Wrench up and running again. But we need to ingest a great deal more data and test to make sure that the application can run once we’ve done so. This will take a while. As soon as the students have completed their final project submission, though, we’ll be starting a private beta in which our collaborators can nominate dockets, explore the service, and propose features. Please join us!

A bit over a year ago, we released the first iteration of our new version of the eCFR, the Office of the Federal Register’s unofficial compilation of the current text of the Code of Federal Regulations. At the time, we’d been using the text of the official print version of the CFR to generate our electronic version – it was based on GPO’s authenticated copy of the text, but it was woefully out of date because titles are prepared for printing only once a year. During the past year, while retrofitting and improving features like indentation, cross references, and definitions, we maintained the print-CFR in parallel so that readers could, if they chose, refer to the outdated-but-based-on-official-print version.

This week we’re discontinuing the print-CFR. The reason? Updates. As agencies engage in rulemaking activity, they amend, revise, add, and remove sections, subparts, parts, and appendices. During the past year, the Office of the Federal Register has published thousands of such changes to the eCFR. These changes will eventually make their way into the annual print edition of the CFR, but most of the time, the newest rules making the headlines are, at best, many months away from reaching print.

What’s new? Well, among those thousands of changes were a number of places where agencies were adding rules reflecting new electronic workflows. And these additions provide us with an occasion for checking every facet of our own electronic workflows. When the Citizenship and Immigration Service added the Electronic Visa Update System, they collected the existing sections in Part 215 of Title 8 of the CFR into a new Subpart A and added sections 215.21-215.24 under the new Subpart B. So, after adding the new sections, the software had to refresh the table of contents for Part 215 and create the table of contents for 8 CFR Part 215 Subpart A.

What you’ll see doesn’t look a whole lot different from what’s been there for the past year, but it will be a lot easier to find new CFR sections, the pages will load more quickly, and we will be able to release new CFR features more quickly.

On December 5th, LII engineers Nic Ceynowa and Sylvia Kwakye, Ph.D., looked on in pride as the Cornell University Masters of Engineering students they’d supervised presented a trio of fall projects to LII and Cornell Law Library staff.

Entity Linking

Mutahir Kazmi and Shraddha Vartak pulled together, enhanced, and scaled a group of applications that link entities in the Code of Federal Regulations. Entity linking is a set of techniques that detect references to things in the world (such as people, places, animals, pharmaceuticals) and link them to data sources that provide more information about them. The team analyzed the entities and the corpus in order to determine which entities required disambiguation, distinguished entities to mark before and after defined-term markup, and used Apache Spark to speed the overall application by 60%.
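To make the idea concrete, here is a minimal sketch of dictionary-based entity linking that also flags which mentions need disambiguation. It is not the team's implementation: the gazetteer, its identifiers, and the function name are invented placeholders.

```python
import re

# Hypothetical gazetteer: surface form -> candidate identifiers.
# The identifiers are invented placeholders, not real data-source keys.
GAZETTEER = {
    "aspirin": ["drug:aspirin"],
    "salmon": ["animal:salmon"],
    "georgia": ["place:georgia-us-state", "place:georgia-country"],
}


def link_entities(text, gazetteer=GAZETTEER):
    """Find gazetteer terms in text and report candidate links for each.

    Returns a list of (start, end, surface, candidates, needs_disambiguation).
    """
    # Longest terms first so that longer names win over their substrings.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    mentions = []
    for match in pattern.finditer(text):
        candidates = gazetteer[match.group(1).lower()]
        mentions.append(
            (match.start(), match.end(), match.group(1), candidates, len(candidates) > 1)
        )
    return mentions


sample = "Salmon shipped from Georgia must not be treated with aspirin."
for mention in link_entities(sample):
    print(mention)
```

A mention with more than one candidate (here, "Georgia") is the kind of entity that would need a disambiguation step before it can be linked to a single data source.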



US Code Definition Improvement

Khaleel Khaleel, Pracheth Javali, Ria Mirchandani, and Yashaswini Papanna took on the task of adapting our CFR definition extraction and markup software to meet the unique requirements of the US Code. In addition to learning the hierarchical structure and identifier schemes within the US Code corpus, the project involved discovering and extracting definition patterns that had not before been identified; parsing multiple defined terms, word roots, and abbreviations from individual definitions; and correctly detecting the boundaries of the definitions.





Search Prototype

Anusha Morappanavar, Deekshith Belchapadda, and Monisha Pavagada Chandrashekar built a prototype of the semantic search application using ElasticSearch and Flask. In addition to learning how to work with ElasticSearch, they had to learn the hierarchical structure of the US Code and CFR, understand how cross-references work within legal corpora, and make use of additional metadata such as the definitions and linked entities the other groups had been working on. Their work will support a search application that distinguishes matches in which the search term is defined, appears in the full text, or appears in a definition of a term that appears within the full text of a document.
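The three kinds of match can be sketched in a few lines of plain Python. This is only an illustration of the distinction, not the ElasticSearch prototype itself; the document layout, field names, and labels below are assumptions made for the example.

```python
def classify_matches(query, document):
    """Return the set of ways `query` matches `document`.

    `document` is a hypothetical dict with:
      "text":        the full text of the section
      "defines":     {term: definition} for terms this section defines
      "definitions": {term: definition} for defined terms relevant to the section
    """
    q = query.lower()
    text = document["text"].lower()
    kinds = set()
    if any(q == term.lower() for term in document["defines"]):
        kinds.add("defined_here")
    if q in text:
        kinds.add("full_text")
    for term, definition in document["definitions"].items():
        # Only count definitions of terms that actually occur in the text.
        if term.lower() in text and q in definition.lower():
            kinds.add("in_definition_of_used_term")
    return kinds


# Invented sample section, for illustration only.
section = {
    "text": "No person may operate a vessel without a permit.",
    "defines": {"permit": "a written authorization issued under this part"},
    "definitions": {
        "vessel": "any watercraft, including a barge or raft",
        "permit": "a written authorization issued under this part",
    },
}

print(classify_matches("permit", section))
print(classify_matches("vessel", section))
print(classify_matches("barge", section))
```

The sketch uses naive substring matching, so a query like "raft" would also match inside "watercraft"; a real search engine would tokenize the text first.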


We’ll be rolling out the features supported by this semester’s M.Eng. projects starting with entity linking in January.

“There is nothing like looking, if you want to find something.”

-J.R.R. Tolkien


This summer, at close to the very last minute, I set out for Cambridge, Massachusetts to pursue a peculiar quest for open access to law. Steering clear of the dragon on its pile of gold, I found some very interesting people in a library doing something in some ways parallel, and in many ways complementary, to what we do at LII.

At the Harvard Law School Library, there’s a group called the Library Innovation Lab, which uses technology to improve preservation and public access to library materials, including digitizing large corpora of legal documents. It is a project which complements what we do at the LII, and I went there to develop some tools that would be of help to us both and to others.

The LIL summer fellowship program that made this possible brought together a group with wide-ranging interests: from substantive projects, such as Neel Agrawal’s website on the history of African drumming laws and Muira McCammon’s research on the Guantanamo detainee library, to crowdsourced documentation and preservation projects such as Tiffany Tseng’s Spin and Pix devices, Alexander Nwala’s Local Memory and Ilya Kreymer’s Webrecorder, to infrastructure projects such as Jay Edwards’s Caselaw Access Project API.

My project involved work on a data model to help developers make connections between siloed collections of text and metadata — which will hopefully help future developers to automate the process of connecting concepts in online legal corpora (both that at the LIL and ours at LII) to enriching data and context from multiple different sources.

The work involved exploring a somewhat larger-than-usual number of ontologies, structured vocabularies, and topic models. Each, in turn, came with one or more sets of subjects. Some (like Eurovoc and the topic models) came with sizable amounts of machine-readable text; others (like Linked Data For Libraries) came with very little machine-accessible text. As my understanding of the manageable as well as the insurmountable challenges associated with each one increased, I developed a far greater appreciation for the intuition that had led me to the project all along: there is a lot of useful information locked in these resources; each has a role to play.

In the process, I drew enormous inspiration from the dedication and creativity of the LIL group, from Paul Deschner’s Haystacks project, which provides a set of filters to create a manageable list of books on any subject or search term, to Brett Johnson’s work supporting the H2O open textbook platform, to Matt Phillips’s exploration of private talking spaces, to the Caselaw Access Project visualizations such as Anastasia Aisman’s topic mapping and Jack Cushman’s word clouds (supported by operational, programming, and metadata work from Kerri Fleming, Andy Silva, Ben Steinberg, and Steve Chapman). (All of this is thanks to the Harvard Law Library leadership of Jonathan Zittrain, LIL founder Kim Dulin, managing director Adam Ziegler, and library director Jocelyn Kennedy.)

And back again…

Returning home to LII, I’m grateful to have the rejuvenating energy that arises from talking to new people, observing how other high-performing groups do their work, and having had the dedicated time to bring a complicated idea to fruition. All in all, it was a marvelous summer with marvelous people. But they did keep looking at me as if to ask why I’d brought along thirteen dwarfs, and how I managed to vanish any time I put that gold ring on my finger.

I just got back from the 2016 CALI conference at the Georgia State University College of Law in Atlanta, Georgia. This report of my time there is by no means an exhaustive or even chronological record of the conference. It's more of a highlight reel.

CALI 2016 Banner: The year of learning dangerously

This was my second time attending and it still holds the title as my favorite conference. The food was great, the talks were excellent and there was a lot of time between sessions to have interesting conversations with many of the diverse and smart attendees who came from all over North America. Kudos to the organizers.

The conference officially started on Thursday, June 16th, when Indiana Jones, aka John Mayer, executive director of CALI, found the golden plaque of CALI after a harrowing traversal of the conference room, dodging obstacles. He gave a brief but warm welcome address and introduced the keynote speaker, Hugh McGuire, founder of PressBooks. With anecdotes from his biography, Mr. McGuire encouraged us to be proactive in solving big problems.

We had another keynote speaker on Friday, Michael Feldstein of Mindwires Consulting and co-producer of e-Literate TV.

Question: To what extent is your institution a school, versus a filtering mechanism tied to a self-study center?

He confessed to being something of a provocateur and succeeded in raising a few hackles when he asked, "Do law schools exist?" among other questions.

He then challenged us to do better at teaching students with different learning styles and skill-sets.

My two favorite presentations out of many excellent sessions were "The WeCite Project" by Pablo Arredondo from Casetext and "So you've digitized U.S. caselaw, now what?" by Adam Ziegler and Jack Cushman from the Harvard Library Innovation Lab.

Pablo described teaching students to be their own legal shepherds by gamifying the creation and categorization of citator entries. The result of this effort is a database of every outgoing citation from the last 20 years of Supreme Court majority opinions and federal appellate courts, unambiguously labelled either as a positive, referencing, distinguishing, or negative citation. This data will be hosted by us (LII) and made freely available without restriction. In addition to the valuable data, he also shared how to engage students, librarians and research instructors as partners in the free law movement.

After a brief presentation of some of the ways they are beginning to use data from all the digitized case law, Adam and Jack invited us to imagine what we could do with the data. I can see possibilities for topic modeling, discovery of multi-faceted relationships between cases, and mapping of changes in contract conditions. Many more features, tools and use cases were suggested by the other attendees. We welcome you to send us your personal wish list for features to make this information useful to you.

I also participated in a panel discussion on software management of large digital archives, moderated by Wilhelmina Randtke (Florida Academic Library Services Cooperative), along with Jack Cushman and Wei Fang (Assistant Dean for Information Technology and Head of Digital Services, Rutgers Law Library).

There was so much interest in the Oyez Project moving to the LII that Craig's presentation on LII's use of web analytics was replaced by a discussion hosted by Craig and Tim Stanley (Justia) on the transition. The rather lively discussion was made all the more entertaining by an impromptu costume change by Craig. The prevailing sentiment after the discussion was that the Oyez Project was in the best possible hands and 'safe'.

An unexpected bonus was the number of LII users who made it a point to compliment the LII and express how useful they find our services. One particularly enthusiastic fan was DeAnna Swearington, Director of Operations at a provider of learning tools for law students. I also met Wilson Tsu, CEO of LearnLeo and a Cornell alum, who had fond memories of when the LII first started. There were also several former law students who told me how invaluable the LII collections had been to them in school and continue to be in their current occupations.

All in all, a successful and enlightening conference. A big thank you to the organizers. They did an excellent job. I am already looking forward to next year!

In 2009, the US House of Representatives agreed to a resolution designating March 14 as “Pi Day”.

So I thought I’d see whether I could rustle up a nice example of pi in the Code of Federal Regulations. I did manage to find the π I was looking for, but it turns out that pi also gives us some examples of why we need to disambiguate.

In the process I also found:

So the story of “pi” in the CFR is a story of disambiguation. When we read (or our software parses) the original text of a particular regulation, it’s reasonably straightforward to tell which of the many “pi”s it’s discussing. We have capitalization, punctuation, formatting, and structural and contextual clues. Once the search engines work their case-folding, acronym normalization, and other magic, we end up with an awful lot of “pi”s and very little context.
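A minimal sketch of how that context gets lost, assuming a search engine that strips the periods from acronyms and then case-folds. The sample tokens and the `normalize` function are invented for illustration; they are not drawn from the CFR or from any particular search engine.

```python
from collections import defaultdict

# Hypothetical tokens as they might appear in source text; not drawn from the CFR.
tokens = ["pi", "Pi", "PI", "P.I.", "π"]


def normalize(token):
    """Mimic a search engine's acronym normalization and case folding."""
    return token.replace(".", "").casefold()


# Group the original tokens by the form the search index would store.
index = defaultdict(list)
for token in tokens:
    index[normalize(token)].append(token)

print(dict(index))
```

Four differently written tokens collapse into the single index key "pi". In this sketch the Greek letter survives as its own key, but an engine that also maps symbols to their names would fold it in as well.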

Postscript. The real pi was a bit of a letdown: “P = pi (3.14)” – well, close enough.

“Lexicographer: A writer of dictionaries; a harmless drudge, that busies himself in tracing the original, and detailing the signification of words.” – Samuel Johnson

One of our ongoing projects here at LII is to include definitions in our electronic texts. We’ve had them in the UCC since the mid-1990s; we put them in the CFR in its earlier version a few years ago. Now, as part of the process of rolling out the eCFR and migrating all our old features to it, we’re working on adding definitions to it as well. And just to make life more complicated, we decided at the same time to ask a team of intrepid M.Eng. students (more about them in a subsequent post) to work on the next generation of enhancements to the feature.

Adding definitions to text is a tricky process, so we’re going to milk it for several blog posts, or rather, take a deliberate approach to spelling out how this works. For starters, let’s look today at why this is a difficult task: why it was hard before, and why it’s hard now.

What we aim for is simple: we want every section of the CFR to have all the key legal terms highlighted. Users could click on each defined term, and they would see how the CFR itself defines those terms. (As you probably know, regulators are famous for defining terms in what we might generously call idiosyncratic ways.) But since the CFR is about a gazillion pages long, there’s no way we can go through and do all those by hand. It needed to be automated.

Which instantly plunged us into the icy waters of natural language processing (or sort-of-natural language processing, a similar if somewhat understudied field).

Definitions in the CFR are not always conveniently labeled; they do not always say something like “we define the term ‘brontosaurus’ to mean ‘a small plunger to be used on miniature toilets’.” They are tricky to pick out: perhaps not tricky for a human reader (although they can be that, too), but certainly tricky for a Turing Machine, even one with its best tuxedo on and its shoes shined.

Consider the following passage:

“Inspected and passed” or “U.S. Inspected and Passed” or “U.S. Inspected and Passed by Department of Agriculture” (or any authorized abbreviation thereof). This term means that the product so identified has been inspected and passed under the regulations in this subchapter, and at the time it was inspected, passed, and identified, it was found to be not adulterated.

We need Tuxedoed Turing to figure out that the three terms in quotations are all equivalent, and all have the same definition. We need it to figure out if an “authorized abbreviation” is being used. We need it to figure out which subchapter this applies to, and not accidentally apply it too narrowly or too widely. And so forth.

And we need it to figure out all sorts of similar problems that other definitions might have: what term precisely is being defined, which words are the definition, what text the definition applies to.
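To give a flavor of the easy part, here is a minimal sketch that handles exactly one pattern: a run of quoted terms followed by “This term means”. It is an illustration, not our production extractor. The function and constant names are invented, and the sketch ignores the harder questions above, such as authorized abbreviations and which subchapter the definition applies to.

```python
import re

PASSAGE = (
    "“Inspected and passed” or “U.S. Inspected and Passed” or "
    "“U.S. Inspected and Passed by Department of Agriculture” "
    "(or any authorized abbreviation thereof). This term means that the "
    "product so identified has been inspected and passed under the "
    "regulations in this subchapter, and at the time it was inspected, "
    "passed, and identified, it was found to be not adulterated."
)

# Text between curly double quotes.
QUOTED = re.compile(r"“([^”]+)”")
# The phrase that separates the defined terms from the definition.
MEANS = re.compile(r"\bThis term means\b", re.IGNORECASE)


def extract_definition(passage):
    """Split a 'terms ... This term means ...' passage into terms and definition.

    Returns (terms, definition), or None when the pattern is absent.
    """
    marker = MEANS.search(passage)
    if marker is None:
        return None
    # Quoted strings before the marker are treated as equivalent defined terms.
    terms = QUOTED.findall(passage[:marker.start()])
    if not terms:
        return None
    definition = passage[marker.end():].strip().rstrip(".")
    return terms, definition


terms, definition = extract_definition(PASSAGE)
print(terms)
print(definition)
```

Every other phrasing a regulator might use (“means”, “shall mean”, “refers to”, a definition buried mid-sentence) would need its own pattern, which is why this does not scale by hand.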

And we made a lot of strides in that direction. The definitions ain’t perfect, but they’re there, in the CFR. Yay, us!

Except, uh, then we had to start all over again. For the eCFR. This time, with shiny new bumpers and a fresh paint job.

Tune in next week for the next part of the story.

It will not have escaped the attention of our regular readers — both of them, as the old joke goes — that things have been quiet around here under the hood. It’s not that we haven’t been working; it’s that we’ve been too deep in elbow grease to type it up for you. We tried, but the third computer we ruined with dripping engine fluids caused a bit of a ruckus. So it’s been quiet.

But we just went at our hands with a brillo pad and fire hose, so we think it’s safe for the moment. And it’s New Year’s Eve, time for resolutions. So we are going to redouble our efforts to give the world a peek at what we’re doing, and what you’ve missed while we’ve been distracted by that dang fuel carburetor.

As a preview, and a public resolution to which we can point when our boss asks why we’re despoiling yet another keyboard, here are some of the things we’re going to be telling you about in the coming weeks.

  • Students’ work on definitions, entity linking, topic modeling, and a sub-site refresh
  • Citation resolution (turns out not every citation is to Marbury v. Madison; who knew?)
  • Semantic web-based feature development

And that’s not all. For we remain hard at work, and new things are constantly in the works, and we’ll tell you about those as well. So watch this space!

But, uh, those are nice pants. You might want to slip on these coveralls first.