Legal XML » VoxPopuLII

Rough Consensus, Running Standards: The Restatement Project

Annotation of legal texts, Electronic legal publishing, Legal argument, Legal informatics, legal language, Legal text processing, Legal XML

Jun 03 2014

1. The Death and Life of Great Legal Data Standards

Thanks to the many efforts of the open government movement in the past decade, the benefits of machine-readable legal data — legal data which can be processed and easily interpreted by computers — are now widely understood. In the world of government statutes and reports, machine-readability would significantly enhance public transparency, help to increase efficiencies in providing services to the public, and make it possible for innovators to develop third-party services that enhance civic life.

In the universe of private legal data — that of contracts, briefs, and memos — machine-readability would open up vast potential efficiencies within the law firm context, allow the development of novel approaches to processing the law, and would help to drive down the costs of providing legal services.

However, while the benefits are understood, by and large the vision of rendering the vast majority of legal documents into a machine-readable standard has not been realized. While projects do exist to acquire and release statutory language in a machine-readable format (and the government has backed similar initiatives), the vast body of contractual language and other private legal documents remains trapped in a closed universe of hard copies, PDFs, unstructured plaintext and Microsoft Word files.

Though this is a relatively technical point, it has broad policy implications for society at large. Perhaps the biggest upshot is that machine-readability promises to vastly improve access to the legal system, not only for those seeking legal services, but also for those seeking to provide them.

It is not for lack of a standard specification that the status quo exists. Indeed, projects like LegalXML have developed specifications that describe a machine-readable markup for a vast range of different types of legal documents. As of writing, the project includes technical committees working on legislative documents, contracts, court filings, citations, and more.

However, by and large these efforts to develop machine-readable specifications for legal data have only solved part of the problem. Creating the standard is one thing, but actually driving adoption of a legal data standard is another (often more difficult) matter. There are a number of reasons why existing standards have failed to gain traction among the creators of legal data.

For one, the oft-cited aversion of lawyers to technology remains a relevant factor. Particularly in the case of the standardization of legal data, where the projected benefits lie in the future and their magnitude is speculative at present, persuading lawyers and legislatures to adopt a new standard remains a challenge, at best.

Second, the financial incentives of some actors may actually be opposed to rendering the universe of legal documents into a machine-readable standard. A universe of largely machine-readable legal documents would also be one in which it may be possible for third parties to develop systems that automate and significantly streamline legal services. In the context of the ever-present billable hour, parties may resist the introduction of technological shifts that enable these efficiencies to emerge.

Third, the costs of converting existing legal data into a machine-readable standard may also pose a significant barrier to adoption. Marking up unstructured legal text can be highly costly, depending on the intended machine usage of the document and the type of document in question. A legislature, firm, or other organization with a large existing repository of legal documents would have to take on large one-time costs to render those documents into a common standard, which further discourages adoption.

These three reinforcing forces erect a significant cultural and economic barrier against the integration of machine-readable standards into the production of legal text. To the extent that one believes in the benefits from standardization for the legal industry and society at large, the issue is — in fact — not how to define a standard, but how to establish one.

2. Rough Consensus, Running Standards

So, how might one go about promulgating a standard? Particularly in a world in which lawyers, the very actors that produce the bulk of legal data, are resistant to change, mere attempts to mobilize the legal community to action are destined to fail in bringing about the fundamental shift necessary to render most if not all legal documents in a common machine-readable format.

In such a context, implementing a standard in a way that removes humans from the loop entirely may, in fact, be more effective. To do so, one might design code capable of automatically rendering legal text into a machine-readable format. This code could then be implemented by applications of all kinds, which would output legal documents in a standard format by default. This would include the word processors used by lawyers, but also integration with platforms like LegalZoom or RocketLawyer that routinely generate large quantities of legal data. Such a solution would remove the need for lawyer involvement in implementing a standard entirely: any text created would be automatically parsed and output in a machine-readable format. Scripts might also be written to identify legal documents online and process them into a common format. As the body of documents rendered in a given format grew, it would be possible for others to write software leveraging the increased penetration of the standard.

There are, obviously, technical limitations in realizing this vision of a generalized legal data parser. For one, designing a truly comprehensive parser is a massively difficult computer science challenge. Legal documents come in a vast diversity of flavors, and no common textual conventions allow for perfectly accurate parsing of the semantic content of any given legal text. Quite simply, any parser will be an imperfect (perhaps highly imperfect) approximation of full machine-readability.

Despite the lack of a perfect solution, an open question exists as to whether or not an extremely rough parsing system, implemented at sufficient scale, would be enough to kickstart the creation of a true common standard for legal text. A popular solution, however imperfect, would encourage others to implement nuances to the code. It would also encourage the design of applications for documents rendered in the standard. Beginning from the roughest of parsers, a functional standard might become the platform for a much bigger change in the nature of legal documents. The key is to achieve the “minimal viable standard” that will begin the snowball rolling down the hill: the point at which the parser is rendering sufficient legal documents in a common format that additional value can be created by improving the parser and applying it to an ever broader scope of legal data.

But, what is the critical mass of documents one might need? How effective would the parser need to be in order to achieve the initial wave of adoption? Discovering this, and learning whether or not such a strategy would be effective, is at the heart of the Restatement project.

3. Introducing Project Restatement

Supported by a grant from the Knight Foundation Prototype Fund, Restatement is a simple, rough-and-ready system which automatically parses legal text into a basic machine-readable JSON format. It has also been released under the permissive terms of the MIT License, to encourage active experimentation and implementation.

The concept is to develop an easily-extensible system which parses through legal text and looks for some common features to render into a standard format. Our general design principle in developing the parser was to begin with only the most simple features common to nearly all legal documents. This includes the parsing of headers, section information, and “blanks” for inputs in legal documents like contracts. As a demonstration of the potential application of Restatement, we’re also designing a viewer that takes documents rendered in the Restatement format and displays them in a simple, beautiful, web-readable version.
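To make the idea of a rough-and-ready parser concrete, here is a minimal sketch of how headers, numbered sections, and blanks might be pulled out of plain text and emitted as JSON. It is written in Python for brevity rather than in the project's own JavaScript, and everything in it is an assumption made for illustration: the regular expressions, the field names (`header`, `sections`, `blanks`), and the sample contract are hypothetical and are not Restatement's actual grammar or output format.

```python
import json
import re

# Hypothetical patterns chosen for this sketch; they are not Restatement's grammar.
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.\s+(.*)$")  # e.g. "1. Parties"
BLANK_RE = re.compile(r"_{3,}|\[[^\]]*\]")              # e.g. "____" or "[Name]"


def parse_legal_text(text):
    """Split plain text into header lines and numbered sections with blanks."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    document = {"header": [], "sections": []}
    for line in lines:
        match = SECTION_RE.match(line)
        if match:
            # A numbered heading opens a new section.
            number, heading = match.groups()
            document["sections"].append(
                {"number": number, "heading": heading, "text": [], "blanks": []}
            )
        elif not document["sections"]:
            # Anything before the first numbered section counts as header text.
            document["header"].append(line)
        else:
            # Body text belongs to the most recently opened section.
            section = document["sections"][-1]
            section["text"].append(line)
            section["blanks"].extend(BLANK_RE.findall(line))
    return document


# Invented sample input, used only to exercise the sketch.
SAMPLE = """CONSULTING AGREEMENT
1. Parties
This agreement is between [Company Name] and ________.
2. Term
The term begins on [Start Date].
"""

document = parse_legal_text(SAMPLE)
print(json.dumps(document, indent=2))
```

A parser this crude will misread plenty of real documents (a body line that happens to start with "3. " would open a new section, for instance), which is exactly the kind of imperfection the "minimal viable standard" argument accepts as a starting point.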

Underneath the hood, Restatement is all built upon web technology. This was a deliberate choice, as Restatement aims to provide a usable alternative to document formats like PDF and Microsoft Word. We want to make it easy for developers to write software that displays and modifies legal documents in the browser.

In particular, Restatement is built entirely in JavaScript. The past few years have been exciting for the JavaScript community. We’ve seen an incredible flourishing of not only new projects built on JavaScript, but also new tools for building cool new things with JavaScript. It seemed clear to us that it’s the platform to build on right now, so we wrote the Restatement parser and viewer in JavaScript, and made the Restatement format itself a type of JSON (JavaScript Object Notation) document.

For those who are more technically inclined, we also knew that Restatement needed a parser formalism, that is, a precise way to define how plain text gets transformed into the Restatement format. We became interested in a recent advance in parsing technology called PEG (Parsing Expression Grammar).

PEG parsers are different from other types of parsers; they’re unambiguous. That means that plain text passing through a PEG parser has only one possible valid parsed output. We became excited about using the deterministic property of PEG to mix parsing rules and code, and that’s when we found peg.js.

With peg.js, we can generate a grammar that executes JavaScript code as it parses your document. This hybrid approach is super powerful. It allows us to have all of the advantages of using a parser formalism (like speed and unambiguity) while also allowing us to run custom JavaScript code on each bit of your document as it parses. That way we can use an external library, like the Sunlight Foundation’s fantastic citation, from inside the parser.
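The two properties described here, unambiguous parsing and code that runs as each rule matches, can be illustrated without peg.js itself. The sketch below is a hand-rolled Python approximation, not peg.js and not Restatement's code: `token` and `choice` are hypothetical helper names, and the "Section" reference rules are invented for the example. `choice` implements PEG-style ordered choice (the first alternative that matches wins, so there is never more than one parse), and each `token` carries an action function that plays the role of the embedded JavaScript.

```python
import re


def token(pattern, action=lambda text: text):
    """Match a regex at the current position, then run `action` on the match."""
    compiled = re.compile(pattern)

    def parse(text, pos):
        match = compiled.match(text, pos)
        if match is None:
            return None
        # The action is ordinary code run on the matched text, in the same
        # spirit as code embedded in a grammar rule.
        return action(match.group(0)), match.end()

    return parse


def choice(*parsers):
    """Ordered choice: try alternatives in order and commit to the first hit."""

    def parse(text, pos):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result
        return None

    return parse


# Invented rules: the more specific alternative is listed first, so the text
# "Section 12(a)" can only ever be parsed one way.
section_ref = choice(
    token(r"Section \d+\([a-z]\)", lambda t: {"type": "subsection", "ref": t[8:]}),
    token(r"Section \d+", lambda t: {"type": "section", "ref": t[8:]}),
)

print(section_ref("See Section 12(a)", 4))
print(section_ref("Section 7 applies", 0))
```

Because the alternatives are tried in a fixed order, the order itself is part of the grammar: swapping the two rules would make the second one unreachable for subsection references, which is the usual trade-off of ordered choice.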

Our next step is to prototype an “interactive parser,” a tool for attorneys to define the structure of their documents and see how they parse. Behind the scenes, this interactive parser will generate peg.js programs and run them against plaintext without the user even being aware of how the underlying parser is written. We hope that this approach will provide users with the right balance of power and usability.

4. Moving Forwards

Restatement is going fully operational in June 2014. After launch, the two remaining challenges are to (a) continue expanding the range of legal document features the parser will be able to successfully process, and (b) begin widely processing legal documents into the Restatement format.

For the first, we’re encouraging a community of legal technologists to play around with Restatement, break it as much as possible, and give us feedback. Running Restatement against a host of different legal documents and seeing where it fails will expose the areas where the parser must be bolstered to extend its applicability as far as possible.

For the second, Restatement will be rendering popular legal documents in the format, and partnering with platforms to integrate Restatement into the legal content they produce. We’re excited to say that on launch Restatement will be releasing the standard form documents used by the startup accelerator Y Combinator, and Series Seed, an open source project around seed financing created by Fenwick & West.

It is worth adding that the Restatement team is always looking for collaborators. If what’s been described here interests you, please drop us a line! I’m available at tim@robotandhwang.org, and on Twitter @RobotandHwang.

Jason Boehmig is a corporate attorney at Fenwick & West LLP, a law firm specializing in technology and life science matters. His practice focuses on startups and venture capital, with a particular emphasis on early stage issues. He is an active maintainer of the Series Seed Documents, an open source set of equity financing documents. Prior to attending law school, Jason worked for Lehman Brothers, Inc. as an analyst and then as an associate in their Fixed Income Division.

Tim Hwang currently serves as the managing human partner at the offices of Robot, Robot & Hwang LLP. He is curator and chair for the Stanford Center on Legal Informatics FutureLaw 2014 Conference, and organized the New and Emerging Legal Infrastructures Conference (NELIC) at Berkeley Law in 2010. He is also the founder of the Awesome Foundation for the Arts and Sciences, a distributed, worldwide philanthropic organization founded to provide lightweight grants to projects that forward the interest of awesomeness in the universe. Previously, he has worked at the Berkman Center for Internet and Society at Harvard University, Creative Commons, Mozilla Foundation, and the Electronic Frontier Foundation. For his work, he has appeared in the New York Times, Forbes, Wired Magazine, the Washington Post, the Atlantic Monthly, Fast Company, and the Wall Street Journal, among others. He enjoys ice cream.

Paul Sawaya is a software developer currently working on Restatement, an open source toolkit to parse, manipulate, and publish legal documents on the web. He previously worked on identity at Mozilla, and studied computer science at Hampshire College.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

AT4AM: the XML web editor used by Members of European Parliament

Electronic government, European Union, Legal XML, Legislative information systems, open source software

Aug 15 2013

AT4AM (Authoring Tool for Amendments) is a web editor provided to Members of the European Parliament (MEPs) that has greatly improved the drafting of amendments at the European Parliament since its introduction in 2010.

The tool, developed by the Directorate for Innovation and Technological Support of the European Parliament (DG ITEC), has replaced a system based on a collection of macros developed in MS Word and specific ad hoc templates.

Why move to a web editor?

The need to replace a traditional desktop authoring tool came from the increasing complexity of layout rules combined with a need to automate several processes of the authoring/checking/translation/distribution chain.

In fact, drafters not only faced complex rules and had to search among hundreds of templates in order to get the right one, but the drafting chain for all amendments relied on layout to transmit information down the different processes. Bold / Italic notation or specific tags were used to transmit specific information on the meaning of the text between the services in charge of subsequent revision and translation.

Over the years, an editor that was initially conceived mainly to support the printing of documents was often used to convey information in an unsuitable manner. During the drafting activity, documents transmitted between different services included a mix of content and layout, where the layout sometimes carried information about the business process that should rather have been transmitted by other means.

Moreover, encapsulating in one single file all the amendments drafted in 23 languages was a severe limitation for subsequent revisions and translations carried out by linguistic sectors. Experts in charge of legal and linguistic revision of drafted amendments, who need to work in parallel on one document grouping multilingual amendments, were severely hampered in their work.

All the needs listed above justified the EP undertaking a new project to improve the drafting of amendments. The concept was soon extended to the drafting, revision, translation and distribution of the entire legislative content in the European Parliament, and after some months the eParliament Programme was initiated to cover all projects of the parliamentary XML-based drafting chain.

It was clear from the beginning that, in order to provide an advanced web editor, the original proposal to be amended had to be converted into a structured format. After an extensive search, the Akoma Ntoso XML format was chosen, because it is the format that best covers the requirements for drafting legislation. Currently it is possible to export amendments produced via AT4AM in Akoma Ntoso. It is planned to apply the Akoma Ntoso schema to the entire legislative chain within the eParliament Programme. This will enable the EP to publish legislative texts in an open data format.

What distinguishes the approach taken by the EP from that of other legislative actors who handle XML documents is that the EP decided to use XML to feed the legislative chain rather than just converting existing documents into XML for distribution. This aspect is fundamental because requirements are much stricter when the result of XML conversion is used as the first step of the legislative chain. In fact, the proposal coming from the European Commission is first converted into XML and then loaded into AT4AM. Because the tool relies on the XML content, it is important to guarantee a valid structure and coherence between the language versions. The same articles, paragraphs, points and subpoints must appear at the correct position in all 23 language versions of the same text.
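The coherence requirement between language versions lends itself to an automated check. The following is a minimal sketch in Python, not AT4AM's implementation: it uses a deliberately simplified, hypothetical markup (bare `article`, `paragraph`, `point` and `subpoint` elements with a `num` attribute) rather than schema-valid Akoma Ntoso, and the function names are invented. It reduces each language version to a structural outline and reports the versions whose outline differs from the first one.

```python
import xml.etree.ElementTree as ET

# Element names assumed for this sketch; real Akoma Ntoso markup is richer.
STRUCTURAL_TAGS = {"article", "paragraph", "point", "subpoint"}


def skeleton(xml_text):
    """Return structural elements in document order as (depth, tag, num)."""
    root = ET.fromstring(xml_text)
    outline = []

    def walk(element, depth):
        if element.tag in STRUCTURAL_TAGS:
            outline.append((depth, element.tag, element.get("num")))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return outline


def incoherent_versions(versions):
    """Return language codes whose outline differs from the first version."""
    languages = list(versions)
    reference = skeleton(versions[languages[0]])
    return [lang for lang in languages[1:] if skeleton(versions[lang]) != reference]


# Invented sample texts: the German version is missing paragraph 2.
EN = ('<body><article num="1"><paragraph num="1">Text</paragraph>'
      '<paragraph num="2">More</paragraph></article></body>')
FR = ('<body><article num="1"><paragraph num="1">Texte</paragraph>'
      '<paragraph num="2">Suite</paragraph></article></body>')
DE = '<body><article num="1"><paragraph num="1">Text</paragraph></article></body>'

print(incoherent_versions({"en": EN, "fr": FR, "de": DE}))
```

Only the structure is compared, never the wording, so translated content is free to differ while the position of every article and paragraph has to line up.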

What is the situation now?

After two years of intensive usage, Members of the European Parliament have drafted 285,000 amendments via AT4AM. The tool is also used daily by the staff of the secretariat in charge of receiving tabled amendments, checking linguistic and legal accuracy and producing voting lists. Today more than 2,300 users access the system regularly, and no one wants to go back to the traditional methods of drafting. Why?

Because it is much simpler and faster to draft and manage amendments via an editor that takes care of everything, thus allowing drafters to concentrate on their essential activity: modifying the text.

Soon after the introduction of AT4AM, the secretariat’s staff who manage drafted amendments breathed a sigh of relief, because errors like wrong position references, which were the cause of major headaches, no longer occurred.

What is better than a tool that guides drafters through the amending activity by adding all the surrounding information and taking care of all the metadata necessary for subsequent treatment, while letting the drafter focus on the text amendments and produce well-formatted output with track changes?

After some months of usage, it was clear that not only was the time needed to draft, check and translate amendments drastically reduced, but the quality of amendments had also increased.

The slogan that best describes the strength of this XML editor is: “You are always just two clicks away from tabling an amendment!”

Web editor versus desktop editor: is it an acceptable compromise?

One of the criticisms that users often raise against web editors is that they are limited when compared with a traditional desktop rich editor. The experience at the European Parliament has demonstrated that what users lose in terms of editing features is more than compensated for by the gains of getting a tool specifically designed to support drafting activity. Moreover, recent technologies enable programmers to develop rich web WYSIWYG (What You See Is What You Get) editors that include many of the traditional features plus new functions specific to a “networking” tool.

What’s next?

The experience of the EP was so positive and so well received by other parliaments that in May 2012, at the opening of the international workshop “Identifying benefits deriving from the adoption of XML-based chains for drafting legislation”, Vice President Wieland announced the launch of a new project aimed at providing an open source version of the AT4AM code.

On 19 March 2013, in a video conference with the United Nations Department for General Assembly and Conference Management in New York, the UN/DESA’s Africa i-Parliaments Action Plan in Nairobi and the Senate of Italy in Rome, Vice President Wieland announced the availability of AT4AM for All, the name given to this open source version, to any parliament or institution interested in taking advantage of this well-oiled IT tool that has made the life of MEPs much easier.

The code has been released under the EUPL (European Union Public Licence), an open source licence provided by the European Commission that is compatible with major open source licences like GNU GPLv2, with the advantage of being available in the 22 official languages of the European Union.

AT4AM for All is provided with all the important features of the amendment tool used in the European Parliament and can manage all types of legislative content provided in the XML format Akoma Ntoso. This XML standard, developed through the UN/DESA’s initiative Africa i-Parliaments Action Plan, is currently undergoing certification at OASIS, a non-profit consortium that drives the development, convergence and adoption of open standards for the global information society. Those who are interested may have a look at the committee in charge of the certification: LegalDocumentML.

Currently the Documentation Division, Department for General Assembly and Conference Management of United Nations is evaluating the software for possible integration in their tools to manage UN resolutions.

The ambition of the EP is that other parliaments with fewer resources may take advantage of this development to improve their legislative drafting chain. Moreover, the adoption of such tools allows a parliament to move towards an XML-based legislative chain. The distribution of legislative content in open document formats like XML allows other parties to process the legislation produced efficiently.

Thanks to the efforts of European Parliament, any parliament in the world is now able to use the advanced features of AT4AM to support the drafting of amendments. AT4AM will serve as a useful tool for all those interested in moving towards open data solutions and more democratic transparency in the legislative process.

On the AT4AM for All website it is possible to check the status of the work and run a sample editor with several document types. Any parliament interested can go to the repository and download the code.

Claudio Fabiani is Project Manager at the Directorate-General for Innovation and Technological Support of the European Parliament. After several years of experience in the private sector as an IT consultant, he started his career as a civil servant at the European Commission in 2001, where he managed several IT developments. Since 2008 he has been responsible for the AT4AM project, and more recently he has managed the implementation of AT4AM for All, the open source version.


Standardizing the World’s Legislative Information—One hackathon at a time

Annotation of legal texts, Cross-language legal information retrieval, digital law, Electronic government, elegislation, Legal metadata, Legal XML, Open Government Data, Semantic annotation of legal texts, Standards

Sep 17 2012

As guest bloggers to this site, we have been asked to write about big ideas. We’ll get to those. But first, a note about hackathons.

Could legal hackathons be like this one day?

Hackathons used to be the exclusive domain of soda-and-coffee-guzzling, pizza-eating, all-night hacking, highly competitive computer programmers. The result of such a hackathon is often supposed to be a cool app (like the forerunner of Twitter) that is even cooler because it was built in the compressed schedule of the event. More recently, hackathons have been popping up in a variety of places, with some unexpected contexts and sponsors including the U.S. House of Representatives, NASA, Brooklyn Law School, New York City government, and others. These events serve as a way to prove (or build) the sponsor’s tech credentials and to cross-fertilize policy and technology expertise. There has been some handwringing and thoughtful commentary about the expansion of “civic” hackathons and what sustainable outcomes they produce.

As co-organizers, with Karen Suhaka, Greg Wilson, Charles Belle and others, of two legislative-focused, hackathon-inspired events (the California Law Hackathon and the International Legislation Un-hackathon), we can attest to their value in bringing engineers together with lawyers and policy folks. We can give some insights into the kinds of benefits these events have had in propelling efforts on legislative data standards, and some of the advances that have taken place in the development of these standards over the last year.

Big Idea: Legislative Data Standards

And now to the big idea: to represent all the world’s legislation in a standard structured data format. That’s actually two big ideas: (1) putting legislation into a structured data format, and (2) designing that format so that it is compatible with the wide variety of laws and legislative document types worldwide.

There are reasons for doing these things: First, introducing structured data to legislation can make it possible to search and analyze the law with greater precision and efficiency. And second, having a common standard can permit more comprehensive bill-tracking and comparison between jurisdictions.

California Bill with Metadata

It also can make it possible for legislatures with small (and shrinking) budgets to benefit from some of the same bill drafting software that is being developed for much larger jurisdictions. (Full disclosure: Xcential has developed such software for more than ten years, including the drafting platform used by the State of California.)

In the age of Google, these ideas may not seem so big; in fact, they are a subset of Google’s far-reaching mission. However, legislation is a corner of the world’s information that Google has not yet addressed in a systematic way. And as regular readers of this blog know, legislation presents its own hurdles, technical and bureaucratic (not necessarily in that order), that make this both an interesting and a challenging problem. One of the challenges is that the kind of people who generally work with data (we’ll call them engineers) and the kind of people who generally work with legislation (we’ll call them lawyers and policy folks) don’t often work on data and legislation together. One of us, a lawyer and policy type, has made this point graphically (and somewhat hyperbolically) in a Quora response to a question about whether version control software could be used for legislation. That question, and a subsequent discussion generated in response to a blogpost by software engineer Abe Voelker about version control for legislation, drew in many engineers and some lawyers and policy folks.

For software engineers who consider such things, it is very attractive to think about treating legal text as if it were software code; we could automatically highlight and cross-validate key terms, run test cases, automate redlining and version control, etc. It would be easy to see what the state of the law was at any particular point in time, and to trace the series of amendments that got us into the mess we’re in today. This desire is often expressed as “What if we had a Github for legislation?” On the other hand, people who work closely with legislation (researching it, drafting it or developing information systems to deal with it) tend to see the many places where the analogy between computer code and legal code breaks down. Legal texts have been shaped over hundreds of years by technologically conservative institutions, using print-based systems.
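As a small illustration of the engineers' side of this analogy, automated redlining is only a few lines once the text is available as clean plaintext. The sketch below uses Python's standard `difflib` module; the `redline` function name, the `[-...-]` and `{+...+}` markers and the sample sentences are assumptions made for this example, not a convention used by any legislature or by the tools discussed in this post.

```python
import difflib


def redline(old_text, new_text):
    """Word-level redline: [-deleted-] and {+inserted+} markers."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
        if tag in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


# Invented before-and-after versions of a single provision.
OLD = "Payment is due within 30 days"
NEW = "Payment is due within 60 days"

print(redline(OLD, NEW))  # Payment is due within [-30-] {+60+} days
```

The hard part is not the diff: real amendments often describe changes in prose ("strike paragraph (2) and insert...") rather than supplying a new version of the text, which is one of the places where the analogy with source code starts to break down.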

The full transformation of law to digital information is not going to happen overnight. While most law is already accessible in electronic format (often PDF), it is not encoded in a way that would let software engineers start using their favorite text-munching tools. One of us, an engineer, has described this as the difference between computerization and automation. The move toward better digital tools for automating legislative drafting and research tasks will require more dialogue and working exchanges between engineers and the lawyers and policy folks.

That brings us back to hackathons.

What is a Legislative Hackathon?

Recognizing the need to bring lawyers and engineers together in order to implement our big idea(s), and appreciating the valuable bandwagon that hackathons have become, we decided to jump onboard. The first event we organized, the California Law Hackathon, was hosted just over a year ago, in September 2011, in Berkeley at the offices of Maplight, and in Denver by Karen Suhaka’s team at BillTrack50. The event focused on building web-based visualization tools to track the timeline of amendments to California legislation, and to link particular amendments, through their legislative sponsors, to particular donors or interest groups. We were joined remotely by a number of international participants, including John Sheridan, head of e-services for legislation.gov.uk, and a fellow guest contributor to this blog. As one participant noted, we learned a great deal at the event, including the limits placed on us by the existing data. Neither the legislative record, nor the donations databases are detailed enough to trace influence in politics in the way we hoped. This helped spark an interest in a more in-depth exploration of legislative data formats, and in particular how more and better metadata could be added to legislation.

That led to the International Legislation Un-hackathon, held simultaneously at UC Hastings, at Stanford and in Denver, with participants from the University of Bologna (Ravenna campus) and around the world. So assuming you can get engineers together with lawyers and policy folks, what do you do with them? We decided that we’d need a user-friendly tool that could be used to explore and add metadata to legislation from around the world. This could highlight a developing legislative XML standard, Akoma Ntoso (more about this standard soon), and give lawyer and policy types hands-on experience with the kinds of text and analysis tools that engineers take for granted.

Hacking With A Legislative Editor

So one of us (the engineer, naturally) started building a web-based editor for legislation, while the other (the lawyer, naturally) started organizing the next hackathon. Of course, thought the lawyer, it would just be a matter of time before all governments worldwide use such editors to draft their laws and regulations in a standard data format.

Legislative Editor at legalhacks.org

Advances in Legislative Data Standards Efforts

Akoma Ntoso

Akoma Ntoso (AkN) is a strong contender to be that format. Developed under the auspices of the UN Department of Economic and Social Affairs, AkN is an XML data structure that is meant to capture high-level forms and semantic ideas that are common to a broad variety of legal texts. OASIS, the folks who brought us the DocBook standards, among others, have convened a standards committee to create an official legislative data standard based on AkN. (More disclosure: the engineer is a member of this committee.) There’s just one problem. Few governments are using AkN to draft or store their legislation.

AkN itself is fast evolving, and with more exposure to legal data structures from different jurisdictions, the OASIS committee will be able to adapt AkN to better model those structures.

We saw the International Legislation Un-hackathon as a venue to kick off this process. It was conceived with Charles Belle of UC Hastings, as part of the Legal Hacks initiative. The event was held simultaneously at UC Hastings, at Stanford and in Denver. Jim Harper and Francis Avila of the Cato Institute came to the Hastings event. We also had many international participants. Key among them were Professors Monica Palmirani and Fabio Vitali of the University of Bologna, the architects and primary evangelists of AkN. Over the course of the day, participants learned about AkN and, importantly, got a chance to try it out, marking up documents of their choosing with the web editor. In the process, as expected, we found bugs in the software and bugs in the standard. We found structures in U.S. legislation that didn’t fit well with the existing AkN element set. We saw places where there was confusion in applying AkN’s data structures to documents. All of this information was collected for incorporation into the development of both the editor and AkN, underscoring again the importance of getting more practical exposure for both.

University of Bologna Summer School–Ravenna

And we are working to expand the venues for this kind of practical exposure to develop the AkN standard. Every September, the University of Bologna hosts the LEX Summer School in Ravenna, Italy. For them, it’s an opportunity to introduce Akoma Ntoso to new groups of students from around Europe and around the world. For the students, it’s an opportunity to learn about the application of XML to legislation, see the success various groups are having around the world, and to meet interesting new people having a passion for legal informatics. One of us, the engineer, who was a student two years ago, was invited to return last year to present a success story, and this year is returning once more to deliver a class in how to build and use the HTML5-based editor for drafting legislation in XML. For us, this is an opportunity to expose the editor to the European legal traditions in order for us to better understand how our editor must evolve to fulfill our vision of a unified standard around the world with common, highly adaptable, tools.

Chile National Library of Congress Browser-based editor

Another step toward adoption of legislative data standards is a project by Chile’s National Library of Congress (BCN in Spanish) called the “History of the Law” (Historia de la Ley). This ambitious project aims to bring together machine learning, a legislative editor and other features to mark up Chile’s legislative record and other legislative documents. The BCN has chosen Xcential’s browser-based editor, working with the AkN standard, to conduct the mark-up and correction after documents are passed through an automated parser. As with the hackathon, but on a larger scale, we are learning from experience the modifications that are needed to AkN, to make it work with Chile’s live documents. Excitingly, each mismatch we find between AkN and actual legislation can be fed back into the OASIS committee process, to make AkN able to handle a wider variety of real-world use cases.

Other Efforts and the Future of Legislative Data Standards

We see these steps as just the beginning. European governments are also flirting with legislative standards, and Karen Suhaka’s group at BillTrack50 has converted all U.S. bills from all states into a single standard XML format, showing both that the technical hurdles can be overcome and many of the practical benefits of doing so. In focusing on the projects (and hackathons) we are most closely involved with, we have certainly left out a lot of the initiatives that are advancing legislative data standards around the world. That’s what the comments are for. Let us know your experience with Akoma Ntoso as a legislative standard, and what you’re doing or interested in doing with AkN or other legislative data standards worldwide.

Grant Vergottini (the engineer) is a founder of Xcential. He is a leading authority on applications of XML data to legislation. Prior to founding Xcential, Grant was the Director of Applications at Chrystal Software, a company dedicated to XML design and reporting software. Before Chrystal, Grant led the redesign of Homestore.com, and founded Genedax Design Automation, which developed innovative team and data management applications for electronics design. Bringing data structures and automation tools to the legislative drafting process parallels the work that Grant did earlier in his career at Mentor Graphics and the Boeing Company, where he participated in the transformation from manual drafting to CAD software. Mr. Vergottini holds a Bachelor of Science in Electrical Engineering from Cleveland State University, where he graduated Summa Cum Laude.

Ari Hershowitz (the lawyer) is a consultant at Xcential, and founder of Tabulaw. Tabulaw develops software for lawyers, including a web-based legal research and writing platform. Prior to Tabulaw, Ari worked to protect wildlife and habitats from Chile to Mexico as Director of the BioGems project for Latin America at the Natural Resources Defense Council. Ari has a law degree from Georgetown University Law Center, a Masters in Computation and Neural Systems from Caltech, and a Bachelors in Molecular Biophysics & Biochemistry from Yale College.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

[Editor’s Note: For topic-related VoxPopuLII posts please see: Núria Casellas, Semantic Enhancement of legal information … Are we up for the challenge?; João Lima, et al., LexML Brazil Project; and Rinke Hoekstra, The MetaLex Document Server.]

The MetaLex Document Server

Legal identifiers, Legal metadata, Legal semantic web, Legal text processing, Legal XML, Legislative information systems, Regulatory information systems, Semantic Web and law

Oct 25, 2011

In this post I describe the process and requirements that eventually led to the MetaLex Document Server, a server that hosts all versions of Dutch national statutes and regulations published since May 2011, both as CEN MetaLex, and as Linked Data. Before I set out to do so, however, I would like to emphasize that, although the development of the server and its contents was a one-man-job, the road to make it possible surely was not solitary. A couple of people I’d like to mention here are Alexander Boer, Radboud Winkels, and Tom van Engers of the Leibniz Center for Law, together with whom I have worked over the past ten years to develop, test, and publish the ideas that underlie CEN MetaLex. Also, the team around legislation.gov.uk clearly has done a lot of great and inspiring work in this area.

So, what happened? Over the course of last spring, I was involved in several small-scale projects that shared a specific need: version-aware identifiers for all parts of legislative texts. The first of these was a report for the Swiss Federal Chancellery on possible technological solutions for a regulation drafting system to be used by the Swiss government. Second to arrive on my desk was a project for the Dutch Tax and Customs Administration (Belastingdienst), in which we were asked to develop a concept-extraction toolkit that would allow them to make explicit where concepts are defined, where they are reused, and how they relate to other concepts (e.g., from an external thesaurus). The purpose of this project was to investigate whether we could replace with technology what is currently a manual process of turning legislation into business processes that fuel citizen- and business-oriented services. The Belastingdienst needs this to better cope with the yearly changes to tax regulation issued by the Ministry of Finance. The Dutch Immigration and Naturalisation Service (IND) faces exactly the same problem: discovering which parts of its business processes are affected by each legislative modification. Updates to legislation require continuous, significant investment in IT re-engineering.

The root of the problem

But don’t modern European governments already have elaborate facilities for supporting this workflow? I’m afraid not.

Currently, regulation drafting is a process of sending around Word documents, copy-and-pasting from older texts, “version hell,” signing by a Minister, and sending the enacted regulations off to a publisher, who will then turn them into some XML format to feed a publishing platform to generate HTML, PDF, and paper versions of the texts. This process is not designed with a content management perspective, and most if not all metadata is thrown away in the process.

Part of the problem is one of organisational change: convincing legislative drafters to use a more structured approach in their daily work. The Dutch Ministry of Security and Justice is currently developing a legislative editing environment (similar to the MetaVex editor developed at the University of Amsterdam), but it will take a while before this is adopted in practice.

Requirements

To develop a chain of tools for managing legal information, both as text and as knowledge models, we need to address a number of key requirements:

An integrated legislative drafting and editing environment that supports advanced version and provenance tracking (e.g., version tracking of successive changes to draft texts). Provenance information is very important for eliciting the procedure that led to an official version (both pre- and post-publication), as well as its underlying motivation.
A format in which these texts are stored that is flexible enough to allow both editing and publication to various formats (such as PDF and HTML).
The ability to persistently identify every element of a legal text. Versioning of texts, references, and metadata requires identifiers that reflect the different versions of these resources. The various parts of a text should be versioned independently, allowing for transitory regimes.

A versioning mechanism should distinguish between a regulation text as it exists at a particular time, and the final regulation. The IFLA Functional Requirements for Bibliographic Records (FRBR) (Saur, ’98) makes the following distinctions: the work as a “distinguishable intellectual or artistic creation” (e.g., the constitution); the expression as the “intellectual or artistic form that a work takes each time it is realised” (e.g., “The Constitution of July 15th, 2008”); the manifestation as the “physical embodiment of an expression of a work” (e.g., a PDF version of “The Constitution of July 15th, 2008”); and the item as a “single exemplar of a manifestation” (e.g., the PDF version of “The Constitution of July 15th, 2008” residing on my USB stick).

These identifiers should be dereferenceable to the element they describe, or a description of the element’s metadata.
Metadata and annotations should be traceable to the most detailed part of a text, as well as to its version, when needed. The same requirement holds for references between texts, allowing for fine-grained analysis of interdependencies between texts.
It is furthermore a requirement that these identifiers be transparent and follow a prescribed naming convention. This allows third parties to construct valid identifiers without having to first query a name service.
The metadata itself should be made accessible in a standard format as well.

Making do with what we’ve got

As we don’t have any time to waste, and have neither the organisational infrastructure, nor the funds, to use or develop any other (richer) information source, we need to make do with what’s currently available. How hard is it to build a chain of tools that meets at least part of these requirements? And, what information does the Dutch government already provide on which we can build the services that it itself so dearly needs?

Wetten.overheid.nl is the de facto source for legislative information in The Netherlands. Users can perform a full text search through the titles and text of all statutes and regulations of the Kingdom of the Netherlands. They can search for a specific article, as well as for the version of a text as it stood at a specified date. Wetten.overheid.nl also provides an API for retrieving XML manifestations of statutes and regulations.

Identifiers

What about identifiers? Wetten.nl supports deeplinks to particular versions of statutes and regulations, but is not very consistent about it. For example:

http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005416/article=6/date=2005-01-14

and

http://wetten.overheid.nl/BWBR0005416/TitelII698946/HoofdstukII/Artikel16/geldigheidsdatum_14-01-2005

both point to article 6 of the Municipal law, as it was in effect on January 14th, 2005. A third mechanism for identifying regulations is the Juriconnect standard for referring to parts of statutes and regulations. XML documents hosted by Wetten.nl use these identifiers to specify citations between statutes and regulations. For instance, the Juriconnect identifier for article 6 of the Municipal law is:

1.0:c:BWBR0005416&artikel=16&g=2005-01-14

But… the standard does not prescribe a mechanism for dereferencing an identifier to the actual text of (part of) a statute or regulation.
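The parts of such an identifier can be pulled apart mechanically. The following Python sketch reads the example above as a version, a reference type, a BWB identifier and a list of key=value parameters. That reading is inferred from the single example shown here, not from the Juriconnect specification, and the function name is hypothetical.

```python
def parse_juriconnect(identifier):
    """Split an identifier shaped like the example above into its parts.

    Assumed layout (inferred from the example only):
    <version>:<type>:<BWB id>&<key>=<value>&<key>=<value>...
    """
    version, kind, rest = identifier.split(":", 2)
    bwb_id, *pairs = rest.split("&")
    params = dict(pair.split("=", 1) for pair in pairs)
    return {"version": version, "type": kind, "bwb_id": bwb_id, "params": params}


parsed = parse_juriconnect("1.0:c:BWBR0005416&artikel=16&g=2005-01-14")
```

Parsing only recovers the components; it does not solve the dereferencing problem, since nothing in the identifier says where the text can be fetched.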

The BWB XML service and format

XML manifestations of statutes and regulations are retrievable through an API on top of the “Basiswettenbestand” (BWB) content management system. This REST Web service only provides the latest version of an entire statute or regulation. The BWB XML document returned is stripped of all version history: it does not even contain the version date of the text itself.

An index of all BWB identifiers, with basic attributes such as official and abbreviated titles, enactment and publication dates, retroactivity, etc. is available as a zipped XML dump or a SOAP service. Unfortunately, the XML file is corrupt, and the date of the latest change to a statute or regulation reported in the index is not really the date of the latest modification, but of the latest update of the statute or regulation in the CMS. See the picture above.

The BWB uses its own XML schema for storing statutes and regulations; this schema does not allow for intermixing with any third-party elements or attributes, ruling out obvious extensions for rich annotations such as RDFa. And, BWB XML elements do not carry any identifiers.

A more general XML format: CEN MetaLex

CEN MetaLex is a jurisdiction-independent XML format for representing, publishing, and interchanging legal texts. It was developed to allow traceability of legal knowledge representations to their original source. MetaLex elements are purely structural: the standard keeps syntactic structure strictly separate from the meaning of elements by specifying, for each element, a name and its content model. What this essentially does is to pave the way for a semantic description of the types of content of elements in an XML document. The standard prescribes the existence of a naming convention for minting URI-based identifiers for all structural elements of a legal document. MetaLex explicitly encourages the use of RDFa attributes on its elements, and provides special metadata-elements for serialising additional RDF triples that cannot be expressed on structural elements themselves. MetaLex includes an ontology, which defines an event model for legislative modifications. The legislation.gov.uk portal has adopted the MetaLex event model for representing modifications.

Getting from BWB to CEN MetaLex

The MetaLex schema is designed to be independent of jurisdiction, which means that it should be possible to map each legacy XML element to a MetaLex element in an unambiguous fashion. Fortunately, we were able to define a straightforward 1:n mapping between BWB and MetaLex (see below) by a semi-automatic conversion of the BWB XML DTD.

The transformation of legacy XML to MetaLex and RDF is implemented in the MetaLex converter, an open source Python script available from GitHub. Conversion occurs in four stages:

mapping legacy elements to MetaLex elements,
minting identifiers for newly created elements,
producing metadata for these elements, and
serialising to appropriate formats.

Step 1: Mapping

For the transformation of BWB XML files, the converter is sequentially fed with all BWB XML files and identifiers listed in the BWB ID index. Based on the mapping table, the converter traverses the DOM tree of the source document, and synchronously builds a DOM tree for the target document. In cases where the MetaLex schema doesn’t quite “fit,” the converter has to make additional repairs.
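A minimal sketch of such a table-driven traversal, using Python's standard xml.etree.ElementTree, might look as follows. The mapping table, the fallback to an inline element, and the use of a name attribute to keep the legacy element name are illustrative assumptions, not the converter's actual table or repair logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping table: legacy element name -> generic target element.
MAPPING = {
    "wet": "root",
    "hoofdstuk": "hcontainer",
    "artikel": "hcontainer",
    "kop": "htitle",
    "lid": "container",
    "al": "block",
}


def convert(source, default="inline"):
    """Walk the source tree and build a parallel target tree."""
    # The legacy name is preserved in a "name" attribute (an assumption here).
    target = ET.Element(MAPPING.get(source.tag, default), {"name": source.tag})
    target.text = source.text
    target.tail = source.tail
    for child in source:
        target.append(convert(child, default))
    return target


legacy = ET.fromstring(
    "<wet><hoofdstuk><kop>Hoofdstuk 1</kop>"
    "<artikel><lid><al>Tekst.</al></lid></artikel></hoofdstuk></wet>"
)
metalex = convert(legacy)
```

The sketch does no repairs; a real converter would need an extra pass for source structures that the target content model does not allow.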

We evaluated the ability of MetaLex to map onto the BWB XML by running the converter on 300 randomly selected BWB identifiers. The artikel element accounts for 72% of all corrections, and corresponds to 68% of all htitle substitutions (5% of total). This means that only a very small part of BWB XML does not directly fit onto the MetaLex schema. The main cause for incompatibility is the restriction in MetaLex that hcontainer elements are not allowed to contain block elements (and that’s perhaps something to consider for the MetaLex workshop).

Because of the limitations of the API, version information, citation titles, and other metadata are retrieved through a custom-built scraper of the information pages on the wetten.nl Website.

Step 2: Minting Identifiers

For every element in the document we create transparent URL-like URIs for the work, expression and manifestation levels, and two opaque URIs for the expression and item levels in the FRBR specification.

For transparent URIs, we use a naming scheme that is based on the URIs used at legislation.gov.uk, with slight adaptations to allow for the Dutch situation. In short, work level identifiers are based on the standard BWB identifier, followed by a hierarchical path to an element in the source, e.g., “chapter/1/article/1”. These URIs are extended to expression URIs by appending version and language information. Similarly, manifestation URIs are extensions of expression URIs that specify format information such as XML, RDF, etc. Juriconnect references in the source BWB XML are automatically translated to this naming scheme.
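The layering of work, expression and manifestation URIs can be sketched in a few lines of Python. The base URL, the order in which language and version are appended, and the "data.<format>" file name are assumptions made for illustration; the exact layout used by the server is not spelled out in this post.

```python
BASE = "http://example.org/id"  # hypothetical base, not the server's real one


def work_uri(bwb_id, path=""):
    """Work level: BWB identifier plus a hierarchical path to an element."""
    return f"{BASE}/{bwb_id}/{path}" if path else f"{BASE}/{bwb_id}"


def expression_uri(bwb_id, path, language, version):
    """Expression level: work URI extended with language and version (assumed order)."""
    return f"{work_uri(bwb_id, path)}/{language}/{version}"


def manifestation_uri(bwb_id, path, language, version, fmt):
    """Manifestation level: expression URI extended with a format (assumed naming)."""
    return f"{expression_uri(bwb_id, path, language, version)}/data.{fmt}"


w = work_uri("BWBR0005416", "chapter/1/article/1")
e = expression_uri("BWBR0005416", "chapter/1/article/1", "nl", "2005-01-14")
m = manifestation_uri("BWBR0005416", "chapter/1/article/1", "nl", "2005-01-14", "xml")
```

Because each level is a strict prefix of the next, a third party can build any of these URIs from the BWB identifier and a path without consulting a name service.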

The opaque version URI is needed to distinguish different versions of a text. The current Web service does not provide access to all versions of statutes and regulations (only to the latest), let alone at a level of granularity lower than entire statutes or regulations. We therefore need some way of constructing a version history by regularly checking for new versions, and comparing them to those we looked at before. By including in the opaque URI a unique SHA1 hash of the textual content of an XML element, and simultaneously maintaining a link between the opaque URI and the transparent identifier, different expressions of a work can be automatically distinguished through time. This is needed to work around issues with identifiers based on numbers: the insertion of a new element can change the position (and therefore the identifier) of other elements without a change in the content of the elements.

By this method, globally persistent URIs of every element in a legal text can be consistently generated for both current and future versions of the text. By simultaneously generating an opaque and a transparent expression level URI, identification of these text versions does not have to rely on numbering.
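The content-hash idea can be illustrated with Python's hashlib. This is a sketch under stated assumptions: the whitespace normalisation and the way the hash is appended to a URI are choices made here for illustration, and the helper names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_hash(element):
    """SHA1 over the element's text content, with whitespace collapsed (assumed)."""
    text = " ".join("".join(element.itertext()).split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def opaque_uri(work_uri, element):
    """Hypothetical opaque URI: the work URI followed by the content hash."""
    return f"{work_uri}/{content_hash(element)}"


# Two versions of a text: in the second, a new article is inserted in the middle,
# so the unchanged article moves from position 2 to position 3.
v1 = ET.fromstring(
    "<wet><artikel>Eerste tekst.</artikel><artikel>Tweede tekst.</artikel></wet>"
)
v2 = ET.fromstring(
    "<wet><artikel>Eerste tekst.</artikel><artikel>Nieuw.</artikel>"
    "<artikel>Tweede tekst.</artikel></wet>"
)

unchanged = content_hash(v1[1]) == content_hash(v2[2])
shifted = content_hash(v1[1]) != content_hash(v2[1])
```

A position-based identifier would treat the element at position 2 as the same article in both versions; the hash shows that its content has changed, while the article now at position 3 is the old one.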

Step 3: Producing Metadata

The MetaLex converter produces three types of metadata. First, legacy metadata from attributes in the source XML is directly translated to RDF triples. Second, metadata describing the structural and identity relations between elements is created. This includes typing resources according to the MetaLex ontology, e.g., as ml:BibliographicExpression; creating ml:realizes relations between expressions and works; and creating owl:sameAs relations between opaque and transparent expression URIs. The official title, abbreviation, and publication date of statutes and regulations are represented using the dcterms:title, dcterms:alternative and dcterms:valid properties.
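To make the second kind of metadata concrete, here is a small dependency-free Python sketch that builds such triples and prints them as N-Triples. The ML namespace below is a placeholder rather than the real MetaLex ontology namespace, attaching the title to the expression is an assumption, and the helper names are hypothetical.

```python
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCTERMS = "http://purl.org/dc/terms/"
ML = "http://example.org/metalex#"  # placeholder, NOT the actual ontology namespace


@dataclass(frozen=True)
class Literal:
    value: str


def describe_expression(work, transparent, opaque, title=None):
    """Typing, realisation and identity triples for one expression."""
    triples = [
        (transparent, RDF_TYPE, ML + "BibliographicExpression"),
        (transparent, ML + "realizes", work),
        (opaque, OWL_SAME_AS, transparent),
    ]
    if title is not None:
        triples.append((transparent, DCTERMS + "title", Literal(title)))
    return triples


def term(node):
    """Serialise a URI or a plain literal in N-Triples syntax."""
    if isinstance(node, Literal):
        escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{node}>"


def to_ntriples(triples):
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


triples = describe_expression(
    "http://example.org/id/BWBR0005416",
    "http://example.org/id/BWBR0005416/nl/2005-01-14",
    "http://example.org/id/BWBR0005416/0a1b2c",
    title="Gemeentewet",
)
output = to_ntriples(triples)
```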

Events and Processes

Event information plays a central role in determining what version of a regulation was valid when. Making explicit which events and modifying processes contributed to an expression of a regulation provides for a flexible and extensible model. The MDS uses the MetaLex ontology for legislative modification events, the Simple Event Model (SEM) and the W3C Time Ontology for an abstract description of events and event types, and the Open Provenance Model Vocabulary (OPMV) for describing processes and provenance information. These vocabularies can be combined in a compatible fashion, allowing for maximal reuse of event and process descriptions by third parties that may not necessarily commit to the MetaLex ontology.

Step 4: Serialization

The MetaLex converter supports three formats for serialising a legal text to a manifestation: the MetaLex format itself, viewable in a browser by linking a CSS stylesheet; RDFa, Turtle and RDF/XML serializations of the RDF metadata; and a citation graph. The converter can automatically upload RDF to a triple store through either the Sesame API, or SPARQL updates. The citation graph is exported as a “.net” network file, for further analysis in social network software tools such as Pajek and Gephi. We are exploring ways to use these networks for determining the importance of articles (in-degree) and the dependency of legislation on certain articles (betweenness centrality), and for analysing the correlation between legislation and case law.
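As an illustration of the citation graph export, the sketch below writes a list of (citing, cited) pairs in the Pajek network format and counts in-degree per article. The article labels are made up, and the function names are hypothetical; the converter's own export code may differ.

```python
from collections import Counter


def to_pajek(citations):
    """Serialise (citing, cited) label pairs as a Pajek network with directed arcs."""
    citations = list(citations)
    labels = sorted({node for pair in citations for node in pair})
    index = {label: i for i, label in enumerate(labels, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Arcs")
    lines += [f"{index[src]} {index[dst]}" for src, dst in citations]
    return "\n".join(lines) + "\n"


def in_degree(citations):
    """Number of incoming citations per article."""
    return Counter(dst for _, dst in citations)


citations = [("art/1", "art/3"), ("art/2", "art/3"), ("art/3", "art/4")]
network = to_pajek(citations)
degrees = in_degree(citations)
```

Here art/3 is cited twice, so by the in-degree measure it is the most important article in this toy network.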

Publication

The results of this procedure are published through the MetaLex Document Server (MDS). The server follows the Cool URIs specification, and implements HTTP-based redirects for work- and expression-level URIs to corresponding manifestations based on the HTTP Accept header. Requests for an HTML MIME type are redirected to a Marbles HTML rendering of a Symmetric Concise Bounded Description (SCBD) of the RDF resource. Similarly, requests for RDF content return the SCBD itself; supported formats are RDF/XML and Turtle. A request for XML will return a snippet of MetaLex for the specified part of a statute or regulation.
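The redirect logic can be sketched as a small function that picks a manifestation from the Accept header. The file names, the 303 status code and the fallback to HTML are assumptions chosen for illustration; the Accept parsing is deliberately simplified and ignores wildcards.

```python
# Hypothetical mapping from requested MIME type to manifestation file name.
MANIFESTATIONS = {
    "text/html": "data.html",
    "application/rdf+xml": "data.rdf",
    "text/turtle": "data.ttl",
    "application/xml": "data.xml",
}


def parse_accept(header):
    """Return the MIME types in an Accept header, highest q-value first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        mime, quality = fields[0], 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if mime:
            entries.append((-quality, position, mime))
    return [mime for _, _, mime in sorted(entries)]


def redirect_target(resource_uri, accept_header, default="text/html"):
    """Return (status, location) for a work- or expression-level URI."""
    for mime in parse_accept(accept_header):
        if mime in MANIFESTATIONS:
            return 303, f"{resource_uri}/{MANIFESTATIONS[mime]}"
    return 303, f"{resource_uri}/{MANIFESTATIONS[default]}"


status, location = redirect_target(
    "http://example.org/id/BWBR0005416/nl/2005-01-14",
    "text/turtle;q=0.9, application/rdf+xml",
)
```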

The MDS provides two convenient methods for retrieving manifestations of a statute or regulation. Appending “/latest” to a work URI will redirect to the latest expression present in the triple store. Appending an arbitrary ISO date will return the last expression published before that date if no direct match is available. Lastly, the MDS offers a simple search interface for finding statutes and regulations based on the title and version date.
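The two lookup rules described above amount to a search over a sorted list of version dates. A minimal sketch, assuming the version dates of a work are already available as a sorted list (in the MDS they would come from the triple store), with a hypothetical function name:

```python
from bisect import bisect_right
from datetime import date


def resolve_version(versions, requested):
    """Pick an expression date for 'latest' or for an ISO date string.

    versions must be sorted in ascending order. Returns None when there
    is no version on or before the requested date.
    """
    if not versions:
        return None
    if requested == "latest":
        return versions[-1]
    target = date.fromisoformat(requested)
    index = bisect_right(versions, target)  # versions[:index] are all <= target
    return versions[index - 1] if index else None


versions = [date(2011, 5, 1), date(2011, 7, 1), date(2011, 10, 1)]
exact = resolve_version(versions, "2011-07-01")
earlier = resolve_version(versions, "2011-08-15")
latest = resolve_version(versions, "latest")
```

An exact match is returned as is; otherwise the most recent earlier version is used, which matches the fallback behaviour described for arbitrary dates.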

Results and Take Up

We have been running the converter on a daily basis, on all versions of statutes and regulations made available through the wetten.nl portal since May 2011. This has resulted in a current total of 29,120 document versions: 28 thousand versions in the first run, the rest accumulated through time. For these document versions, we now store 119 million triples of RDF metadata in a 4Store triple store. Compared to the size of legislation.gov.uk (1.9 billion triples, since the 1200s), this is a modest number, but at the current growth rate we will soon need to look for alternative (more professional) solutions. Check the http://doc.metalex.eu Website for the latest numbers.

I am happy to say, also, that this work has not gone unnoticed. The IND was particularly enthused by the versioning mechanism, and is in the process of adopting the MDS approach as their internal content management system. Similarly, the ability to link concept descriptions to reliably versioned parts of legislation has been an eye opener for the Belastingdienst. We are also in touch with several people at ICTU, the organisation behind Wetten.nl, to help them improve their services.

Dr. Rinke Hoekstra is a postdoctoral researcher at the Leibniz Center for Law at the University of Amsterdam. He is the developer of the MetaLex Document Server, the principal author of the LKIF Core ontology of basic legal concepts, and one of the initial developers of the MetaLex XML format for legal sources.

VoxPopuLII is edited by Judith Pratt. Editor-in-Chief is Robert Richards, to whom queries should be directed.

LexML Brazil Project

elegislation, elegislation systems, information retrieval, Legal identifiers, Legal metadata, Legal ontologies, Legal text processing, Legal XML, Legislative information systems, open source software, search

Oct 15, 2010

This post is divided into three topical sections. The first one is an introduction to the LexML Brazil Project and its unified search portal, after which some aspects related to semantic interoperability shall be presented and, at the end, we show the current work and future direction of the project.

Before going on to the aforementioned subjects, a few words about Brazil and its legislative and legal systems are necessary. Brazil is a country of continental proportions, composed of 27 states and more than five thousand municipalities, or cities, as in Brazil no distinction is made between town and city. As a federative system, each state and municipality has its own legislative chamber. While states and cities follow a unicameral system, the Federation itself has a bicameral system, with the National Congress divided into a Chamber of Deputies and the Federal Senate. These legislatures generate a great number of laws, or normative acts. The abundance of normative acts is very significant, considering that, in contrast with Common Law systems, Brazil’s legal system, based on the Civil Law, is characterized by the predominance of normative acts.

According to Edilenice Passos, “the proliferation of normative acts, of higher or lower hierarchy, eventually causes total chaos, for this big mass of juridical documents hampers the work of lawyers, of researchers, and of the very citizens, who are ruled by Brazilian laws.” Edilenice Passos also cites Arnoldo Wald, who, in 1969, was already alerting Brazilians that “the true legislative labyrinth created as a result of an inflation of statutes passed in recent years has turned the ruling Brazilian law into a patchwork, in which the mere legislative updating becomes a daily torture for a lawyer and a judge who are searching for the rules applicable to a specific subject, from among acts, supplementary acts, institutional acts, decree-laws, and other normative acts.”

Almost all Brazilian legal and legislative information is available through the Internet. However, this information is distributed among several thousand sites, each containing documents produced by a specific government institution. Thus, the relationships between acts of different institutions are not available explicitly, making it very hard to understand this “legal patchwork.”

Nowadays, much time is lost looking for this information, filtering the results of search engines. As Roy Tennant says, “Librarians like to search; everyone else likes to find,” and further adds: “People generally want to find everything they can on a topic, ranked by relevance and displayed in ways that make it easy to narrow in on their goal.”

Born to address these issues, LexML Brazil is an information network that aims to organize Brazil’s legislative and legal information. The project is an initiative of the “Comunidade TI Controle” (IT Control Community) and is being implemented by the Brazilian Federal Senate, through PRODASEN (the Senate’s special secretariat for information systems) and Interlegis (a virtual community of Brazilian legislatures).

LexML Brazil’s first product is the Legislative and Legal Information Portal, which opened on June 30, 2009, indexing 1.28 million documents. In September 2010, its index covered more than 1.5 million documents. By indexing the metadata collected from several institutions using the OAI-PMH protocol, the portal unifies access to a variety of legislative and legal information sources, which is a step toward the goal of guaranteeing Brazilians’ constitutional right of access to information.
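For readers unfamiliar with OAI-PMH, harvesting starts with a plain HTTP request carrying a verb and a metadata format. The sketch below only builds such request URLs with Python's standard library; the endpoint is a placeholder, the choice of the oai_dc format is an assumption, and nothing here reflects how the LexML portal itself is configured.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When a resumption token is supplied it is sent on its own, since the
    protocol treats resumptionToken as an exclusive argument.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint of a hypothetical data provider.
first_page = list_records_url("http://example.org/oai", from_date="2010-09-01")
next_page = list_records_url("http://example.org/oai", resumption_token="abc")
```

A harvester would fetch the first URL, index the returned records, and keep requesting with each resumption token until the provider returns none.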

LexML Portal

The LexML Portal home page layout is very simple and is similar to Google’s main page. On this screen, it is possible to restrict the search to Legislation, Jurisprudence, or Bills.

The search results page allows the user to refine the search by using filters, according to his or her information requirements. Five filters are available: location, issuing authority, document type, date, and acronyms.

The detail page provides links to the official publication version of each document, and to other publications available in information systems of network participants, which, in this particular case, are: National Press, Presidency, Chamber of Deputies, and Federal Senate. General information about the document is available by clicking one of “Mais Detalhes (More details)” links, which directs the Web browser to the corresponding network participant’s metadata page. A service providing automatic identification of textual references can be activated by clicking the “Linker” label.

Semantic Interoperability

While systems interoperability and syntactic issues can be managed with the established standards of representation, codification, and exchange (XML, METS, Unicode, OAI-PMH, etc.), structural and semantic interoperability demands the adoption of a reference model that allows the integration of several models and the use of a unified terminology for indexing different sources of information. According to Patel et al., the general purpose of semantic interoperability is “to support complex and advanced context-sensitive query processing over heterogeneous information resources.” Lack of semantic interoperability in turn produces the “information silos” problem, characterized by the lack of information integration and consequent inability to process complex queries.

The next section presents the design choices made by the LexML Brazil Project to address issues related to semantic interoperability using Ranganathan’s “stratification planes” classification system, featuring: an idea plane, a verbal plane, and a notational plane.

Idea Plane

The idea plane is composed of the abstract entities of a domain, independently of how they are named or identified.

The metadata standards that propose to address interoperability issues do so either for a specific, restricted domain or for heterogeneous domains. Specialized metadata standards (MARC, EAD, MODS, etc.) allow different sources of information about specific domains (bibliographical or archival information) to be integrated and searched in an advanced form. On the other hand, the Dublin Core standard is one of the few that try to integrate arbitrarily heterogeneous sources using a minimum set of elements and qualifiers. Its characteristic simplicity enables easy adoption by multiple actors, but also hinders query processing, preventing the use of the rich chain of relationships among entities. The lack of generality or expressiveness of these standards precludes their use for achieving semantic interoperability of heterogeneous sources of legislative and legal information in Brazil.

An alternative is to use formal ontologies instead of metadata standards. According to Martin Doerr, “recently, more and more projects and theoreticians support the use of formal ontologies as common conceptual schema for information integration.” One such ontology, the CIDOC CRM model, was designed to help the integration, mediation, and interchange of heterogeneous cultural heritage information. It was developed in 1994 and has since been approved as the ISO 21127:2006 standard. The CIDOC CRM model is then a natural choice for conceptual schemas of legal and legislative information, if one considers that the text corpus consisting of a nation’s sources of law is a part of the nation’s cultural heritage information.

However, the CIDOC CRM “document” concept lacks the detail needed to describe the relationships among the several information abstraction levels: work, expression, manifestation, and item. That requirement is fulfilled by the FRBR_ER entity-relationship model, which was considered as a reference model in earlier phases of the project (“An Adaptation of the FRBR Model to Legal Norms,” João Lima, Proceedings of the V Legislative XML Workshop, Florence, 2005).

The FRBR_OO standard, an ontology created by a working group formed in 2003 by representatives of IFLA (International Federation of Library Associations and Institutions) and ICOM (International Council of Museums) for purposes of harmonizing both models, was adopted by the LexML project because it combines the advantages of both models while addressing their shortcomings. As such, FRBR_OO manifests a great affinity to the LexML domain (“A Time-aware Ontology for Legal Resources,” João Lima et al., Proceedings of the Tenth International ISKO Conference, 2008).

One of the great innovations of the CIDOC CRM model is the information structuring around temporal events, a central concept in the model. This contrasts with most other metadata models, which have resources as the central objects of interest. This innovative approach defines events as entities that connect actors, things (concrete and abstract), places, and time intervals.

This particular emphasis could be criticized on the ground that the user is generally interested in a specific resource, such as the text of a law. However, the result of a search for information about a law is much more relevant if it includes an organized list of events related to the resource, along with the resource itself.

The importance of choosing a suitable reference model is easily observable in the present discussion about what particular syntax to use to codify persistent identifiers — urn:lex, LegisLink, Akoma Ntoso, etc. Before reaching the syntax level, such discussions should focus first on the idea plane, where a greater potential for integration exists. A consensus reached at this level would allow great flexibility for the specification of diverse persistent identifier syntaxes.

Verbal Plane

The CIDOC CRM ontology separates the class of types and denominations from other classes. Multiple names, identifiers, and types can be attributed to all entities of the CRM, allowing any domain class to be classified by several taxonomies and be known by multiple names and identifiers.

This approach is used in LexML to represent different terms that identify the same concepts. Six classes form LexML’s uniform resource identifiers: place, authority, type of document, event, type of content, and language. To externalize the LexML vocabularies specification, we recommend, and use, the W3C SKOS (Simple Knowledge Organization System).

Notational Plane

The definition of uniform and persistent identifiers is fundamental for the creation and maintenance of an information chain. Identifiers are already part of the legal domain. For identification purposes, numbers are attributed to rulings, decisions, abridgments, and bills, allowing references by means of textual remissions. In the computational environment, the creation of persistent and uniform identifiers allows not only identification and reference, but also access to documents by means of textual hyperlinks.

Based on the experience of the Italian project Norme in Rete with respect to URN (Uniform Resource Name) identifiers, LexML defines a grammar for the construction of identifiers for legislative and legal documents in Brazil. As an example, the name “urn:lex:br:federal:lei:1993-06-21;8666” identifies, in a persistent and unique way, the “Federal Act No. 8666, of June 21, 1993.” If all information systems agree with respect to the identifiers, it is possible to share descriptive metadata, as well as information about semantic relationships, such as regulation, amendment, abrogation, etc.
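To make the structure of such a name concrete, here is a minimal sketch that splits the example identifier into its parts. It is not the LexML grammar itself: the field names (jurisdiction, authority, doc_type, issued, number) are only our reading of the single example above, and the real grammar allows more variation than this parser accepts.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LexUrn:
    """Parts of a simplified urn:lex name (field names are illustrative)."""
    jurisdiction: str
    authority: str
    doc_type: str
    issued: date
    number: str


def parse_lex_urn(urn: str) -> LexUrn:
    """Parse names shaped like 'urn:lex:br:federal:lei:1993-06-21;8666'."""
    parts = urn.split(":")
    if len(parts) != 6 or parts[0] != "urn" or parts[1] != "lex":
        raise ValueError(f"not a simple urn:lex name: {urn!r}")
    _, _, jurisdiction, authority, doc_type, descriptor = parts
    # The last component holds the date and the number, separated by ';'.
    issued_text, separator, number = descriptor.partition(";")
    if not separator or not number:
        raise ValueError(f"missing ';number' part in: {urn!r}")
    return LexUrn(jurisdiction, authority, doc_type,
                  date.fromisoformat(issued_text), number)


if __name__ == "__main__":
    print(parse_lex_urn("urn:lex:br:federal:lei:1993-06-21;8666"))
```

Once two systems agree on the parsed parts, they can compare or join their metadata on these fields instead of on free-text citations.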

The Linker service, accessible through the LexML Portal (see, e.g., Act 11.705 without linker and Act 11.705 with linker), creates hyperlinks automatically through a dynamic textual analysis that identifies textual remissions of [i.e., citations to] normative documents. These hyperlinks can be used to navigate through textual remissions.

Future Directions

LexML 1.0 consists of the Search Portal, the Resolution Service, the Persistent Identifier, and the Linker Service. The next version, LexML 2.0, will go further: it will involve the development of open source tools for managing the complete text of documents encoded according to the LexML Brazil XML Schema, which was derived from the schemas of the Akoma Ntoso Project.

The complete management of document texts in a structured form has been a goal of the project since its inception. As early as 2000, the Federal Constitution Portal was implemented following this idea. This portal allows the user to see all the versions of the constitutional text through a timeline, with the option to see the list of historical changes [see, e.g., art. 12] and with the ability to navigate bi-directional links [for example, in art. 154, click on the blue arrows].

During the development of that portal, taking into account the various forms of XML used to encode normative texts in many countries, and especially the experience of the Italian project Norme in Rete, a decision was made to make a unified portal and a persistent identifier a priority of the LexML project. Presently, our efforts to build open source tools for management of document texts are being renewed. One of these tools, a LexML Document Editor, will enable the authoring of legal texts as if using a word processor, but producing a structured document at the end. Another tool is the Compiler, which will semi-automatically generate modified versions of documents that have been updated by other legal acts. The Consolidator will help to simplify the display of legal information — and users’ experience of the legal system — through the consolidation of several related normative acts into a single act. The Comparator will be used to display the differences between versions of a document. The last tool, the Publisher, will be used to render XML content in different formats, such as html, PDF, PDF-A, EPUB, etc., with the ability to choose different views of the same text, such as the original text, the updated text as of a specific date, etc.

Last but not least, the Information Management Committee is responsible for setting the priorities and long-range planning of the LexML Brazil Project. The committee is a community of practice composed of librarians, archivists, and information analysts from several institutions across the three branches of the Brazilian government, all interested in the management of legal and legislative information.

[Editor’s Note: For documentation, schemas, and controlled vocabularies respecting LexML Brazil, please see the LexML Brazil Project Website. For more information on these issues, please see the following VoxPopuLII posts: John Sheridan on Legislation.gov.uk, Ivan Mokanov on CANLII’s innovative legal citation system, Joe Carmel on LegisLink, and Robb Shecter on OregonLaws.org.]

The LexML Brazil core team, from left to right: João Lima (joaolima at senado.gov.br) is the leader of The LexML Project. His Information Science Ph.D. thesis details many of the concepts presented here; João Holanda (jholanda at senado.gov.br) holds a BSc in History from UnB; João Rafael (jrafael at senado.gov.br) holds a MSc in Computer Science from UFMG and a BSc in Computer Science from UnB; Marcos Fragomeni (fragomeni at senado.gov.br) holds a BSc in Computer Science from UnB.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Teaching the Computer to Read Legal Text

Legal text processing, Legal XML, natural language processing 3 Responses »

Oct 06 2010

In this post, I will describe how natural language processing can help in creating computer systems dealing with the law.

A lot of computer systems are being designed to help users deal with legal texts — accessing, understanding, or applying them. [Editor’s Note: Michael Poulshock’s Jureeka is an example of a system that automates the application of legal texts.] Other systems — such as DALOS — are about creating legal texts, providing support for the writers, or simulating the effects of a text. Such systems are based on something more than “just” the legal text: there is XML mark-up, an OWL ontology, or a representation of the rules in SWRL or some programming language. This means that any piece of legislation that you want to use on your computer system needs to be translated into this computer representation.

We try to support this translation using natural language processing, so that (part of) the translation can be done by a computer. This automation should have a number of advantages. First of all, computers are cheaper than human experts, and automating the process should reduce the amount of resources needed for this task. Second, the models that are produced by automated processes are more consistent; human experts may treat two similar sentences differently, but a computer program will always behave the same. Finally, an approach employing structures ensures that there is a clear mapping between the elements of the computer model and the original text.

Natural Language Processing isn’t perfect yet: computers cannot understand human language. However, legal text is quite structured, and offers a lot more handholds for automated translation than, say, a novel.

Document Structure

The first step that we will have to undertake is to determine the structure of the document. Online services like Legislation.gov.uk and wetten.nl can make it easier to access legal documents because they can point you to the right part of the document (such as a chapter, paragraph, sentence, etc.). In most law texts, the structure has been made explicit using clear headings, like: Chapter 1 or Chapter 1. General Provisions. So, in order to detect structure, we need to detect these headings. This means we’ll need to search the document for lines starting with Chapter, followed by some designation (which we refer to as an index), and perhaps followed by some text – say, the title of the chapter. The index can be a lot of things: Arabic numbers (1, 2, 3, …), Roman numbers (I, II, III, …) or letters (a, b, c, …). Sometimes the index is an ordinal appearing before the chapter label: First chapter. It may even be a combination of several numbers and letters (5.2a). This is not a great problem, as we can more or less assume that whatever follows the word Chapter is the index.

The main problem with this approach is that there are also regular sentences that start with the word Chapter, and we need to separate those out. To do so, we can use some heuristics: A title will not end with a full stop (.); a heading will always start on a new line; etc.
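The heading search and the two heuristics can be sketched in a few lines. This is only an illustration, not the code we actually run: the pattern and the function name are made up for this example, it handles only the label Chapter, and it ignores ordinal forms such as First chapter.

```python
import re

# "Chapter", then an index (whatever token follows), then an optional title.
# A full stop directly after the index ("Chapter 1. General Provisions")
# is treated as a separator, not as part of the index.
HEADING = re.compile(
    r"^Chapter\s+(?P<index>\S+?)\.?(?:\s+(?P<title>.*\S))?\s*$"
)


def find_chapter_headings(text):
    """Return (index, title) pairs for lines that look like chapter headings."""
    headings = []
    for line in text.splitlines():
        match = HEADING.match(line)  # heuristic: a heading starts its line
        if match is None:
            continue
        if line.rstrip().endswith("."):  # heuristic: a title has no full stop
            continue
        headings.append((match.group("index"), match.group("title") or ""))
    return headings


if __name__ == "__main__":
    sample = (
        "Chapter 1. General Provisions\n"
        "Chapter 3 of this Act does not apply to ferries.\n"
        "Chapter 5.2a Transport\n"
    )
    for index, title in find_chapter_headings(sample):
        print(index, "|", title)
```

Note that the second heuristic also rejects a bare heading written as "Chapter 1." with a trailing full stop; a real implementation would need a rule for that case.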

This procedure to find the headings for chapters is repeated to find headings for sections, subsections, etc. Also, some sections (like numbered paragraphs or list items) will not have a full heading, but just a number, which we also need to recognise. Finally, some sections don’t have a heading but can be recognised because they start with a fixed language pattern. For example, a preamble in a (recent) Dutch Law — such as this — will start with: We, Beatrix, Queen of the Netherlands, Princess of Orange-Nassau, etc. etc. etc.

This procedure assumes that the input for the process is just text. Many documents will contain more information — such as textual markup — and headings may be more easily identified because they are marked as bold text, or even as headings. So, in situations where the input is made up of documents that are marked-up in a consistent way, it may be easier to recognise the patterns by taking layout into account in addition to text.

To actually find the patterns, we can use existing toolkits like GATE. After the patterns have been found, and the structure has been recognised, we can store it using a format such as MetaLex.

References

The second step is to detect the references from a portion of a law text to other portions of that text, or from a law text to other texts. References, like headings, follow a pattern. The simplest patterns are rather similar to headings; the text chapter 13 is probably a reference, unless it is part of a heading. Just like headings, basic references consist of a label (section, chapter, article) and an index (13, 13.2.1, XIII, m). And, just as with headings, we can find the references by looking for these patterns in the text.
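As a rough sketch of this label-plus-index search (not the parser described below, and with a pattern and function name invented for this example), the following finds simple references and skips matches that look like headings, using the same heading heuristic as before:

```python
import re

# A label followed by an index: numbers (13, 13.2.1, 7.3b),
# Roman numerals (XIII), or a single letter (m).
SIMPLE_REF = re.compile(
    r"\b(?P<label>(?i:chapter|section|article))\s+"
    r"(?P<index>\d+(?:\.\d+)*[a-z]?|[IVXLCDM]+|[a-z])\b"
)


def find_simple_references(text):
    """Return (label, index) pairs for basic references in running text."""
    references = []
    for line in text.splitlines():
        stripped = line.strip()
        for match in SIMPLE_REF.finditer(stripped):
            # A match that opens a line without a closing full stop is
            # assumed to be a heading rather than a reference.
            is_heading = match.start() == 0 and not stripped.endswith(".")
            if not is_heading:
                references.append(
                    (match.group("label").lower(), match.group("index"))
                )
    return references


if __name__ == "__main__":
    sample = (
        "Chapter 13\n"
        "The rules of chapter 13 apply, as do section 13.2.1 and article XIII."
    )
    print(find_simple_references(sample))
```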

However, this is only the simplest form of references. Besides references to a specific section, such as chapter 13, there are of course also references to a complete law. Some of these references follow a pattern as well, such as the law of October 1st, 2007. Most laws are cited by means of a citation title, though, such as the Railroad Act. Such titles can contain all kinds of words, and they don’t follow a strict pattern. Thus, such references cannot be detected using patterns. Instead, we use a list containing all (citation) titles to detect such references.
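A list-based lookup of this kind could look as follows. This is a sketch under our own assumptions: the titles are invented examples, and trying longer titles first is one possible way to keep a title that contains another (Light Railroad Act, Railroad Act) from being reported twice.

```python
import re


def find_title_references(text, known_titles):
    """Return (position, title) pairs for every known citation title in text."""
    if not known_titles:
        return []
    # Longest titles first, so the alternation prefers the most specific one.
    ordered = sorted(known_titles, key=len, reverse=True)
    matcher = re.compile(
        r"\b(?:" + "|".join(re.escape(title) for title in ordered) + r")\b"
    )
    return [(match.start(), match.group(0)) for match in matcher.finditer(text)]


if __name__ == "__main__":
    titles = ["Railroad Act", "Light Railroad Act"]  # hypothetical title list
    sentence = "Article 3 of the Light Railroad Act refers to the Railroad Act."
    print(find_title_references(sentence, titles))
```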

Other, more complex references contain multiple references in one statement, such as articles 13 and 14, or multiple levels: article 13, item e, of the Railroad Act, or even more complex combinations of the two: articles 13, item e, 14, item f, 15 and 16, items a and b, of the Railroad Act. Though more complex than the simple combination of label and index, these references still follow clear (sometimes recurring) patterns, and can be found in the text by searching for such patterns.
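The simplest of these recurring patterns, a plural label followed by a list of indices, can be expanded into individual references as sketched below. The pattern is an illustration only; it deliberately does not handle the multi-level forms (article 13, item e, of the Railroad Act), which need a real grammar rather than a single expression.

```python
import re

INDEX = r"\d+(?:\.\d+)*[a-z]?"

# A plural label, then indices separated by commas, optionally ending in
# "and <index>": "articles 13 and 14", "sections 2, 3a and 5.1".
MULTI_REF = re.compile(
    r"\b(?P<label>article|section|chapter)s\s+"
    rf"(?P<indices>{INDEX}(?:\s*,\s*{INDEX})*(?:\s*,?\s+and\s+{INDEX})?)"
)


def expand_multi_references(text):
    """Turn 'articles 13 and 14' into [('article', '13'), ('article', '14')]."""
    references = []
    for match in MULTI_REF.finditer(text):
        for index in re.findall(INDEX, match.group("indices")):
            references.append((match.group("label"), index))
    return references


if __name__ == "__main__":
    print(expand_multi_references(
        "See articles 13 and 14, and sections 2, 3a and 5.1 of the Act."
    ))
```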

At the Leibniz Center for Law, we’ve created a parser based on these patterns, which had an accuracy of over 95%. For each reference found, we can construct some standardised name, and store it. With this technology, not only can we add hyperlinks to documents; we can also search for documents that refer to some specific document.

Classification

Now that we’ve got the structure and links in place, it’s time to start with the actual meaning of the text. Rather than tackling the entire text as a whole, we’ve selected sentences as the basic building blocks, and we attempt to create computer models for individual sentences first. Later, we can integrate those individual models into a complete model.

As a first step in creating the models, we start by assigning a broad meaning, or classification, to each sentence. Does the sentence give a definition for a concept, describe an obligation, or make a change in another law? In total, we distinguish fourteen different classes of sentences that appear in Dutch law texts. The next step in our automated approach is to assign a class to each sentence automatically.

To do so, we turn once again to language patterns. Legal language is rather strict, and legislative drafters don’t vary their language a lot — in a novel, variation may make for a more appealing text, but in a law, variation invites ambiguity. In fact, there are official Guidelines for Legislative Drafting that (among other things) reduce the variety of texts used. [Editor’s Note: For example, drafters of legislation in the U.S. House of Representatives Office of the Legislative Counsel have used Donald Hirsch’s Drafting Federal Law.] This means that for each of our classes, there’s a rather limited set of language patterns used. For example, definitions will look like one of these:

Under … is understood …

This law understands under … …

There are some variations in word order, but in the end, a small set of patterns is sufficient to describe all commonly used phrases. There is only one class of sentences where we cannot define a full set of patterns: obligations. In Dutch laws, obligations are often expressed without signal words like must or is obliged to. Instead, the obligations are presented as a fact:

No bodies are buried on a closed cemetery.

However, since the obligations are the only sentences lacking all-encompassing patterns, we will assume that any sentence that does not match a pattern is one of these obligations.
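The idea of trying each pattern in turn and falling back to obligation fits in a short sketch. The three patterns here are English stand-ins written for this example; the real classifier works on Dutch text and covers all fourteen classes.

```python
import re

# (class, pattern) pairs, tried in order. Illustrative English patterns only.
PATTERNS = [
    ("definition", re.compile(r"^Under .+ is understood .+", re.IGNORECASE)),
    ("definition", re.compile(r"^This law understands under .+", re.IGNORECASE)),
    ("replacement", re.compile(r"\bis replaced by\b", re.IGNORECASE)),
]


def classify(sentence):
    """Return the class of the first matching pattern, else 'obligation'."""
    for label, pattern in PATTERNS:
        if pattern.search(sentence):
            return label
    # No pattern matched: by assumption, the sentence states an obligation.
    return "obligation"


if __name__ == "__main__":
    print(classify("No bodies are buried on a closed cemetery."))
```

The fallback makes the classifier total: every sentence receives a class, at the price of mislabelling any sentence whose pattern is missing from the list.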

Based on the patterns found, we’ve created a classifier that attempts to sort sentences into these different classes. This classifier has an accuracy of 91%, and we expect that this can be improved a bit further.

(As a side note: For classification tasks such as these, a machine learning approach is often preferred; see, e.g., here. With such an approach, you provide the computer, not with patterns, but with a bunch of sample sentences. The computer will then extract its own patterns from those sentences, and use these to classify any new sentences. We’ve tried this approach as well (using the toolkit WEKA), and reached similarly accurate results.)

Modelling

Having classified the sentences, we now want to create models of the sentences. In essence, this means breaking down each sentence into smaller components and defining relationships between them. In some cases, the patterns used to classify the sentence already give us sufficient information to break up the sentence. Suppose we have a sentence like:

In article 7.12, sub one, second sentence, «article 7.3b» is replaced by: article 7.3c.

We classify this sentence as a replacement because of the text is replaced by. We can then also conclude that the text between angle quotes is the text to be replaced, the text following the colon is the replacing text, and the reference preceding it (which we’ve already detected) is the location where the replacement should take place.
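For this one sentence form, the three elements can be pulled out with a single expression, as in the sketch below. The pattern is written for the English example above and is an assumption about its shape, not the pattern used on the Dutch originals.

```python
import re

# location , «old text» is replaced by: new text.
REPLACEMENT = re.compile(
    r"^(?:In\s+)?(?P<location>.+?),\s*"
    r"«(?P<old>[^»]+)»\s+is replaced by:\s*"
    r"(?P<new>.+?)\.?\s*$"
)


def parse_replacement(sentence):
    """Return location, old text and new text, or None if the form differs."""
    match = REPLACEMENT.match(sentence)
    if match is None:
        return None
    return {
        "location": match.group("location"),
        "old": match.group("old"),
        "new": match.group("new"),
    }


if __name__ == "__main__":
    print(parse_replacement(
        "In article 7.12, sub one, second sentence, "
        "«article 7.3b» is replaced by: article 7.3c."
    ))
```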

This works fine for sentences that are somehow “about” the law. But for sentences that deal with some other domain, such as taxes, traffic, or commerce, we cannot predict all the elements. These sentences could be about anything — and statutes are full of such sentences. For such sentences, we need to follow a generic method. The aim is to model rules as a situation or action that is allowed or not allowed, similar to the models created in the HARNESS system of the ESTRELLA project. For example, for an obligation, we assume that the sentence describes some action that must be done. We try to identify who should be doing the action, and what other elements are involved. Thus, for the sentence:

Our Minister issues a warrant to the negligent person.

we would like to extract the following information:

Obligation
Action: Issue
Agent: Our Minister
Patient: Warrant
Recipient: Negligent person

(Such a table, or frame, is not the same as a computer model, but has all the elements needed to create one.)

Now, identifying these different elements of the sentence (agent, patient, recipient) is something that computer linguists have already worked on for a long time, which means we do not have to start from scratch. Instead, we can use existing parsers to do much of the work for us. For our Dutch laws, we use the Alpino parser. Such a parser will create a parse tree of a sentence. In this parse tree, the sentence will be split up in parts. The parser can identify which part is the subject, the direct object, the indirect object, etc. Based on this information, we can determine the agent, patient, and recipient (so-called semantic roles). In a sentence with a verb in the active voice, the subject is the agent, the direct object is the patient, and the indirect object is the recipient. Furthermore, the parser will determine the relationship between words, such as an adjective that modifies a noun. This information, too, helps us to make more accurate models.
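The active-voice mapping from grammatical to semantic roles is simple enough to sketch. The dictionary below is a hypothetical, already simplified parse result invented for this example; it is not the output format of Alpino or of any other parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """The table from the example above, as a small data structure."""
    sentence_class: str
    action: str
    agent: Optional[str]
    patient: Optional[str]
    recipient: Optional[str]


def frame_from_parse(sentence_class, parse):
    """Map subject/direct object/indirect object onto semantic roles."""
    if parse.get("voice") != "active":
        # In other voices the subject is not the agent; not sketched here.
        raise ValueError("only the active-voice mapping is sketched here")
    return Frame(
        sentence_class=sentence_class,
        action=parse["verb"],
        agent=parse.get("subject"),
        patient=parse.get("direct_object"),
        recipient=parse.get("indirect_object"),
    )


if __name__ == "__main__":
    parse = {  # hypothetical parser output for the warrant sentence
        "voice": "active",
        "verb": "issue",
        "subject": "Our Minister",
        "direct_object": "warrant",
        "indirect_object": "negligent person",
    }
    print(frame_from_parse("obligation", parse))
```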

We start out with the output of these parsers, and then try to extract all terms that have some special significance. If we want an application to compute whether or not a situation is allowed, a word like car can be treated in a generic way, but terms like allowed and not need some special attention.

To Be Continued…

We still need to refine the method for making these models, and evaluate the results. After that, the individual models will need to be merged. But even as things stand now, we think these tools will help with getting legal text from paper into your computer systems.

[Editor’s Note: For more information about this topic, please see Dr. Adam Wyner’s post, Weaving the Legal Semantic Web with Natural Language Processing.]

Emile de Maat is a researcher at the Leibniz Center for Law (University of Amsterdam). His research focuses on the automatic extraction of metadata and meaning from legal sources.


Legislation.gov.uk

free access to law, Legal descriptive metadata, Legal identifiers, Legal knowledge representation, Legal metadata, Legal XML, Legislative information systems, Public access to legal information, Semantic Web and law 19 Responses »

Aug 15 2010

The launch of legislation.gov.uk by The [UK] National Archives marks a step change in public access to a primary source of legal information for citizens in the UK. Legislation.gov.uk is extensive, covering the four jurisdictions that make up the United Kingdom (England, Scotland, Wales and Northern Ireland) and over 800 years of history.

John Sheridan, Head of e-Services and Strategy at The National Archives, writes:

First, some background

We had two objectives with legislation.gov.uk: to deliver a high quality public service for people who need to consult, cite, and use legislation on the Web; and to expose the UK’s Statute Book as data, for people to take, use, and re-use for whatever purpose or application they wish. In particular, our aim was to show how the statute book can contribute to the growing Web of data as well as to the Web of documents.

Legislation.gov.uk replaces two predecessor services the UK government set up to provide access to legislation. The first was created by Her Majesty’s Stationery Office (HMSO), later to become the Office of Public Sector Information (OPSI), which is responsible for the official publication of legislation, and the London, Belfast and Edinburgh Gazettes. The functions of HMSO have been operating from The National Archives since 2006, including the provision of public access to legislation online. HMSO started publishing new legislation on the Web in 1996. Where HMSO and later OPSI provided access to legislation as it was enacted or made, a second service was developed, to provide access to the UK Statute Law Database. This contains revised versions of primary legislation, showing how they have changed over time.

Browsing the many different types of legislation in the UK

As in the United States, most lawyers in the UK rely on pay-for commercial legal research services. The people using the government’s online legislation service are generally not lawyers, but are drawn from a much wider group of people who need to know, cite, or use legislation as part of their job. These can range from police officers, to head teachers, to citizens defending their rights. Our users are people who need to know what a statute says, and who go looking for it using Google. They then quickly find their way to legislation.gov.uk.

What do people think they are seeing?

Before starting work on legislation.gov.uk, we did some research into the users of both the OPSI service and the UK Statute Law Database service. This research showed that they were very well used (over 1.5 million unique visitors per month to www.opsi.gov.uk), but that most of the people accessing legislation on the Web were not clear about the status of the material they were looking at. Our research showed that many people using legislation online assume that what they are looking at is both current and in force, simply because it is on the Web and available from an official source. Often users were accessing the original or as-enacted version of a statute, not knowing that they should be looking at the revised version, or that a revised version even existed.

Intuitive presentation

Our job is to present legislative material in such a way that the context and status of the information are clear. Legislation is complicated to understand; for example, an Act may have multiple sections, each with a different commencement date, or the Act may have prospective provisions. With legislation.gov.uk we have tried to develop a user interface that makes the status of each Act clear, so people know whether the statute they are viewing is current and in force. The usability challenge is to align what people think they are seeing with what they are actually looking at. We have done this by presenting both an original (see, e.g., here) and a latest-available version (see, e.g., here) of each Act, and a toggle between the two.

For more advanced users there is a timeline (see, e.g., here) which can be turned on to see how the legislation has changed and to navigate through an Act at particular points in time, including future or prospective versions.

Point in time navigation and the timeline

Open data

On the surface, legislation.gov.uk is an attractive Website, providing simple and direct access to legislation; at legislation.gov.uk people can view whole Acts, or a particular section, in either HTML (see, e.g., here) or in a print version in PDF (see, e.g., here). To achieve this, under the hood two very different sources of data have been combined. The data model for the original (or as-enacted) versions of legislation is largely driven by the typographic layout of legislative documents. For revised legislation, the data model is largely driven by version control, the management of multiple versions of different segments of a statute at different points in time. Reconciling these two different data models was a prerequisite step to developing our system.

An ‘on the fly’ created PDF

We aimed to make legislation.gov.uk a source of open data from the outset. The importance of open legal data is made powerfully by people like Carl Malamud and the Law.Gov campaign. Our desire to make the statute book available as open data motivated a number of technology choices we made. For example, the legislation.gov.uk Website is built on top of an open Application Programming Interface (API). The same API is available for others to use to access the raw data.

Using the API

The simplest way to get hold of the underlying data on legislation.gov.uk is to go to a piece of legislation on the Website, either a whole item, or a part or section, and just append /data.xml or /data.rdf to the URL. So, the data for, say, Section 1 of the Communications Act 2003, which is at http://www.legislation.gov.uk/ukpga/2003/21/section/1, is available at http://www.legislation.gov.uk/ukpga/2003/21/section/1/data.xml. We have taken a similar approach with lists, both in browse and search results. When looking at any list of legislation on legislation.gov.uk, it is easy to view the data. Simply append /data.feed to return that list in ATOM. (See, e.g., here.)
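For anyone scripting against the site, the suffix rule amounts to a one-line string operation. A minimal sketch follows; the function name and the fmt parameter are made up for this example, and only the three suffixes mentioned above are covered.

```python
# Suffixes described in the text: data.xml, data.rdf, and data.feed (lists).
SUFFIXES = {"xml": "/data.xml", "rdf": "/data.rdf", "feed": "/data.feed"}


def data_url(page_url, fmt="xml"):
    """Build the data URL for a legislation page or list URL."""
    if fmt not in SUFFIXES:
        raise ValueError(f"unknown format: {fmt!r}")
    return page_url.rstrip("/") + SUFFIXES[fmt]


if __name__ == "__main__":
    section = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
    print(data_url(section))         # XML for Section 1
    print(data_url(section, "rdf"))  # RDF for the same section
```

The resulting URL can be fetched with any HTTP client; fetching is left out here so that the sketch has no network dependency.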

Open standards have played an important role throughout the development of legislation.gov.uk. All the data is held in XML, using a native XML database. The application logic is similarly constructed using open standards, in XSLTs and XQueries. Data and application portability were key objectives. We made considerable use of open source software like Orbeon Forms, Squid, and Apache.

The XML conforms to the Crown Legislation Markup Language (CLML) and associated schema. More general interchange formats for legislation such as CEN MetaLex lack the expressive power we need for UK legislation, but could relatively easily be wrapped around the XML we are making available. We have sought to surface richer metadata about legislation using RDF, but we would welcome feedback from users of the XML data about whether a MetaLex wrapper would be useful. (Note: We have used the MetaLex vocabulary in our RDF along with FRBR, as discussed below.) Similarly, it should be relatively easy to add a wrapper for the OAI-PMH protocol on top of the API we have built. We are not yet clear who would make use of such a service, if we built one, or whether we should leave the creation of an OAI-PMH interface to others. It is another open issue where we would welcome some feedback.

Persistent URIs

A major influence on legislation.gov.uk was a blog posting by Rick Jelliffe for O’Reilly’s XML.com. Jelliffe writes about something he calls PRESTO. He describes this as a system for legislation and public information in which “all documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs.”

Persistent URIs to pieces of legislation are very important, as they are to sources of law more generally. Initiatives like LegisLink, which Joe Carmel has written about here, attempt to retrofit a reliable naming scheme for legislation onto existing document-based systems. The URN:LEX namespace aims to facilitate the process of creating URIs for legal sources independent of a document’s online availability, location, and access mode.

We wanted to create high quality, persistent URIs for UK legislation from the outset. There are a number of different ways one might assign an unequivocal identifier to a legislative document. We have decided to use HTTP URIs: we see no particular advantage in using URNs instead, and indeed some disadvantages. Most importantly, HTTP URIs are actionable names. The advantage is that there is a built-in, ready-made, widely deployed and cost-effective resolution mechanism for resolving the identifier to a document, and a document to a representation. Having said that, we would consider supporting URN:LEX URNs in addition to our own URI Set, and would greatly welcome feedback from the community on this issue, so please do comment if you have a view.

So, it follows, there are three types of URI for legislation on legislation.gov.uk, namely, identifier URIs, document URIs and representation URIs. Identifier URIs are for the ‘concept’ of a piece of legislation, how it was, how it is, and how it will be. (See, e.g., here.) Our use of these follows the Linked Data principles: the identifier URI is for a so-called non-information resource, something which can’t be conveyed in an electronic message. In other words, the URI is for the notion of a piece of legislation, rather than a particular rendition of it in a document. These URIs have been designed following the guidelines the UK Government has created for URI Sets, which our work helped to shape.

With legislation.gov.uk we support content negotiation and follow the httpRange-14 resolution approach: a request for the ‘non-information resource’ URI receives a 303 response that redirects to a document URI.

Our document URIs refer to particular documents on the Web, for example the current, in-force version of a particular section of an Act. (See, e.g., here.) Crucially there are also point-in-time URIs for documents, which show how that Act stood on a particular date (/yyyy-mm-dd) (see, e.g., here), or how it was when originally made (/enacted) (see, e.g., here). For any document we can return different representations or formats: a Web page on legislation.gov.uk, the underlying XML, a PDF, an HTML snippet, or even some RDF metadata. We recommend that people cite UK legislation in HTML by pointing to the identifier URI and by using the rel="cite" attribute in the anchor tag.
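The version suffixes can be illustrated with a small helper. This is a sketch of the URI pattern as described in the paragraph above, not code from the site; the function name and its version parameter are assumptions of this example.

```python
from datetime import date


def document_uri(base_uri, version=None):
    """Build a document URI for the current, dated, or enacted version.

    version is None (current version), a datetime.date (point in time),
    or the string "enacted" (as originally made).
    """
    base = base_uri.rstrip("/")
    if version is None:
        return base
    if isinstance(version, date):
        return f"{base}/{version.isoformat()}"  # the /yyyy-mm-dd form
    if version == "enacted":
        return f"{base}/enacted"
    raise ValueError(f"unsupported version: {version!r}")


if __name__ == "__main__":
    section = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
    print(document_uri(section))
    print(document_uri(section, date(2010, 8, 15)))
    print(document_uri(section, "enacted"))
```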

Of course, we quickly discovered, it is one thing to suggest a design approach like PRESTO, and quite another to actually implement it. Jeni Tennison, who, working as a consultant to The Stationery Office, devised the URI Set for legislation (and much else about the legislation.gov.uk system), has blogged about the limitations of PRESTO and XPath-based URLs. I hope Jeni will find the time to blog some more about legislation.gov.uk, as there are many stories to be told.

One of the earliest pieces of design work we did for legislation.gov.uk was the URI Set. We wanted to follow PRESTO principles, but also account for changes over time, and for some of the peculiarities of UK legislation, in particular different geographic extents. (See, e.g., here.) PRESTO thinking is very evident on legislation.gov.uk; just look at the URLs as you move through the site.

Linked Data

We were also keen that the UK’s Statute Book make a contribution to the growing Web of Linked Data. The UK government is working hard to publish government data using Linked Data standards as part of work on data.gov.uk. The idea of the Web of Linked Data is to connect related information across the Web based on its meaning. In practice this means creating names for things (by ‘thing’ I mean anything: people, places, ideas) using HTTP, and when someone requests some information about that thing, returning data about it, ideally using RDF.

Legislation can make an important contribution to the Web of Linked Data. First, many important concepts and ideas are formally defined by statute. For example, there are 27 types of school in the UK and each one has a statutory definition. (See, e.g., here and here.) What it means to be a private limited company is again defined by statute, as are the UK’s eight data protection principles. One of our objectives with legislation.gov.uk is to enable people creating vocabularies and ontologies to exploit these definitions. This can be done, for example, by using the skos:definition property, to link terms in a vocabulary to the statute. The idea is to ease the process of rooting the Semantic Web in legally defined concepts. Part of the value of this linking is that it enables automatic checking to determine whether a part of the statute book has been repealed, in which case the related concept no longer exists. Crucially, legislation.gov.uk gives accurate information about when a section is repealed, by what piece of legislation, and when that repeal comes into force.
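To make the skos:definition idea concrete, here is a minimal sketch that serialises one such link as an N-Triples line using only the Python standard library. Both the vocabulary term and the statute URI are hypothetical placeholders on example.org; the only real identifier is the SKOS property IRI.

```python
SKOS_DEFINITION = "http://www.w3.org/2004/02/skos/core#definition"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one triple of three IRIs as an N-Triples line."""
    for iri in (subject, predicate, obj):
        # Reject characters that cannot appear unescaped inside <...>.
        if any(ch in iri for ch in '<>" '):
            raise ValueError(f"not a plain IRI: {iri!r}")
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical term in someone's vocabulary of school types.
term = "http://example.org/vocab/school-types#academy"
# Hypothetical stand-in for a legislation.gov.uk identifier URI.
statute_section_uri = "http://example.org/legislation/id/act/section/1"

print(ntriple(term, SKOS_DEFINITION, statute_section_uri))
```

A consumer holding this triple could then look up the statute URI to see whether the defining section is still in force.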

At the moment, the RDF from legislation.gov.uk is limited to largely bibliographic information. We have made use of the Functional Requirements for Bibliographic Records (FRBR) and the MetaLex vocabularies, primarily to relate the different types of resource we are making available. FRBR has the notion of a work, expressions of that work, manifestations of those expressions, and items. Similarly, MetaLex has the concepts of a BibliographicWork and BibliographicExpression. In the context of legislation.gov.uk, the identifier URIs relate to the work. Different versions of the legislation (current, original, different points in time, or prospective) relate to different expressions. The different formats (HTML, HTML Snippets, XML, and PDF) relate to the different manifestations. We have also made extensive use of Dublin Core Terms, for example to reflect that different versions apply to different geographic extents. This is important as, for example, the same section of a statute may have been amended in one way as it applies in Scotland and in another way for England and Wales. We think FRBR, MetaLex, and Dublin Core Terms have all worked well, individually and in combination, for relating the different types of resource that we are making available.
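The work / expression / manifestation layering can be pictured with a few plain data classes. This is only an illustrative sketch: the class and field names are assumptions, not the FRBR or MetaLex vocabularies themselves, and the URI and dates are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifestation:
    fmt: str  # a concrete format, e.g. "HTML", "XML", "PDF"


@dataclass
class Expression:
    version: str  # "current", "enacted", or a yyyy-mm-dd date
    extent: str   # geographic extent, e.g. "Scotland"
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    identifier_uri: str
    expressions: List[Expression] = field(default_factory=list)

    def versions_for(self, extent: str) -> List[str]:
        """List the versions of this work that apply to one extent."""
        return [e.version for e in self.expressions if e.extent == extent]


# Invented data: one section with versions that differ by extent.
work = Work("http://example.org/id/act/section/1", [
    Expression("enacted", "England and Wales", [Manifestation("XML"), Manifestation("PDF")]),
    Expression("2010-01-01", "Scotland", [Manifestation("HTML")]),
    Expression("2010-01-01", "England and Wales", [Manifestation("HTML")]),
])

print(work.versions_for("Scotland"))
```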

One challenge we have is with changes to legislation that have yet to be applied to the data by the editorial team. Since we know what these effects are, we have also tried to represent this in RDF. We have used the MetaLex vocabulary to do this, but the result is complicated to interpret, and thus we suspect difficult for users of the data. MetaLex does not aid the elegant expression of amendment information (such as: statute A is changed by statute B, but only when commencement order C brings that change into force). We will be developing our own light-weight ontology for expressing some of these relationships, with the primary focus on ease of querying our data, rather than creating an ontology with the expressive power to be a cross-jurisdictional model.

It should then be possible to align this ontology with others post hoc. Our current use of RDF — and the potential to do more — is another issue where we would welcome feedback from the community.
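The kind of relationship at issue (statute A is changed by statute B, but only once commencement order C brings the change into force) can be sketched as a small record type that is easy to query. The class, its fields, and the sample values are all hypothetical and are not the ontology under development.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Amendment:
    affected: str                   # statute A, the provision being changed
    affecting: str                  # statute B, the provision making the change
    commenced_by: Optional[str]     # commencement order C, if one exists yet
    in_force_from: Optional[date]   # None until a commencement date is known

    def in_force_on(self, day: date) -> bool:
        """True if the change has been brought into force by the given day."""
        return self.in_force_from is not None and day >= self.in_force_from


# Invented examples: one commenced change, one still awaiting commencement.
change = Amendment("Act A s.1", "Act B s.5", "Order C", date(2011, 4, 1))
pending = Amendment("Act A s.2", "Act B s.6", None, None)

print(change.in_force_on(date(2011, 6, 1)))
print(pending.in_force_on(date(2011, 6, 1)))
```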

Early adopters

People have already started to make use of the legislation.gov.uk URIs to support their Linked Data. One example is a project by ESD Toolkit. They have created a SKOS vocabulary for all the different types of service that Local Authorities need to provide. They have linked this vocabulary to the powers and duties placed on Local Authorities in the legislation, using legislation.gov.uk identifier URIs. They have also used the API to pull back some of the text of the relevant statutes.

The future

We think there is huge potential over the next few years in the development of “accountable systems”. These are systems that are explicitly aware of statutory and other legal requirements and are able to process information explicitly in a way that complies with the (ever-changing) law. Here the legislation URIs can help enormously, either for people seeking to develop such accountable systems or any time someone wants to integrate an external system with the official source for statutory information. If the API is used in this way, we will need to consider carefully whether, and if so, how, the data is authenticated. We are not currently supplying digitally signed versions of UK legislation (unlike the GPO in the US) but we will be supporting the use of HTTPS, to provide a reasonable level of secure access to the data. However, if the data starts to be increasingly used in a new generation of accountable systems, we may need to address authenticity, with a view to increasing the guarantees we can make over the data.

There is much more we can do with legislation as data. Parts of the statute book are surprisingly well structured. For example, every year there are one or more Appropriation Acts. These typically contain a schedule with a table listing each government department, the amount allocated to it by Parliament for the year, and what that department’s objectives are (see, e.g., here). It wouldn’t take much to create an XSLT just for these tables in the Appropriation Acts, from the XML provided by the API, to extract this data from all the Appropriation Acts, and publish that as Linked Data. There are many other examples of almost-structured data in legislation, waiting to be freed by developers, now that they have easy access to the underlying source.
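The extraction step could equally be prototyped outside XSLT. The sketch below uses Python's standard xml.etree.ElementTree instead, and runs against an invented table layout: the element names (schedule, row, department, amount, objective) and the figures are assumptions for illustration, not the XML actually served by the API.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a schedule table in an Appropriation Act.
SAMPLE = """
<schedule>
  <table>
    <row><department>Department X</department><amount>1000</amount><objective>Objective one</objective></row>
    <row><department>Department Y</department><amount>2500</amount><objective>Objective two</objective></row>
  </table>
</schedule>
"""


def extract_allocations(xml_text: str) -> list:
    """Turn each table row into a plain record ready for republication."""
    root = ET.fromstring(xml_text)
    return [
        {
            "department": row.findtext("department"),
            "amount": int(row.findtext("amount")),
            "objective": row.findtext("objective"),
        }
        for row in root.iter("row")
    ]


for record in extract_allocations(SAMPLE):
    print(record)
```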

We see this as a start. There is still much to do if we are to realise the potential of the statute book as a public source of data. We are aiming to improve the modelling and the quantity of RDF data we make available about legislation, but it’s what others will do with the data that is really interesting. Now the UK has opened its statute book as Linked Data, we are keen to share our work with other governments, and to engage with academics in the legal informatics community and others with an interest in exploiting this rich source of information.

John Sheridan is Head of e-Services and Strategy at The [UK] National Archives, where he leads the team responsible for legislation.gov.uk. He is a specialist in official publishing on the Web, and in using Linked Data standards for government information. He also co-chairs the W3C eGovernment Interest Group.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Weaving the Legal Semantic Web with Natural Language Processing

Annotation of legal texts, information retrieval, Legal knowledge representation, Legal semantic web, Legal text mining, Legal text processing, Legal XML, natural language processing, Semantic annotation of legal texts, Semantic Web and law 4 Responses »

May 17, 2010

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, or Japanese), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these is a large, complex, and heterogeneous topic of research; in this short post, we can only hope to touch on a fragment, and that fragment is heavily biased toward our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appears to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document (other than the single outermost pair) is included in another pair of tags and there is no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas, which specify which tags may appear and how they may be nested.
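The nesting requirement is something any XML parser enforces. As a minimal sketch using Python's standard xml.etree.ElementTree, the helper below (is_well_formed is a name chosen here for illustration) accepts the properly nested fragment and rejects the crossed one; the names filled in for the "..." data are taken from the running example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


nested = "<fullname><firstname>Jane</firstname>, <lastname>Smith</lastname></fullname>"
crossed = "<fullname><firstname>Jane<lastname></firstname> Smith</lastname></fullname>"

print(is_well_formed(nested))   # True
print(is_well_formed(crossed))  # False
```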

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.
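To illustrate the query and the inference on the annotated fragment, here is a minimal sketch in Python using the standard xml.etree.ElementTree. It stands in for a real query language and rule language (it is not RuleML or SWRL); the helper names full_name and privileged_pairs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

CASE = """
<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>, <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>, <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>
"""


def full_name(role_element) -> str:
    """Join the first and last name found under a role element."""
    name = role_element.find("fullname")
    return f"{name.findtext('firstname')} {name.findtext('lastname')}"


def privileged_pairs(root) -> list:
    """Rule: an attorney for a client may not divulge that client's secrets."""
    pairs = []
    for rel in root.iter("legalrelationship"):
        client, attorney = rel.find("plaintiff"), rel.find("attorney")
        if client is not None and attorney is not None:
            pairs.append((full_name(attorney), full_name(client)))
    return pairs


root = ET.fromstring(CASE)

# Query: who is the attorney in the case?
print(full_name(root.find(".//attorney")))  # Jane Smith

# Inference: apply the privilege rule to every annotated relationship.
for attorney, client in privileged_pairs(root):
    print(f"{attorney} cannot divulge {client}'s secrets")
```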

Though it may seem here like too much technology for such a small and obvious task, it is essential when we scale up our queries and inferences to large corpora of legal texts (hundreds of thousands if not millions of documents), which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at some time in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com.
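Issue (2) can be made concrete with a deliberately naive normaliser that reduces each mention to a comparable key. This is a sketch under strong assumptions (a hypothetical title list, and the last token treated as the surname); it also shows why issue (3) is hard, since the same key would wrongly merge two different people named Smith.

```python
import re

# Hypothetical list of titles to ignore when comparing mentions.
TITLES = {"attorney", "ms", "mr", "dr"}


def name_key(mention: str) -> str:
    """Reduce a name mention to its lower-cased last token,
    dropping titles and single-letter initials."""
    tokens = [t.strip(".").lower() for t in re.split(r"\s+", mention.strip())]
    tokens = [t for t in tokens if t and t not in TITLES and len(t) > 1]
    return tokens[-1] if tokens else ""


mentions = ["Jane Smith", "Jane R. Smith", "Smith", "Attorney Smith"]
print({name_key(m) for m in mentions})  # {'smith'}
```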

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as:

grammatical constructions (passive or active sentence forms, quotation, reference to other individuals, and so on),
grammatical relations among terms (e.g., whether an individual is the agent or target of some action),
ontological relations (e.g., classes and subclasses of experts, or the relationships among courts in the judicial hierarchy),
relationships among elements (e.g., who works for what organization), or
high-level patterns such as legal arguments (e.g., expert testimony) and fact patterns (e.g., culpable intent).

People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Finally, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.
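The sparsity point can be seen even in a toy sample: a handful of words account for most occurrences, while most distinct words appear only once. The sketch below counts this with Python's collections.Counter; the two sample sentences are invented for illustration.

```python
import re
from collections import Counter


def frequency_profile(text: str):
    """Return the three most common words, the number of words seen
    exactly once, and the number of distinct words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    singletons = sum(1 for c in counts.values() if c == 1)
    return counts.most_common(3), singletons, len(counts)


sample = ("The plaintiff alleges that the defendant breached the contract. "
          "The defendant denies that the contract was valid.")

top, once, distinct = frequency_profile(sample)
print(top[0], once, distinct)
```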

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

Generation – converting information in a database into natural language.
Understanding – converting natural language into a machine-readable form.
Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
Information Extraction – identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There is a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are something of a black box, in that we cannot really inspect the rules that are learned or reuse them further.
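One simple instance of a knowledge-light comparison is to treat each text as a bag of word counts and measure the cosine of the angle between two such vectors. The sketch below is illustrative only; the function names and sample sentences are assumptions, and real systems use far larger corpora and richer weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lower-cased alphabetic tokens; no linguistic knowledge is used."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc1 = bag_of_words("The attorney represented the plaintiff in court.")
doc2 = bag_of_words("The plaintiff was represented by an attorney.")
doc3 = bag_of_words("Parliament allocated funds to each department.")

print(cosine(doc1, doc2) > cosine(doc1, doc3))  # True
```

The measure sees only shared surface forms, which is why sparse or widely dispersed patterns are a problem for it.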

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

Sentence segmentation – dividing text into sentences.
Tokenisation – words identified by spaces between them.
Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
Named entity recognition – the entities in the text such as organisations, people, and places.
Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.
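The cascade idea can be sketched in a few lines: each component reads the annotations left by earlier ones and adds its own. This is plain Python written for illustration and is not GATE's API; the component names and the tiny gazetteer list are assumptions, and only three of the stages listed above are imitated.

```python
import re
from typing import Callable, Dict, List

Doc = Dict[str, object]  # the text plus whatever annotations accumulate


def segment(doc: Doc) -> Doc:
    """Sentence segmentation: split after sentence-final punctuation."""
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"].strip()) if s]
    return doc


def tokenise(doc: Doc) -> Doc:
    """Tokenisation: words and punctuation marks, per sentence."""
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


PEOPLE = {"Jane Smith", "Harris Hill"}  # hypothetical gazetteer list


def tag_people(doc: Doc) -> Doc:
    """Named entity recognition by simple list look-up."""
    doc["people"] = sorted(p for p in PEOPLE if p in doc["text"])
    return doc


def run(text: str, components: List[Callable[[Doc], Doc]]) -> Doc:
    """Feed the document through each component in order."""
    doc: Doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


doc = run("Harris Hill, plaintiff. Jane Smith, attorney for the plaintiff.",
          [segment, tokenise, tag_people])
print(doc["sentences"])
print(doc["people"])
```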

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

Case citation.
Names of parties.
Roles of parties, meaning plaintiff or defendant.
Type of court.
Names of judges.
Names of attorneys.
Roles of attorneys, meaning the side they represent.
Final decision.
Cases cited.
Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)
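To connect this back to the earlier example, the following sketch shows how two hand-written rules could turn the plain text about Harris Hill and Jane Smith into the XML annotation shown in the Semantic Web section. The regular expressions and function names are assumptions made for illustration; they are far cruder than the lists and rules used in the study.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: a capitalised first and last name followed by a role phrase.
NAME = r"(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)"
PLAINTIFF_RULE = re.compile(NAME + r", plaintiff\.")
ATTORNEY_RULE = re.compile(NAME + r", attorney for the plaintiff\.")


def add_person(parent, tag: str, match) -> None:
    """Add <tag><fullname><firstname/><lastname/></fullname></tag> under parent."""
    role = ET.SubElement(parent, tag)
    full = ET.SubElement(role, "fullname")
    ET.SubElement(full, "firstname").text = match["first"]
    ET.SubElement(full, "lastname").text = match["last"]


def annotate(text: str):
    """Apply each rule to the text and build the annotated XML tree."""
    case = ET.Element("legalcase")
    rel = ET.SubElement(case, "legalrelationship")
    for tag, rule in (("plaintiff", PLAINTIFF_RULE), ("attorney", ATTORNEY_RULE)):
        m = rule.search(text)
        if m:
            add_person(rel, tag, m)
    return case


tree = annotate("Harris Hill, plaintiff.\nJane Smith, attorney for the plaintiff.")
xml_out = ET.tostring(tree, encoding="unicode")
print(xml_out)
```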

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above:

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

