elegislation systems » VoxPopuLII


Opening Up State Legal Data

Dec 08, 2012

There has been a series of efforts to create a national legislative data standard – one master XML format to which all states will adhere for bills, laws, and regulations. Those efforts have gone poorly.

Few states provide bulk downloads of their laws. None provide APIs. Although nearly all states provide websites for people to read state laws, they are all objectively terrible, in ways that demonstrate that they were probably pretty impressive in 1995. Despite the clear need for improved online display of laws, the lack of a standard data format and the general lack of bulk data have allowed precious few efforts in the private sector. (Notably, there is Robb Shecter’s WebLaws.org, which provides vastly improved experiences for the laws of California, Oregon, and New York. There was also a site built experimentally by Ari Hershowitz that was used as a platform for last year’s California Laws Hackathon.)

A significant obstacle to prior efforts has been the perceived need to create a single standard, one that will accommodate the various textual legal structures that are employed throughout government. This is a significant practical hurdle on its own, but failure is all but guaranteed by also engaging major stakeholders and governments to establish a standard that will enjoy wide support and adoption.

What if we could stop letting the perfect be the enemy of the good? What if we ignore the needs of the outliers, and establish a “good enough” system, one that will at first simply work for most governments? And what if we completely skip the step of establishing a standard XML format? Wouldn’t that get us something, a thing superior to the nothing that we currently have?

The State Decoded
This is the philosophy behind The State Decoded. Funded by the John S. and James L. Knight Foundation, The State Decoded is a free, open source program to put legal codes online, and it does so by simply skipping over the problems that have hampered prior efforts. The project does not aspire to create any state law websites on its own but, instead, to provide the software to enable others to do so.

Still in its development (it’s at version 0.4), The State Decoded leaves it to each implementer to gather up the contents of the legal code in question and interface it with the program’s internal API. This could be done via screen-scraping off of an existing state code website, modifying the parser to deal with a bulk XML file, converting input data into the program’s simple XML import format, or by a few other methods. While a non-trivial task, it’s something that can be knocked out in an afternoon, thus avoiding the need to create a universal data format and to persuade Wexis to provide their data in that format.

The magic happens after the initial data import. The State Decoded takes that raw legal text and uses it to populate a complete, fully functional website for end-users to search and browse those laws. By packaging the Solr search engine and employing some basic textual analysis, every law is cross-referenced with other laws that cite it and laws that are textually similar. If there exists a repository of legal decisions for the jurisdiction in question, that can be incorporated, too, displaying a list of the court cases that cite each section. Definitions are detected, loaded into a dictionary, and make the laws self-documenting. End users can post comments to each law. Bulk downloads are created, letting people get a copy of the entire legal code, its structural elements, or the automatically assembled dictionary. And there’s a REST-ful, JSON-based API, ready to be used by third parties. All of this is done automatically, quickly, and seamlessly. The time elapsed varies, depending on server power and the length of the legal code, but it generally takes about twenty minutes from start to finish.
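The post does not spell out how the cross-referencing works beyond “basic textual analysis,” so the following is only a minimal sketch of one way it could be done: scan each section for citations and invert the result. The section-number pattern, the `build_cross_references` name, and the sample sections are all invented for illustration and are not taken from The State Decoded.

```python
import re
from collections import defaultdict

# Assumed citation style: "§ 1-2" or "§ 18.2-31". Every legal code would
# need its own pattern; this one is purely illustrative.
SECTION_RE = re.compile(r"§\s*(\d+(?:\.\d+)?-\d+(?:\.\d+)?)")

def build_cross_references(laws):
    """Map each section number to the sorted list of sections that cite it."""
    cited_by = defaultdict(set)
    for number, text in laws.items():
        for target in SECTION_RE.findall(text):
            # Ignore self-citations and citations to sections we do not hold.
            if target != number and target in laws:
                cited_by[target].add(number)
    return {target: sorted(sources) for target, sources in cited_by.items()}

# Invented placeholder sections, not real statutory text.
laws = {
    "1-1": "Definitions used in this chapter.",
    "1-2": "Applies the definitions in § 1-1.",
    "1-3": "Exceptions to § 1-1 and § 1-2; see also § 9-9.",
}
cross_refs = build_cross_references(laws)
```

Here `cross_refs["1-1"]` lists the two sections that cite § 1-1, which is the kind of “cited by” list a section page could display.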

The State Decoded is a free program, released under the GNU General Public License. Anybody can use it to make legal codes more accessible online. There are no strings attached.

It has already been deployed in two states, Virginia and Florida, despite not actually being a finished project yet.

State Variations
The striking variations in the structures of legal codes within the U.S. required the establishment of an appropriately flexible system to store and render those codes. Some legal codes are broad and shallow (e.g., Louisiana, Oklahoma), while others are narrow and deep (e.g., Connecticut, Delaware). Some list their sections by natural sort order, some in decimal, a few arbitrarily switch between the two. Many have quirks that will require further work to accommodate.
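To make the sort-order problem concrete, here is a small sketch of the two orderings applied to the same section numbers. The `natural_key` helper and the sample numbers are illustrative assumptions, not code from The State Decoded.

```python
import re
from decimal import Decimal

def natural_key(section):
    """Split a section number into text and integer runs so that
    digit runs compare numerically ("10" sorts after "9")."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", section)]

sections = ["1.10", "1.2", "1.9"]

# Natural order: the part after the dot is a counter (2, 9, 10).
natural_order = sorted(sections, key=natural_key)

# Decimal order: the whole number is a decimal fraction (1.10 < 1.2 < 1.9).
decimal_order = sorted(sections, key=Decimal)
```

The same three sections come out in a different sequence under each rule, which is why a code that switches between the two cannot be sorted by a single key.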

For example, California does not provide a catch line for their laws, but just a section number. One must read through a law to know what it actually does, rather than being able to glance at the title and get the general idea. Because this is a wildly impractical approach for a state code, the private sector has picked up the slack – Westlaw and LexisNexis each write their own titles for those laws, neatly solving the problem for those with the financial resources to pay for those companies’ offerings. To handle a problem like this, The State Decoded either needs to be able to display legal codes that lack section titles, or pointedly not support this inferior approach, and instead support the incorporation of third-party sources of title. In California, this might mean mining the section titles used internally by the California Law Revision Commission, and populating the section titles with those. (And then providing a bulk download of that data, allowing it to become a common standard for California’s section titles.)

Many state codes have oddities like this. The State Decoded combines flexibility with open source code to make it possible to deal with these quirks on a case-by-case basis. The alternative approach is too convoluted and quixotic to consider.

Regulations
There is strong interest in seeing this software adapted to handle regulations, especially from cash-strapped state governments looking to modernize their regulatory delivery process. Although this process is still in an early stage, it looks like rather few modifications will be required to support the storage and display of regulations within The State Decoded.

More significant modifications would be needed to integrate registers of regulations, but the substantial public benefits that would provide make it an obvious and necessary enhancement. The present process required to identify the latest version of a regulation is the stuff of parody. To select a state at random, here are the instructions provided on Kansas’s website:

To find the latest version of a regulation online, a person should first check the table of contents in the most current Kansas Register, then the Index to Regulations in the most current Kansas Register, then the current K.A.R. Supplement, then the Kansas Administrative Regulations. If the regulation is found at any of these sequential steps, stop and consider that version the most recent.

If Kansas has electronic versions of all this data, it seems almost punitive not to put it all in one place, rather than forcing people to look in four places. It seems self-evident that the current Kansas Register, the Index to Regulations, the K.A.R. Supplement, and the Kansas Administrative Regulations should have APIs, with a common API atop all four, which would make it trivial to present somebody with the current version of a regulation with a single request. By indexing registers of regulations in the manner that The State Decoded indexes court opinions, it would at least be possible to show people all activity around a given regulation, if not simply show them the present version of it, since surely that is all that most people want.
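As a rough sketch of what a common layer atop the four sources might look like, the function below walks them in the order the Kansas instructions give and stops at the first hit. No such APIs exist today, so the lookups here are plain in-memory dictionaries, and the regulation IDs and texts are invented placeholders.

```python
def latest_version(regulation_id, sources):
    """Return (source name, text) from the first source holding the regulation.

    `sources` must be ordered from most to least current, mirroring the
    sequence in the Kansas instructions. Returns None if nothing is found.
    """
    for name, lookup in sources:
        text = lookup(regulation_id)
        if text is not None:
            return name, text
    return None

# Placeholder data standing in for the four publications.
sources = [
    ("Kansas Register table of contents", {}.get),
    ("Kansas Register Index to Regulations", {"1-1-1": "amended text"}.get),
    ("K.A.R. Supplement", {"2-2-2": "supplement text"}.get),
    ("Kansas Administrative Regulations",
     {"1-1-1": "older text", "2-2-2": "older text", "3-3-3": "base text"}.get),
]
hit = latest_version("1-1-1", sources)
```

In this toy data, regulation 1-1-1 appears in two places, and the function returns the copy from the Index to Regulations because that source is checked first.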

A Tapestry of Data
In a way, what makes The State Decoded interesting is not anything that it actually does, but instead what others might do with the data that it emits. By capitalizing on the program’s API and healthy collection of bulk downloads, clever individuals will surely devise uses for state legal data that cannot presently be envisioned.

The structural value of state laws is evident when considered within the context of other open government data.

Major open government efforts are confined largely to the upper-right quadrant of this diagram – those matters concerned with elections and legislation. There is also some excellent work being done in opening up access to court rulings, indexing scholarly publications, and nascent work in indexing the official opinions of attorneys general. But the latter group cannot be connected to the former group without opening up access to state laws. Courts do not make rulings about bills, of course – it is laws with which they concern themselves. Law journals cite far more laws than they do bills. To weave a seamless tapestry of data that connects court decisions to state laws to legislation to election results to campaign contributions, it is necessary to have a source of rich data about state laws. The State Decoded aims to provide that data.

Next Steps
The most important next step for The State Decoded is to complete it, releasing a version 1.0 of the software. It has dozens of outstanding issues – both bug fixes and new features – so this process will require some months. In that period, the project will continue to work with individuals and organizations in states throughout the nation who are interested in deploying The State Decoded to help them get started.

Ideally, The State Decoded will be obviated by states providing both bulk data and better websites for their codes and regulations. But in the current economic climate, neither is likely to be prioritized within state budgets, so unfortunately there’s liable to remain a need for the data provided by The State Decoded for some years to come. The day when it is rendered useless will be a good day.

Waldo Jaquith is a website developer with the Miller Center at the University of Virginia in Charlottesville, Virginia. He is a News Challenge Fellow with the John S. and James L. Knight Foundation and runs Richmond Sunlight, an open legislative service for Virginia. Jaquith previously worked for the White House Office of Science and Technology Policy, for which he developed Ethics.gov, and is now a member of the White House Open Data Working Group.
[Editor’s Note: For topic-related VoxPopuLII posts please see: Ari Hershowitz & Grant Vergottini, Standardizing the World’s Legal Information – One Hackathon At a Time; Courtney Minick, Universal Citation for State Codes; John Sheridan, Legislation.gov.uk; and Robb Shecter, The Recipe for Better Legal Information Services.]

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

LexML Brazil Project

elegislation, elegislation systems, information retrieval, Legal identifiers, Legal metadata, Legal ontologies, Legal text processing, Legal XML, Legislative information systems, open source software, search

Oct 15, 2010

This post is divided into three sections. The first introduces the LexML Brazil Project and its unified search portal; the second presents some aspects of semantic interoperability; and the third describes the project’s current work and future direction.

Before going on to the aforementioned subjects, a few words about Brazil and its legislative and legal systems are necessary. Brazil is a country of continental proportions, composed of 27 states and more than five thousand municipalities, or cities, as in Brazil no distinction is made between town and city. As a federative system, each state and municipality has its own legislative chamber. While states and cities follow a unicameral system, the Federation itself has a bicameral system, with the National Congress divided into a Chamber of Deputies and the Federal Senate. These legislatures generate a great number of laws, or normative acts. The abundance of normative acts is very significant, considering that, in contrast with Common Law systems, Brazil’s legal system, based on the Civil Law, is characterized by the predominance of normative acts.

According to Edilenice Passos, “the proliferation of normative acts, of higher or lower hierarchy, eventually causes total chaos, for this big mass of juridical documents hampers the work of lawyers, of researchers, and of the very citizens, who are ruled by Brazilian laws.” Edilenice Passos also cites Arnoldo Wald, who, in 1969, was already alerting Brazilians that “the true legislative labyrinth created as a result of an inflation of statutes passed in recent years has turned the ruling Brazilian law into a patchwork, in which the mere legislative updating becomes a daily torture for a lawyer and a judge who are searching for the rules applicable to a specific subject, from among acts, supplementary acts, institutional acts, decree-laws, and other normative acts.”

Almost all Brazilian legal and legislative information is available through the Internet. However, this information is distributed among several thousand sites, each containing documents produced by a specific government institution. Thus, the relationships between acts of different institutions are not explicitly available, making it very hard to understand this “legal patchwork.”

Nowadays, much time is lost looking for this information, filtering the results of search engines. As Roy Tennant says, “Librarians like to search; everyone else likes to find,” and further adds: “People generally want to find everything they can on a topic, ranked by relevance and displayed in ways that make it easy to narrow in on their goal.”

Born to address these issues, LexML Brazil is an information network that aims to organize Brazil’s legislative and legal information. The project is an initiative of the “Comunidade TI Controle” (IT Control Community) and is being implemented by the Brazilian Federal Senate, through PRODASEN (the Senate’s special secretariat for information systems) and Interlegis (a virtual community of Brazilian legislatures).

LexML Brazil’s first product is the Legislative and Legal Information Portal, which opened on June 30, 2009, indexing 1.28 million documents. In September 2010, its index ranged through more than 1.5 million documents. By indexing the metadata collected from several institutions using the OAI-PMH protocol, the portal unifies access to a variety of legislative and legal information sources, which is a step toward the goal of guaranteeing Brazilians’ constitutional right of access to information.
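For readers unfamiliar with OAI-PMH, harvesting comes down to issuing HTTP requests with a `verb` parameter and following resumption tokens until the provider has no more records. The sketch below only builds the request URLs. The base URL is a placeholder, and the choice of `oai_dc` is an assumption (it is the one format every OAI-PMH repository must support); the post does not say which metadata format the LexML participants expose.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: when continuing a
        # harvest it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

# Hypothetical provider endpoint, used only for illustration.
first_page = list_records_url("https://provider.example/oai")
next_page = list_records_url("https://provider.example/oai",
                             resumption_token="abc123")
```

A harvester would fetch `first_page`, read the `resumptionToken` element from the XML response, and keep requesting pages until the token comes back empty.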

LexML Portal

The LexML Portal home page layout is very simple and is similar to Google’s main page. At this screen, it is possible to restrict the search to Legislation, Jurisprudence, or Bills.

The search results page allows the user to refine the search by using filters, according to his or her information requirements. Five filters are available: location, issuing authority, document type, date, and acronyms.

The detail page provides links to the official publication version of each document, and to other publications available in information systems of network participants, which, in this particular case, are: National Press, Presidency, Chamber of Deputies, and Federal Senate. General information about the document is available by clicking one of “Mais Detalhes (More details)” links, which directs the Web browser to the corresponding network participant’s metadata page. A service providing automatic identification of textual references can be activated by clicking the “Linker” label.

Semantic Interoperability

While systems interoperability and syntactic issues can be managed with the established standards of representation, codification, and exchange (XML, METS, Unicode, OAI-PMH, etc.), structural and semantic interoperability demands the adoption of a reference model that allows the integration of several models and the use of a unified terminology for indexing different sources of information. According to Patel et al., the general purpose of semantic interoperability is “to support complex and advanced context-sensitive query processing over heterogeneous information resources.” A lack of semantic interoperability thus produces the “information silos” problem, characterized by poor information integration and the consequent inability to process complex queries.

The next section presents the design choices made by the LexML Brazil Project to address issues related to semantic interoperability, using Ranganathan’s “stratification planes” classification system, which comprises an idea plane, a verbal plane, and a notational plane.

Idea Plane

The idea plane is composed of the abstract entities of a domain, independently of how they are named or identified.

The metadata standards that propose to address interoperability issues do so either for a specific, restricted domain or for heterogeneous domains. Specialized metadata standards (MARC, EAD, MODS, etc.) allow different sources of information about specific domains (bibliographical or archival information) to be integrated and searched in an advanced form. On the other hand, the Dublin Core standard is one of the few that try to integrate arbitrarily heterogeneous sources using a minimum set of elements and qualifiers. Its characteristic simplicity enables easy adoption by multiple actors, but also hinders query processing, preventing the use of the rich chain of relationships among entities. The lack of generality or expressiveness of these standards precludes their use for achieving semantic interoperability of heterogeneous sources of legislative and legal information in Brazil.

An alternative is to use formal ontologies instead of metadata standards. According to Martin Doerr, “recently, more and more projects and theoreticians support the use of formal ontologies as common conceptual schema for information integration.” One such ontology, the CIDOC CRM model, was designed to help the integration, mediation, and interchange of heterogeneous cultural heritage information. It was developed in 1994 and has since been approved as the ISO 21127:2006 standard. The CIDOC CRM model is then a natural choice for conceptual schemas of legal and legislative information, if one considers that the text corpus consisting of a nation’s sources of law is a part of the nation’s cultural heritage information.

However, the CIDOC CRM “document” concept lacks the necessary detail needed to describe the relationships among the several information abstraction levels: work, expression, manifestation, and item. That requirement is fulfilled by the FRBR_ER entity-relationship model, which was considered as a reference model in earlier phases of the project (“An Adaptation of the FRBR Model to Legal Norms,” João Lima, Proceedings of the V Legislative XML Workshop, Florence, 2005).

The FRBR_OO standard, an ontology created by a working group formed in 2003 by representatives of IFLA (International Federation of Library Associations and Institutions) and ICOM (International Council of Museums) for purposes of harmonizing both models, was adopted by the LexML project because it combines the advantages of both models while addressing their shortcomings. As such, FRBR_OO manifests a great affinity to the LexML domain (“A Time-aware Ontology for Legal Resources,” João Lima et al., Proceedings of the Tenth International ISKO Conference, 2008).

One of the great innovations of the CIDOC CRM model is the information structuring around temporal events, a central concept in the model. This contrasts with most other metadata models, which have resources as the central objects of interest. This innovative approach defines events as entities that connect actors, things (concrete and abstract), places, and time intervals.

This particular emphasis could be criticized on the ground that the user is generally interested in a specific resource, such as the text of a law. However, the result of a search for information about a law is much more relevant if it includes an organized list of events related to the resource, along with the resource itself.

The importance of choosing a suitable reference model is easily observable in the present discussion about what particular syntax to use to codify persistent identifiers — urn:lex, LegisLink, Akoma Ntoso, etc. Before reaching the syntax level, such discussions should focus first on the idea plane, where a greater potential for integration exists. A consensus reached at this level would allow great flexibility for the specification of diverse persistent identifier syntaxes.

Verbal Plane

The CIDOC CRM ontology separates the class of types and denominations from other classes. Multiple names, identifiers, and types can be attributed to all entities of the CRM, allowing any domain class to be classified by several taxonomies and be known by multiple names and identifiers.

This approach is used in LexML to represent different terms that identify the same concepts. Six classes form LexML’s uniform resource identifiers: place, authority, type of document, event, type of content, and language. To externalize the LexML vocabularies specification, we recommend, and use, the W3C SKOS (Simple Knowledge Organization System).

Notational Plane

The definition of uniform and persistent identifiers is fundamental for the creation and maintenance of an information chain. Identifiers are already part of the legal domain. For identification purposes, numbers are attributed to rulings, decisions, abridgments, and bills, allowing references by means of textual remissions. In the computational environment, the creation of persistent and uniform identifiers allows not only identification and reference, but also access to documents by means of textual hyperlinks.

Based on the experience of the Italian project Norme in Rete with respect to URN (Uniform Resource Name) identifiers, LexML defines a grammar for the construction of identifiers for legislative and legal documents in Brazil. As an example, the name “urn:lex:br:federal:lei:1993-06-21;8666” identifies, in a persistent and unique way, the “Federal Act No. 8666, of June 21, 1993.” If all information systems agree with respect to the identifiers, it is possible to share descriptive metadata, as well as information about semantic relationships, such as regulation, amendment, abrogation, etc.
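To illustrate how such a name decomposes, here is a minimal sketch that parses only the simple shape of the example above. The full LexML grammar is richer than this, and the field names (`place`, `authority`, `document_type`) are our reading of the vocabulary classes mentioned earlier, used here as an assumption.

```python
from datetime import date
from typing import NamedTuple

class LexUrn(NamedTuple):
    place: str
    authority: str
    document_type: str
    signed: date
    number: str

def parse_lex_urn(urn):
    """Parse a name of the form urn:lex:<place>:<authority>:<type>:<date>;<number>."""
    parts = urn.split(":")
    if len(parts) != 6 or parts[:2] != ["urn", "lex"]:
        raise ValueError(f"not a simple urn:lex name: {urn!r}")
    place, authority, document_type, descriptor = parts[2:]
    # The last component carries the date and the act number, separated by ";".
    date_text, _, number = descriptor.partition(";")
    return LexUrn(place, authority, document_type,
                  date.fromisoformat(date_text), number)

parsed = parse_lex_urn("urn:lex:br:federal:lei:1993-06-21;8666")
```

Because every component is recoverable from the name itself, two systems that agree on the identifier can exchange metadata about the same act without sharing a database.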

The Linker service, accessible through the LexML Portal (see, e.g., Act 11.705 without linker and Act 11.705 with linker), creates hyperlinks automatically through a dynamic textual analysis that identifies textual remissions of [i.e., citations to] normative documents. These hyperlinks can be used to navigate through textual remissions.
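The following is a deliberately simplified sketch of the idea behind such a service, not the Linker’s actual implementation. It recognizes only one citation shape (“Lei nº 8.666, de 21 de junho de 1993”), assumes every match is a federal law, and points links at a placeholder resolver address.

```python
import re

MONTHS = {name: index for index, name in enumerate(
    ["janeiro", "fevereiro", "março", "abril", "maio", "junho", "julho",
     "agosto", "setembro", "outubro", "novembro", "dezembro"], start=1)}

# One assumed citation shape; real citations vary far more than this.
CITATION_RE = re.compile(
    r"Lei nº (\d+(?:\.\d{3})*), de (\d{1,2})º? de (\w+) de (\d{4})")

RESOLVER = "https://resolver.example/urn/"  # hypothetical resolver address

def link_citations(text):
    """Wrap each recognized citation in a hyperlink to its urn:lex name."""
    def to_link(match):
        number, day, month, year = match.groups()
        month_number = MONTHS.get(month.lower())
        if month_number is None:
            return match.group(0)  # unknown month: leave the text untouched
        act_number = number.replace(".", "")
        urn = (f"urn:lex:br:federal:lei:"
               f"{year}-{month_number:02d}-{int(day):02d};{act_number}")
        return f'<a href="{RESOLVER}{urn}">{match.group(0)}</a>'
    return CITATION_RE.sub(to_link, text)

linked = link_citations(
    "Conforme a Lei nº 8.666, de 21 de junho de 1993, o contrato é válido.")
```

The citation text is preserved inside the link, so the document reads the same as before while each remission becomes navigable.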

Future Directions

LexML 1.0 consists of the Search Portal, the Resolution Service, the Persistent Identifier, and the Linker Service. The next version, LexML 2.0, will go further: it will involve the development of open source tools for managing the complete text of documents encoded according to the LexML Brazil XML Schema, which was derived from the schemas of the Akoma Ntoso Project.

The complete management of document texts in a structured form has been a goal of the project since its inception. As early as 2000, the Federal Constitution Portal was implemented following this idea. This portal allows the user to see all the versions of the constitutional text through a timeline, with the option to see the list of historical changes [see, e.g., art. 12] and with the ability to navigate bi-directional links [for example, in art. 154, click on the blue arrows].

During the development of that portal, taking into account the various forms of XML used to encode normative texts in many countries, and especially the experience of the Italian project Norme in Rete, we decided to make a unified portal and a persistent identifier the priorities of the LexML project. Presently, our efforts to build open source tools for management of document texts are being renewed. One of these tools, a LexML Document Editor, will enable the authoring of legal texts as if using a word processor, but producing a structured document at the end. Another tool is the Compiler, which will semi-automatically generate modified versions of documents that have been updated by other legal acts. The Consolidator will help to simplify the display of legal information, and users’ experience of the legal system, through the consolidation of several related normative acts into a single act. The Comparator will be used to display the differences between versions of a document. The last tool, the Publisher, will be used to render XML content in different formats, such as HTML, PDF, PDF/A, EPUB, etc., with the ability to choose different views of the same text, such as the original text, the updated text as of a specific date, etc.
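The Comparator is not specified in any detail here, but the basic operation it implies, showing what changed between two versions of a text, can be sketched with Python’s standard `difflib` module. The two versions below are invented placeholder text, and the function name is ours.

```python
import difflib

def compare_versions(old, new, old_label="original", new_label="updated"):
    """Return a unified diff of two versions of a text, one entry per line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))

# Invented placeholder text, not a real normative act.
original = "Art. 1. The fee is 10.\nArt. 2. This act enters into force on publication."
updated = "Art. 1. The fee is 12.\nArt. 2. This act enters into force on publication."

diff_lines = compare_versions(original, updated)
```

Lines prefixed with `-` appear only in the original and lines prefixed with `+` only in the update. A production tool would more likely compare the structured XML article by article than raw text lines.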

Last but not least, the Information Management Committee is responsible for setting the priorities and long-range planning of the LexML Brazil Project. The committee is a community of practice composed of librarians, archivists, and information analysts from several institutions across the three branches of Brazilian government, all with an interest in the management of legal and legislative information.

[Editor’s Note: For documentation, schemas, and controlled vocabularies respecting LexML Brazil, please see the LexML Brazil Project Website. For more information on these issues, please see the following VoxPopuLII posts: John Sheridan on Legislation.gov.uk, Ivan Mokanov on CanLII’s innovative legal citation system, Joe Carmel on LegisLink, and Robb Shecter on OregonLaws.org.]

The LexML Brazil core team, from left to right: João Lima (joaolima at senado.gov.br) is the leader of The LexML Project. His Information Science Ph.D. thesis details many of the concepts presented here; João Holanda (jholanda at senado.gov.br) holds a BSc in History from UnB; João Rafael (jrafael at senado.gov.br) holds a MSc in Computer Science from UFMG and a BSc in Computer Science from UnB; Marcos Fragomeni (fragomeni at senado.gov.br) holds a BSc in Computer Science from UnB.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.
