VoxPopuLII

The JUMAS Experience: Extracting Knowledge From Judicial Multimedia Digital Libraries

Annotation of legal texts, digital libraries, information retrieval, Legal knowledge management, natural language processing, Semantic annotation of legal texts, visualization

Nov 02, 2010

THE JUDICIAL CONTEXT: WHY INNOVATE?

The progressive deployment of information and communication technologies (ICT) in the courtroom (audio and video recording, document scanning, courtroom management systems), together with the requirement for paperless judicial folders pushed by e-justice plans (Council of the European Union, 2009), is quickly transforming the traditional judicial folder into an integrated multimedia folder, where documents, audio recordings and video recordings can be accessed, usually via a Web-based platform. This trend is leading to a continuous increase in the number and the volume of case-related digital judicial libraries, where the full content of each single hearing is available for online consultation. A typical trial folder contains: audio hearing recordings, audio/video hearing recordings, transcriptions of hearing recordings, hearing reports, and attached documents (scanned text documents, photos, evidence, etc.). The ICT container is typically a dedicated judicial content management system (court management system), usually physically separated and independent from the case management system used in the investigative phase, but interacting with it.

Most ICT deployment to date has focused on case management systems and ICT equipment in the courtrooms, with content management systems at different organisational levels (court or district). ICT deployment in the judiciary has reached different levels in the various EU countries, but the trend toward full e-justice is clearly in progress. Accessibility of judicial information, both of case registries (more widely deployed) and of case e-folders, has been strongly enhanced by state-of-the-art ICT technologies. Usability of the electronic judicial folders, however, is still limited by a traditional support toolset: information search is limited to text search; transcription of audio recordings (indispensable for text search) is still a slow and fully manual process; template filling is a manual activity; and so on. Part of the information available in the trial folder is not yet directly usable, but requires a time-consuming manual search. Information embedded in audio and video recordings, describing not only what was said in the courtroom, but also the specific trial context and the way in which it was said, still needs to be exploited. While the information is there, information extraction and semantically empowered judicial information retrieval still await proper exploitation tools. The growing amount of digital judicial information calls for the development of novel knowledge management techniques and their integration into case and court management systems. In this challenging context a novel case and court management system has recently been proposed.

The JUMAS project (JUdicial MAnagement by digital libraries Semantics) was started in February 2008, with the support of the Polish and Italian Ministries of Justice. JUMAS seeks to improve the usability of multimedia judicial folders (including transcriptions, information extraction, and semantic search) and to provide users with a powerful toolset able to fully address the knowledge embedded in the multimedia judicial folder.

The JUMAS project has several objectives:

(1) direct searching of audio and video sources without a verbatim transcription of the proceedings;
(2) exploitation of the hidden semantics in audiovisual digital libraries in order to facilitate search and retrieval, intelligent processing, and effective presentation of multimedia information;
(3) fusing information from multimodal sources in order to improve accuracy during the automatic transcription and the annotation phases;
(4) optimizing the document workflow to allow the analysis of (un)structured information for document search and evidence-based assessment; and
(5) supporting a large scale, scalable, and interoperable audio/video retrieval system.

JUMAS is currently under validation in the Court of Wroclaw (Poland) and in the Court of Naples (Italy).

THE DIMENSIONS OF THE PROBLEM

In order to explain the relevance of the JUMAS objectives, we report some volume data related to the judicial domain. Consider, for instance, the Italian context, where there are 167 courts, grouped in 29 districts, with about 1400 courtrooms. In a law court of medium size (10 courtrooms), during a single legal year, about 150 hearings per courtroom are held, with an average duration of 4 hours, for a total of about 6000 hours of hearings. Considering that in approximately 40% of them only audio is recorded, in 20% both audio and video are recorded, and the remaining 40% have no recording, the multimedia recording volume we are talking about is 2400 hours of audio and 1200 hours of audio/video per year. The sizing of the audio and audio/video documentation starts from the hypothesis that multimedia sources must be acquired at high quality in order to obtain good results in audio transcription and video annotation, which in turn affect the performance of the retrieval functionalities. Following these requirements, one can estimate a storage space of about 8.7 megabytes per minute (MB/min) for audio and 39 MB/min for audio/video. This means that during a legal year a court of medium size needs to allocate about 4 terabytes (TB) for audio and audio/video material. Under these hypotheses, the overall size generated by all the courts in the justice system, for Italy only, in one year is about 800 TB. This shows how the justice sector is a major contributor to the data deluge (The Economist, 2010).
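The per-court figures can be re-derived with a few lines of arithmetic. The sketch below is only a check of the numbers quoted above: it assumes that the 150 hearings are counted per courtroom (the reading under which the 2400 and 1200 hour totals follow) and that 1 TB equals 1,000,000 MB. The function and parameter names are illustrative and do not come from the JUMAS project.

```python
def yearly_volume(courtrooms=10, hearings_per_courtroom=150, hours_per_hearing=4,
                  audio_percent=40, video_percent=20,
                  audio_mb_per_min=8.7, video_mb_per_min=39.0):
    """Estimate recorded hours and storage for one medium-sized court per year."""
    total_hours = courtrooms * hearings_per_courtroom * hours_per_hearing
    audio_hours = total_hours * audio_percent / 100   # audio-only recordings
    video_hours = total_hours * video_percent / 100   # audio/video recordings
    # hours -> minutes -> MB -> TB (assumption: 1 TB = 1,000,000 MB)
    audio_tb = audio_hours * 60 * audio_mb_per_min / 1_000_000
    video_tb = video_hours * 60 * video_mb_per_min / 1_000_000
    return audio_hours, video_hours, audio_tb, video_tb


audio_h, video_h, audio_tb, video_tb = yearly_volume()
print(audio_h, video_h, round(audio_tb + video_tb, 2))
```

With the default values this gives roughly 1.25 TB of audio and 2.81 TB of audio/video, that is, about 4 TB per medium-sized court per year, in line with the estimate in the text.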

In order to manage such quantities of complex data, JUMAS aims to:

Optimize the workflow of information through search, consultation, and archiving procedures;
Introduce a higher degree of knowledge through the aggregation of different heterogeneous sources;
Speed up and improve decision processes by enabling discovery and exploitation of knowledge embedded in multimedia documents, in order to consequently reduce unnecessary costs;
Model audio-video proceedings in order to compare different instances; and
Allow traceability of proceedings during their evolution.

THE JUMAS SYSTEM

To achieve the above-mentioned goals, the JUMAS project has delivered the JUMAS system, whose main functionalities (depicted in Figure 1) are: automatic speech transcription, emotion recognition, human behaviour annotation, scene analysis, multimedia summarization, template-filling, and deception recognition.

Figure 1: Overview of the JUMAS functionalities

The architecture of JUMAS, depicted in Figure 2, is based on a set of key components: a central database, a user interface on a Web portal, a set of media analysis modules, and an orchestration module that allows the coordination of all system functionalities.

Figure 2: Overview of the JUMAS architecture

The media stream recorded in the courtroom includes both audio and video that are analyzed to extract semantic information used to populate the multimedia object database. The outputs of these processes are annotations: i.e., tags attached to media streams and stored in the database (Oracle 11g). The integration among modules is performed through a workflow engine and a module called JEX (JUMAS EXchange library). While the workflow engine is a service application that manages all the modules for audio and video analysis, JEX provides a set of services to upload and retrieve annotations to and from the JUMAS database.
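To make the idea of annotations as tags attached to media streams more concrete, here is a minimal in-memory sketch. It is not the JEX interface, which is not described in this post: the class names, fields, and the `upload`/`retrieve` methods are assumptions chosen only to mirror the upload and retrieve services mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    media_id: str    # recording the tag is attached to
    start_s: float   # start of the tagged segment, in seconds
    end_s: float     # end of the tagged segment, in seconds
    module: str      # producing module, e.g. "asr" or "behaviour" (hypothetical names)
    label: str       # the tag itself


class AnnotationStore:
    """In-memory stand-in for the central database (hypothetical API)."""

    def __init__(self):
        self._by_media = defaultdict(list)

    def upload(self, annotation):
        self._by_media[annotation.media_id].append(annotation)

    def retrieve(self, media_id, module=None):
        """Return annotations for one recording, ordered by start time."""
        hits = [a for a in self._by_media.get(media_id, [])
                if module is None or a.module == module]
        return sorted(hits, key=lambda a: a.start_s)


store = AnnotationStore()
store.upload(Annotation("hearing-001", 75.0, 80.0, "behaviour", "hand gesture"))
store.upload(Annotation("hearing-001", 12.0, 15.5, "asr", "transcript segment"))
for a in store.retrieve("hearing-001"):
    print(a.start_s, a.module, a.label)
```

Keeping the start and end times on every tag is what would let a user interface jump straight to the relevant part of a recording.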

JUMAS: THE ICT COMPONENTS

KNOWLEDGE EXTRACTION

Automatic Speech Transcription. For courtroom users, the primary sources of information are audio-recordings of hearings/proceedings. In light of this, JUMAS provides an Automatic Speech Recognition (ASR) system (Falavigna et al., 2009 and Rybach et al., 2009) trained on real judicial data coming from courtrooms. Two ASR systems have been developed so far: the first provided by Fondazione Bruno Kessler for the Italian language, and the second delivered by RWTH Aachen University for the Polish language. Currently, the ASR modules in the JUMAS system offer 61% accuracy over the generated automatic transcriptions, and represent the first contribution for populating the digital libraries with judicial trial information. In fact, the resulting transcriptions are the main information resource that is enriched by the other modules and then consulted by end users through the information retrieval system.

Emotion Recognition. Emotional states represent an aspect of knowledge embedded into courtroom media streams that may be used to enrich the content available in multimedia digital libraries. Enabling the end user to consult transcriptions together with the associated semantics is an important achievement, one that allows the end user to retrieve an enriched written sentence instead of a “flat” one. Even if there is an open ethical discussion about the usability of this kind of information, this achievement radically changes the consultation process: sentences can assume different meanings according to the affective state of the speaker. For this purpose an emotion recognition module (Archetti et al., 2008), developed by the Consorzio Milano Ricerche jointly with the University of Milano-Bicocca, is part of the JUMAS system. A set of real-world human emotions obtained from courtroom audio recordings has been gathered for training the underlying supervised learning model.

Human Behavior Annotation. A further fundamental information resource is related to the video stream. In addition to emotional states identification, the recognition of relevant events that characterize judicial proceedings can be valuable for end users. Relevant events occurring during proceedings trigger meaningful gestures, which emphasize and anchor the words of witnesses, and highlight that a relevant concept has been explained. For this reason, the human behavior recognition modules (Briassouli et al., 2009, Kovacs et al., 2009), developed by CERTH-ITI and by MTA SZTAKI Research Institute, have been included in the JUMAS system. The video analysis captures relevant events that occur during the course of a trial in order to create semantic annotations that can be retrieved by judicial end users. The annotations are mainly concerned with the events related to the witness: change of posture, change of witness, hand gestures, gestures indicating conflict or disagreement.

Deception Detection. Discriminating between truthful and deceptive assertions is one of the most important activities performed by judges, lawyers, and prosecutors. In order to support these individuals’ reasoning activities, namely corroborating or contradicting declarations (in the case of lawyers and prosecutors) and judging the accused (in the case of judges), a deception recognition module has been developed as a support tool. The deception detection module developed by the Heidelberg Institute for Theoretical Studies is based on the automatic classification of the sentences produced by the ASR systems (Ganter and Strube, 2009). In particular, in order to train the deception detection module, the output of the ASR module has been manually annotated with the help of the minutes of the transcribed sessions. The knowledge extracted for training the classification module deals with lies, contradictory statements, quotations, and expressions of vagueness.

Information Extraction. The current amount of unstructured textual data available in the judicial domain, especially related to transcriptions of proceedings, highlights the necessity of automatically extracting structured data from unstructured material, to facilitate efficient consultation processes. In order to address the problem of structuring data coming from the automatic speech transcription system, Consorzio Milano Ricerche has defined an environment that combines regular expressions, probabilistic models, and background information available in each court database system. Thanks to this functionality, the judicial actors can view each individual hearing as a structured summary, where the main information extracted consists of the names of the judge, lawyers, defendant, victim, and witnesses; the names of the subjects cited during a deposition; the date cited during a deposition; and data about the verdict.
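As a rough illustration of how regular expressions and background information from a court database can be combined to fill a hearing template, consider the sketch below. It is not the Consorzio Milano Ricerche environment: the probabilistic component is left out entirely, and the date pattern, the `known_people` mapping, and the sample names are assumptions made up for the example.

```python
import re

# Assumption: dates appear in transcripts as day/month/year with slashes.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def fill_template(transcript, known_people):
    """Build a small structured summary of one transcript.

    known_people maps a role (e.g. "judge") to a name taken from court
    records; this stands in for the background information in the database.
    """
    lowered = transcript.lower()
    return {
        "dates_cited": ["/".join(m.groups()) for m in DATE_RE.finditer(transcript)],
        "people_cited": {role: name for role, name in known_people.items()
                         if name.lower() in lowered},
    }


text = "The witness met Mario Rossi on 12/03/2007 and again on 4/5/2008."
print(fill_template(text, {"defendant": "Mario Rossi", "judge": "Anna Bianchi"}))
```

Matching names against a roster taken from court records, rather than trying to recognise arbitrary names, is one simple way in which background information can compensate for noisy automatic transcriptions.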

KNOWLEDGE MANAGEMENT

Information Retrieval. Currently, to retrieve audio/video materials acquired during a trial, the end user must manually consult all of the multimedia tracks. The identification of a particular position or segment of a multimedia stream, for purposes of looking at and/or listening to specific declarations, is possible either by remembering the time stamp when the events occurred, or by watching or hearing the whole recording. The amalgamation of automatic transcriptions, semantic annotations, and ontology representations allows us to build a flexible retrieval environment, based not only on simple textual queries, but also on broad and complex concepts. In order to define an integrated platform for cross-modal access to audio and video recordings and their automatic transcriptions, a retrieval module able to perform semantic multimedia indexing and retrieval has been developed by the Information Retrieval group at MTA SZTAKI (Darczy et al., 2009).

Ontology as Support to Information Retrieval. An ontology is a formal representation of the knowledge that characterizes a given domain, through a set of concepts and a set of relationships that obtain among them. In the judicial domain, an ontology represents a key element that supports the retrieval process performed by end users. Text-based retrieval functionalities are not sufficient for finding and consulting transcriptions (and other documents) related to a given trial. A first contribution of the ontology component developed by the University of Milano-Bicocca (CSAI Research Center) for the JUMAS system provides query expansion functionality. Query expansion aims at extending the original query specified by end users with additional related terms. The whole set of keywords is then automatically submitted to the retrieval engine. The main objective is to narrow the search focus or to increase recall.
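Query expansion can be sketched in a few lines. The example below is a toy version under stated assumptions: the ontology is reduced to a plain dictionary from a keyword to its related terms, and the terms themselves are invented for illustration. The actual JUMAS ontology component is not described at this level of detail in the post.

```python
def expand_query(query, related_terms):
    """Return the original keywords followed by related terms, without duplicates.

    related_terms is a dict mapping a keyword to a list of related terms;
    it stands in for the relationships stored in an ontology.
    """
    expanded = []
    for keyword in query.lower().split():
        for term in [keyword, *related_terms.get(keyword, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


toy_ontology = {"weapon": ["gun", "knife"], "gun": ["weapon", "pistol"]}
print(expand_query("stolen weapon", toy_ontology))
```

The expanded keyword list would then be submitted to the retrieval engine in place of the user's original query, which is how a search for a broad concept can also match transcriptions that only mention specific instances of it.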

User Generated Semantic Annotations. Judicial users usually manually tag some documents for purposes of highlighting (and then remembering) significant portions of the proceedings. An important functionality, developed by the European Media Laboratory and offered by the JUMAS system, relates to the possibility of digitally annotating relevant arguments discussed during a proceeding. In this context, the user-generated annotations may aid judicial users in future retrieval and reasoning processes. The user-generated annotations module included in the JUMAS system allows end users to assign free tags to multimedia content in order to organize the trials according to their personal preferences. It also enables judges, prosecutors, lawyers, and court clerks to work collaboratively on a trial; e.g., a prosecutor who is taking over a trial can build on the notes of his or her predecessor.

KNOWLEDGE VISUALIZATION

Hyper Proceeding Views. The user interface of JUMAS — developed by ESA Projekt and Consorzio Milano Ricerche — is a Web portal, in which the contents of the database are presented in different views. The basic view allows browsing of the trial archive, as in a typical court management system, to view general information (dates of hearings, name of people involved) and documents attached to each trial. JUMAS’s distinguishing features include the automatic creation of a summary of the trial, the presentation of user-generated annotations, and the Hyper Proceeding View: i.e., an advanced presentation of media contents and annotations that allows the user to perform queries on contents, and jump directly to relevant parts of media files.

Multimedia Summarization. Digital videos represent a fundamental information resource about the events that occur during a trial: such videos can be stored, organized, and retrieved in a short time and at low cost. However, considering the dimensions that a video resource can assume during the recording of a trial, judicial actors have specified several requirements for digital trial videos: fast navigation of the stream, efficient access to data within the stream, and effective representation of relevant contents. One possible solution to these requirements lies in multimedia summarization, which derives a synthetic representation of audio/video contents with a minimal loss of meaningful information. In order to address the problem of defining a short and meaningful representation of a proceeding, a multimedia summarization environment based on an unsupervised learning approach has been developed (Fersini et al., 2010) by Consorzio Milano Ricerche jointly with University of Milano-Bicocca.

CONCLUSION

The JUMAS project demonstrates the feasibility of enriching a court management system with an advanced toolset for extracting and using the knowledge embedded in a multimedia judicial folder. Automatic transcription, template filling, and semantic enrichment help judicial actors not only to save time, but also to enhance the quality of their judicial decisions and performance. These improvements are mainly due to the ability to search not only text, but also events that occur in the courtroom. The initial results of the JUMAS project indicate that automatic transcription and audio/video annotations can provide additional information in an affordable way.

Elisabetta Fersini has a post-doctoral research fellow position at the University of Milano-Bicocca. She received her PhD with a thesis on “Probabilistic Classification and Clustering using Relational Models.” Her research interest is mainly focused on (Relational) Machine Learning in several domains, including Justice, Web, Multimedia, and Bioinformatics.

VoxPopuLII is edited by Judith Pratt.

Editor-in-Chief is Robert Richards, to whom queries should be directed.

LexML Brazil Project

elegislation, elegislation systems, information retrieval, Legal identifiers, Legal metadata, Legal ontologies, Legal text processing, Legal XML, Legislative information systems, open source software, search

Oct 15, 2010

This post is divided into three topical sections. The first is an introduction to the LexML Brazil Project and its unified search portal; the second presents some aspects related to semantic interoperability; and the third describes the current work and future direction of the project.

Before going on to the aforementioned subjects, a few words about Brazil and its legislative and legal systems are necessary. Brazil is a country of continental proportions, composed of 27 federative units (26 states and the Federal District) and more than five thousand municipalities, or cities, as in Brazil no distinction is made between town and city. Because Brazil is a federation, each state and municipality has its own legislative chamber. While states and cities follow a unicameral system, the Federation itself has a bicameral system, with the National Congress divided into a Chamber of Deputies and the Federal Senate. These legislatures generate a great number of laws, or normative acts. The abundance of normative acts is very significant, considering that, in contrast with Common Law systems, Brazil’s legal system, based on the Civil Law, is characterized by the predominance of normative acts.

According to Edilenice Passos, “the proliferation of normative acts, of higher or lower hierarchy, eventually causes total chaos, for this big mass of juridical documents hampers the work of lawyers, of researchers, and of the very citizens, who are ruled by Brazilian laws.” Edilenice Passos also cites Arnoldo Wald, who, in 1969, was already alerting Brazilians that “the true legislative labyrinth created as a result of an inflation of statutes passed in recent years has turned the ruling Brazilian law into a patchwork, in which the mere legislative updating becomes a daily torture for a lawyer and a judge who are searching for the rules applicable to a specific subject, from among acts, supplementary acts, institutional acts, decree-laws, and other normative acts.”

Almost all Brazilian legal and legislative information is available through the Internet. However, this information is distributed among several thousand sites, each containing documents produced by a specific government institution. Thus, the relationships between acts of different institutions are not available explicitly, making it very hard to understand this “legal patchwork.”

Nowadays, much time is lost looking for this information, filtering the results of search engines. As Roy Tennant says, “Librarians like to search; everyone else likes to find,” and further adds: “People generally want to find everything they can on a topic, ranked by relevance and displayed in ways that make it easy to narrow in on their goal.”

Born to address these issues, LexML Brazil is an information network that aims to organize Brazil’s legislative and legal information. The project is an initiative of the “Comunidade TI Controle” (IT Control Community) and is being implemented by the Brazilian Federal Senate, through PRODASEN (the Senate’s special secretariat for information systems) and Interlegis (a virtual community of Brazilian legislatures).

LexML Brazil’s first product is the Legislative and Legal Information Portal, which opened on June 30, 2009, indexing 1.28 million documents. By September 2010, its index covered more than 1.5 million documents. By indexing the metadata collected from several institutions using the OAI-PMH protocol, the portal unifies access to a variety of legislative and legal information sources, which is a step toward the goal of guaranteeing Brazilians’ constitutional right of access to information.
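For readers unfamiliar with OAI-PMH, metadata harvesting boils down to issuing HTTP requests with a small set of standard verbs. The sketch below only builds the URL of a `ListRecords` request; it does not contact any server. The base URL is a placeholder, and the use of the `oai_dc` metadata prefix is an assumption (it is the format every OAI-PMH repository must support, but the post does not say which format the LexML network exchanges).

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # only records changed since this date
    if set_spec:
        params["set"] = set_spec     # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real LexML participant.
print(list_records_url("https://provider.example.org/oai", from_date="2010-09-01"))
```

A harvester would repeat such requests for each participating institution, following the resumption tokens defined by the protocol when a response is split into several pages, and index the returned metadata centrally.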

LexML Portal

The LexML Portal home page layout is very simple and is similar to Google’s main page. On this screen, it is possible to restrict the search to Legislation, Jurisprudence, or Bills.

The search results page allows the user to refine the search by using filters, according to his or her information requirements. Five filters are available: location, issuing authority, document type, date, and acronyms.

The detail page provides links to the official publication version of each document, and to other publications available in information systems of network participants, which, in this particular case, are: National Press, Presidency, Chamber of Deputies, and Federal Senate. General information about the document is available by clicking one of the “Mais Detalhes” (More details) links, which directs the Web browser to the corresponding network participant’s metadata page. A service providing automatic identification of textual references can be activated by clicking the “Linker” label.

Semantic Interoperability

While systems interoperability and syntactic issues can be managed with the established standards of representation, codification, and exchange (XML, METS, Unicode, OAI-PMH, etc.), structural and semantic interoperability demands the adoption of a reference model that allows the integration of several models and the use of a unified terminology for indexing different sources of information. According to Patel et al., the general purpose of semantic interoperability is “to support complex and advanced context-sensitive query processing over heterogeneous information resources.” Lack of semantic interoperability thus generates the “information silos” problem, characterized by the lack of information integration and consequent inability to process complex queries.

The next section presents the design choices made by the LexML Brazil Project to address issues related to semantic interoperability using Ranganathan’s “stratification planes” classification system, featuring: an idea plane, a verbal plane, and a notational plane.

Idea Plane

The idea plane is composed of the abstract entities of a domain, independently of how they are nominated or identified.

The metadata standards that propose to address interoperability issues do so either for a specific, restricted domain or for heterogeneous domains. Specialized metadata standards (MARC, EAD, MODS, etc.) allow different sources of information about specific domains (bibliographical or archival information) to be integrated and searched in an advanced form. On the other hand, the Dublin Core standard is one of the few that try to integrate arbitrarily heterogeneous sources using a minimum set of elements and qualifiers. Its characteristic simplicity enables easy adoption by multiple actors, but also hinders query processing, preventing the use of the rich chain of relationships among entities. The lack of generality or expressiveness of these standards precludes their use for achieving semantic interoperability of heterogeneous sources of legislative and legal information in Brazil.

An alternative is to use formal ontologies instead of metadata standards. According to Martin Doerr, “recently, more and more projects and theoreticians support the use of formal ontologies as common conceptual schema for information integration.” One such ontology, the CIDOC CRM model, was designed to help the integration, mediation, and interchange of heterogeneous cultural heritage information. It was developed in 1994 and has since been approved as the ISO 21127:2006 standard. The CIDOC CRM model is then a natural choice for conceptual schemas of legal and legislative information, if one considers that the text corpus consisting of a nation’s sources of law is a part of the nation’s cultural heritage information.

However, the CIDOC CRM “document” concept lacks the detail needed to describe the relationships among the several information abstraction levels: work, expression, manifestation, and item. That requirement is fulfilled by the FRBR_ER entity-relationship model, which was considered as a reference model in earlier phases of the project (“An Adaptation of the FRBR Model to Legal Norms,” João Lima, Proceedings of the V Legislative XML Workshop, Florence, 2005).

The FRBR_OO standard, an ontology created by a working group formed in 2003 by representatives of IFLA (International Federation of Library Associations and Institutions) and ICOM (International Council of Museums) for purposes of harmonizing both models, was adopted by the LexML project because it combines the advantages of both models while addressing their shortcomings. As such, FRBR_OO manifests a great affinity to the LexML domain (“A Time-aware Ontology for Legal Resources,” João Lima et al., Proceedings of the Tenth International ISKO Conference, 2008).

One of the great innovations of the CIDOC CRM model is the information structuring around temporal events, a central concept in the model. This contrasts with most other metadata models, which have resources as the central objects of interest. This innovative approach defines events as entities that connect actors, things (concrete and abstract), places, and time intervals.

This particular emphasis could be criticized on the ground that the user is generally interested in a specific resource, such as the text of a law. However, the result of a search for information about a law is much more relevant if it includes an organized list of events related to the resource, along with the resource itself.

The importance of choosing a suitable reference model is easily observable in the present discussion about what particular syntax to use to codify persistent identifiers — urn:lex, LegisLink, Akoma Ntoso, etc. Before reaching the syntax level, such discussions should focus first on the idea plane, where a greater potential for integration exists. A consensus reached at this level would allow great flexibility for the specification of diverse persistent identifier syntaxes.

Verbal Plane

The CIDOC CRM ontology separates the class of types and denominations from other classes. Multiple names, identifiers, and types can be attributed to all entities of the CRM, allowing any domain class to be classified by several taxonomies and be known by multiple names and identifiers.

This approach is used in LexML to represent different terms that identify the same concepts. Six classes form LexML’s uniform resource identifiers: place, authority, type of document, event, type of content, and language. To externalize the LexML vocabularies specification, we recommend, and use, the W3C SKOS (Simple Knowledge Organization System).

Notational Plane

The definition of uniform and persistent identifiers is fundamental for the creation and maintenance of an information chain. Identifiers are already part of the legal domain. For identification purposes, numbers are attributed to rulings, decisions, abridgments, and bills, allowing references by means of textual remissions. In the computational environment, the creation of persistent and uniform identifiers allows not only identification and reference, but also access to documents by means of textual hyperlinks.

Based on the experience of the Italian project Norme in Rete with respect to URN (Uniform Resource Name) identifiers, LexML defines a grammar for the construction of identifiers for legislative and legal documents in Brazil. As an example, the name “urn:lex:br:federal:lei:1993-06-21;8666” identifies, in a persistent and unique way, the “Federal Act No. 8666, of June 21, 1993.” If all information systems agree with respect to the identifiers, it is possible to share descriptive metadata, as well as information about semantic relationships, such as regulation, amendment, abrogation, etc.
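The example identifier can be taken apart mechanically. The sketch below handles only identifiers shaped like the one quoted above (five colon-separated parts after `urn`, with a date and a number joined by a semicolon); it is not an implementation of the full LexML grammar, and the field names are an interpretation made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LexUrn:
    country: str     # "br"
    authority: str   # "federal"
    doc_type: str    # "lei" (act)
    issued: date     # date of the act
    number: str      # number of the act


def parse_lex_urn(urn):
    """Parse identifiers of the form urn:lex:br:federal:lei:1993-06-21;8666."""
    parts = urn.split(":")
    if len(parts) != 6 or parts[:2] != ["urn", "lex"]:
        raise ValueError(f"unsupported identifier: {urn!r}")
    _, _, country, authority, doc_type, descriptor = parts
    issued_text, separator, number = descriptor.partition(";")
    if not separator:
        raise ValueError(f"missing ';' between date and number: {urn!r}")
    return LexUrn(country, authority, doc_type, date.fromisoformat(issued_text), number)


print(parse_lex_urn("urn:lex:br:federal:lei:1993-06-21;8666"))
```

Because every part of the name is derived from properties of the act itself (who issued it, what kind of document it is, when, and under which number), two independent systems can arrive at the same identifier without consulting each other.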

The Linker service, accessible through the LexML Portal (see, e.g., Act 11.705 without linker and Act 11.705 with linker), creates hyperlinks automatically through a dynamic textual analysis that identifies textual remissions of [i.e., citations to] normative documents. These hyperlinks can be used to navigate through textual remissions.
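A much simplified version of this idea can be written with a regular expression. The sketch below is not the Linker's actual algorithm: it recognises only one citation pattern (a Portuguese reference such as “Lei nº 8.666, de 21 de junho de 1993”), assumes that every such “Lei” is a federal act, and uses the URN itself as the link target, whereas a real service would presumably point at a resolver.

```python
import re

MONTHS = {"janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4, "maio": 5,
          "junho": 6, "julho": 7, "agosto": 8, "setembro": 9, "outubro": 10,
          "novembro": 11, "dezembro": 12}

# Assumed citation shape: "Lei nº <number>, de <day> de <month> de <year>".
CITATION_RE = re.compile(r"Lei n[ºo°]\s*([\d.]+), de (\d{1,2})º? de (\w+) de (\d{4})")


def link_citations(text):
    """Wrap each recognised citation in a hyperlink whose target is a urn:lex name."""
    def to_link(match):
        number, day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            return match.group(0)          # unknown month: leave the text untouched
        act_number = number.replace(".", "")
        urn = f"urn:lex:br:federal:lei:{year}-{month:02d}-{int(day):02d};{act_number}"
        return f'<a href="{urn}">{match.group(0)}</a>'
    return CITATION_RE.sub(to_link, text)


print(link_citations("Nos termos da Lei nº 8.666, de 21 de junho de 1993, fica..."))
```

The key design point is that the link target is computed from the citation text alone, which is only possible because the identifier grammar is built from the same elements (document type, date, number) that a textual citation contains.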

Future Directions

LexML 1.0 consists of the Search Portal, the Resolution Service, the Persistent Identifier, and the Linker Service. The next version, LexML 2.0, will go further: it will involve the development of open source tools for managing the complete text of documents encoded according to the LexML Brazil XML Schema, which was derived from the schemas of the Akoma Ntoso Project.

The complete management of document texts in a structured form has been a goal of the project since its inception. As early as 2000, the Federal Constitution Portal was implemented following this idea. This portal allows the user to see all the versions of the constitutional text through a timeline, with the option to see the list of historical changes [see, e.g., art. 12] and with the ability to navigate bi-directional links [for example, in art. 154, click on the blue arrows].

During the development of that portal, taking into account the various forms of XML used to encode normative texts in many countries, and especially the experience of the Italian project Norme in Rete, a decision was made to make a unified portal and a persistent identifier a priority of the LexML project. Presently, our efforts to build open source tools for management of document texts are being renewed. One of these tools, a LexML Document Editor, will enable the authoring of legal texts as if using a word processor, but producing a structured document at the end. Another tool is the Compiler, which will semi-automatically generate modified versions of documents that have been updated by other legal acts. The Consolidator will help to simplify the display of legal information — and users’ experience of the legal system — through the consolidation of several related normative acts into a single act. The Comparator will be used to display the differences between versions of a document. The last tool, the Publisher, will be used to render XML content in different formats, such as html, PDF, PDF-A, EPUB, etc., with the ability to choose different views of the same text, such as the original text, the updated text as of a specific date, etc.
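To give a feel for what a tool like the Comparator would display, here is a line-based comparison of two versions of a text using Python's standard `difflib` module. This is only an illustration: the planned LexML tools operate on structured XML documents, which this plain-text sketch ignores, and the function name and sample articles are invented.

```python
import difflib


def compare_versions(old_text, new_text):
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="original", tofile="updated", lineterm=""))


original = "Art. 1 First provision.\nArt. 2 Second provision."
updated = "Art. 1 First provision.\nArt. 2 Amended provision."
for line in compare_versions(original, updated):
    print(line)
```

A comparison that is aware of the document structure could report changes per article or paragraph rather than per line, which is one reason to keep the texts in a structured format.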

Last but not least, the Information Management Committee is responsible for setting the priorities and the long-range planning of the LexML Brazil Project. The Committee is a community of practice composed of librarians, archivists, and information analysts from several institutions of the three Brazilian governmental branches who are interested in the management of legal and legislative information.

[Editor’s Note: For documentation, schemas, and controlled vocabularies respecting LexML Brazil, please see the LexML Brazil Project Website. For more information on these issues, please see the following VoxPopuLII posts: John Sheridan on Legislation.gov.uk, Ivan Mokanov on CANLII’s innovative legal citation system, Joe Carmel on LegisLink, and Robb Shecter on OregonLaws.org.]

The LexML Brazil core team, from left to right: João Lima (joaolima at senado.gov.br) is the leader of The LexML Project. His Information Science Ph.D. thesis details many of the concepts presented here; João Holanda (jholanda at senado.gov.br) holds a BSc in History from UnB; João Rafael (jrafael at senado.gov.br) holds an MSc in Computer Science from UFMG and a BSc in Computer Science from UnB; Marcos Fragomeni (fragomeni at senado.gov.br) holds a BSc in Computer Science from UnB.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Teaching the Computer to Read Legal Text

Legal text processing, Legal XML, natural language processing

Oct 06 2010

In this post, I will describe how natural language processing can help in creating computer systems dealing with the law.

A lot of computer systems are being designed to help users deal with legal texts — accessing, understanding, or applying them. [Editor’s Note: Michael Poulshock’s Jureeka is an example of a system that automates the application of legal texts.] Other systems — such as DALOS — are about creating legal texts, providing support for the writers, or simulating the effects of a text. Such systems are based on something more than “just” the legal text: there is XML mark-up, an OWL ontology, or a representation of the rules in SWRL or some programming language. This means that any piece of legislation that you want to use on your computer system needs to be translated into this computer representation.

We try to support this translation using natural language processing, so that (part of) the translation can be done by a computer. This automation should have a number of advantages. First of all, computers are cheaper than human experts, and automating the process should reduce the amount of resources needed for this task. Second, the models that are produced by automated processes are more consistent; human experts may treat two similar sentences differently, but a computer program will always behave the same. Finally, an approach employing structures ensures that there is a clear mapping between the elements of the computer model and the original text.

Natural Language Processing isn’t perfect yet: computers cannot understand human language. However, legal text is quite structured, and offers a lot more handholds for automated translation than, say, a novel.

Document Structure

The first step that we will have to undertake is to determine the structure of the document. Online services like Legislation.gov.uk and wetten.nl can make it easier to access legal documents because they can point you to the right part of the document (such as a chapter, paragraph, sentence, etc.). In most law texts, the structure has been made explicit using clear headings, like: Chapter 1 or Chapter 1. General Provisions. So, in order to detect structure, we need to detect these headings. This means we’ll need to search the document for lines starting with Chapter, followed by some designation (which we refer to as an index), and perhaps followed by some text – say, the title of the chapter. The index can be a lot of things: Arabic numerals (1, 2, 3, …), Roman numerals (I, II, III, …) or letters (a, b, c, …). Sometimes the index is an ordinal appearing before the chapter label: First chapter. It may even be a combination of several numbers and letters (5.2a). This is not a great problem, as we can more or less assume that whatever follows the word Chapter is the index.

The main problem with this approach is that there are also regular sentences that start with the word Chapter, and we need to separate those out. To do so, we can use some heuristics: A title will not end with a full stop (.); a heading will always start on a new line; etc.
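To make the idea concrete, here is a minimal sketch in Python of heading detection with one of the heuristics mentioned above. The regular expression, the function name find_headings, and the sample lines are all assumptions made for illustration; they are not the patterns used in the actual system (which relies on a toolkit such as GATE), and ordinal headings like First chapter are not covered.

```python
import re

# Assumed heading pattern: the label "Chapter", an index (Arabic or Roman
# numerals, a single letter, or a combination such as 5.2a), and an
# optional title. This is an illustration, not the project's real pattern.
HEADING = re.compile(
    r"^(?P<label>Chapter)\s+"
    r"(?P<index>[0-9IVXLCDM]+(?:\.[0-9]+)*[a-z]?|[a-z])"
    r"(?:\.?\s+(?P<title>.+))?$"
)

def find_headings(text):
    """Return (line_number, index, title) for lines that look like headings."""
    headings = []
    # Working line by line enforces "a heading starts on a new line".
    for number, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line.strip())
        if match is None:
            continue
        title = match.group("title")
        # Heuristic from the text: a title does not end with a full stop,
        # so such a line is treated as a regular sentence.
        if title and title.endswith("."):
            continue
        headings.append((number, match.group("index"), title))
    return headings

sample = "\n".join([
    "Chapter 1. General Provisions",
    "Chapter 3 of this law applies to all vehicles.",
    "Chapter 5.2a Transitional rules",
])
for heading in find_headings(sample):
    print(heading)
```

The second sample line starts with the word Chapter but ends with a full stop, so the sketch treats it as an ordinary sentence rather than a heading.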

This procedure to find the headings for chapters is repeated to find headings for sections, subsections, etc. Also, some sections (like numbered paragraphs or list items) will not have a full heading, but just a number, which we also need to recognise. Finally, some sections don’t have a heading but can be recognised because they start with a fixed language pattern. For example, a preamble in a (recent) Dutch Law — such as this — will start with: We, Beatrix, Queen of the Netherlands, Princess of Orange-Nassau, etc. etc. etc.

This procedure assumes that the input for the process is just text. Many documents will contain more information — such as textual markup — and headings may be more easily identified because they are marked as bold text, or even as headings. So, in situations where the input is made up of documents that are marked-up in a consistent way, it may be easier to recognise the patterns by taking layout into account in addition to text.

To actually find the patterns, we can use existing toolkits like GATE. After the patterns have been found, and the structure has been recognised, we can store it using a format such as MetaLex.

References

The second step is to detect the references from a portion of a law text to other portions of that text, or from a law text to other texts. References, like headings, follow a pattern. The simplest patterns are rather similar to headings; the text chapter 13 is probably a reference, unless it is part of a heading. Just like headings, basic references consist of a label (section, chapter, article) and an index (13, 13.2.1, XIII, m). And, just as with headings, we can find the references by looking for these patterns in the text.

However, this is only the simplest form of references. Besides references to a specific section, such as chapter 13, there are of course also references to a complete law. Some of these references follow a pattern as well, such as the law of October 1st, 2007. Most laws are cited by means of a citation title, though, such as the Railroad Act. Such titles can contain all kinds of words, and they don’t follow a strict pattern. Thus, such references cannot be detected using patterns. Instead, we use a list containing all (citation) titles to detect such references.
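The two detection strategies described so far (a pattern for label-plus-index references, and a list lookup for citation titles) could be sketched roughly as follows. The labels, the index shapes, and the two citation titles are assumptions for illustration only; a real list would contain every citation title in the jurisdiction, and single-letter indices are left out here to limit false matches.

```python
import re

# Assumed labels and index shapes for simple references.
SIMPLE_REF = re.compile(
    r"\b(?:chapter|section|article)\s+"
    r"(?:[0-9]+(?:\.[0-9]+)*[a-z]?|[IVXLCDM]+)\b"
)

# Hypothetical list of citation titles. Longer titles are tried first so
# that a title which is a prefix of another does not win by accident.
CITATION_TITLES = ["Railroad Act", "Road Traffic Act"]
TITLE_REF = re.compile(
    "|".join(re.escape(t) for t in sorted(CITATION_TITLES, key=len, reverse=True))
)

def find_references(text):
    """Return (kind, matched_text) pairs for every reference found."""
    refs = [("part", m.group(0)) for m in SIMPLE_REF.finditer(text)]
    refs += [("law", m.group(0)) for m in TITLE_REF.finditer(text)]
    return refs

print(find_references("See chapter 13 and article 7.3b of the Railroad Act."))
```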

Other, more complex references contain multiple references in one statement, such as articles 13 and 14, or multiple levels: article 13, item e, of the Railroad Act, or even more complex combinations of the two: articles 13, item e, 14, item f, 15 and 16, items a and b, of the Railroad Act. Though more complex than the simple combination of label and index, these references still follow clear (sometimes recurring) patterns, and can be found in the text by searching for such patterns.

At the Leibniz Center for Law, we’ve created a parser based on these patterns, which had an accuracy of over 95%. For each reference found, we can construct some standardised name, and store it. With this technology, not only can we add hyperlinks to documents; we can also search for documents that refer to some specific document.
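As a rough illustration of how multi-valued and multi-level references can be expanded into standardised names, consider the sketch below. It is not the parser described above: the pattern only covers a few shapes (articles 13 and 14; article 13, item e, of the Railroad Act), the slash-separated name format is invented for this example, and the rule that a trailing item belongs to the last article only is an assumption.

```python
import re

# Assumed pattern: one or more article indices, an optional single item,
# and an optional citation title ending in "Act".
MULTI_REF = re.compile(
    r"\barticles?\s+"
    r"(?P<indices>[0-9]+[a-z]?(?:(?:,\s*|\s+and\s+)[0-9]+[a-z]?)*)"
    r"(?:,\s+item\s+(?P<item>[a-z]))?"
    r"(?:,?\s+of\s+the\s+(?P<law>[A-Z][A-Za-z ]*? Act))?"
)

def expand_reference(text):
    """Expand the first reference in text into one name per article."""
    match = MULTI_REF.search(text)
    if match is None:
        return []
    indices = re.findall(r"[0-9]+[a-z]?", match.group("indices"))
    names = []
    for position, index in enumerate(indices):
        parts = []
        if match.group("law"):
            parts.append(match.group("law"))
        parts.append("article " + index)
        # Assumption: a trailing "item x" refers to the last article only.
        if match.group("item") and position == len(indices) - 1:
            parts.append("item " + match.group("item"))
        names.append(" / ".join(parts))
    return names

print(expand_reference("articles 13, 14 and 15 of the Railroad Act"))
print(expand_reference("article 13, item e, of the Railroad Act"))
```

Names built this way can serve both as hyperlink targets and as search keys for finding every document that refers to a given provision.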

Classification

Now that we’ve got the structure and links in place, it’s time to start with the actual meaning of the text. Rather than tackling the entire text as a whole, we’ve selected sentences as the basic building blocks, and we attempt to create computer models for individual sentences first. Later, we can integrate those individual models into a complete model.

As a first step in creating the models, we start by assigning a broad meaning, or classification, to each sentence. Does the sentence give a definition for a concept, describe an obligation, or make a change in another law? In total, we distinguish fourteen different classes of sentences that appear in Dutch law texts. The next step in our automated approach is to assign a class to each sentence automatically.

To do so, we turn once again to language patterns. Legal language is rather strict, and legislative drafters don’t vary their language a lot — in a novel, variation may make for a more appealing text, but in a law, variation invites ambiguity. In fact, there are official Guidelines for Legislative Drafting that (among other things) reduce the variety of texts used. [Editor’s Note: For example, drafters of legislation in the U.S. House of Representatives Office of the Legislative Counsel have used Donald Hirsch’s Drafting Federal Law.] This means that for each of our classes, there’s a rather limited set of language patterns used. For example, definitions will look like one of these:

Under … is understood …

This law understands under … …

There are some variations in word order, but in the end, a small set of patterns is sufficient to describe all commonly used phrases. There is only one class of sentences where we cannot define a full set of patterns: obligations. In Dutch laws, obligations are often expressed without signal words like must or is obliged to. Instead, the obligations are presented as a fact:

No bodies are buried on a closed cemetery.

However, since the obligations are the only sentences lacking all-encompassing patterns, we will assume that any sentence that does not match a pattern is one of these obligations.

Based on the patterns found, we’ve created a classifier that attempts to sort sentences into these different classes. This classifier has an accuracy of 91%, and we expect that this can be improved a bit further.
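In outline, a pattern-based classifier with obligation as the fallback class might look like the sketch below. The three English patterns are stand-ins loosely based on the examples in this post; the real classifier works on Dutch text, distinguishes fourteen classes, and uses a larger pattern set, none of which is reproduced here.

```python
import re

# Illustrative (assumed) patterns, each paired with the class it signals.
PATTERNS = [
    ("definition", re.compile(r"^Under .+ is understood .+", re.IGNORECASE)),
    ("definition", re.compile(r"^This law understands under .+", re.IGNORECASE)),
    ("replacement", re.compile(r"\bis replaced by\b", re.IGNORECASE)),
]

def classify(sentence):
    """Return the class of the first matching pattern, else 'obligation'."""
    for label, pattern in PATTERNS:
        if pattern.search(sentence):
            return label
    # Obligations have no signal words of their own, so any sentence that
    # matches no pattern falls into this class.
    return "obligation"

print(classify("Under vehicle is understood any motorised means of transport."))
print(classify("No bodies are buried on a closed cemetery."))
```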

(As a side note: For classification tasks such as these, a machine learning approach is often preferred; see, e.g., here. With such an approach, you provide the computer, not with patterns, but with a bunch of sample sentences. The computer will then extract its own patterns from those sentences, and use these to classify any new sentences. We’ve tried this approach as well (using the toolkit WEKA), and reached similarly accurate results.)
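For readers who want to see what learning from sample sentences means in practice, here is a toy word-count (naive Bayes) classifier in plain Python. It is only a sketch of the general idea: it does not use WEKA, it is not necessarily the algorithm we used, and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count words per class from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for sentence, label in samples:
        class_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    return word_counts, class_counts

def predict(model, sentence):
    """Pick the class with the highest smoothed log-probability."""
    word_counts, class_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in sentence.lower().split():
            # Add-one (Laplace) smoothing keeps unseen words from zeroing out.
            score += math.log(
                (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training sentences, two per class.
model = train([
    ("Under vehicle is understood any motorised transport", "definition"),
    ("Under owner is understood the registered holder", "definition"),
    ("Article 3 is replaced by article 4", "replacement"),
    ("Section 2 is replaced by section 5", "replacement"),
])
print(predict(model, "Under road is understood any public way"))
```

With only a handful of samples the result is fragile; the point is merely that the distinguishing words are derived from the data rather than written down by hand.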

Modelling

Having classified the sentences, we now want to create models of the sentences. In essence, this means breaking down each sentence into smaller components and defining relationships between them. In some cases, the patterns used to classify the sentence already give us sufficient information to break up the sentence. Suppose we have a sentence like:

In article 7.12, sub one, second sentence, «article 7.3b» is replaced by: article 7.3c.

We classify this sentence as a replacement because of the text “is replaced by”. We can then also conclude that the text between angle quotes is the text to be replaced, the text following the colon is the replacing text, and the reference preceding it (which we’ve already detected) is the location where the replacement should take place.
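The same pattern that classifies the sentence can also pull out its components. The following sketch assumes the exact English phrasing of the example above; the pattern and the function name parse_replacement are illustrative, not taken from the actual system.

```python
import re

# Assumed pattern for the example sentence: a location, the old text
# between angle quotes, and the new text after the colon.
REPLACEMENT = re.compile(
    r"^In (?P<location>.+?), «(?P<old>[^»]+)» is replaced by: (?P<new>.+?)\.?$"
)

def parse_replacement(sentence):
    """Return the location, old text and new text, or None if no match."""
    match = REPLACEMENT.match(sentence)
    return match.groupdict() if match else None

sentence = (
    "In article 7.12, sub one, second sentence, "
    "«article 7.3b» is replaced by: article 7.3c."
)
print(parse_replacement(sentence))
```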

This works fine for sentences that are somehow “about” the law. But for sentences that deal with some other domain, such as taxes, traffic, or commerce, we cannot predict all the elements. These sentences could be about anything — and statutes are full of such sentences. For such sentences, we need to follow a generic method. The aim is to model rules as a situation or action that is allowed or not allowed, similar to the models created in the HARNESS system of the ESTRELLA project. For example, for an obligation, we assume that the sentence describes some action that must be done. We try to identify who should be doing the action, and what other elements are involved. Thus, for the sentence:

Our Minister issues a warrant to the negligent person.

we would like to extract the following information:

Obligation
Action: Issue
Agent: Our Minister
Patient: Warrant
Recipient: Negligent person

(Such a table, or frame, is not the same as a computer model, but has all the elements needed to create one.)

Now, identifying these different elements of the sentence (agent, patient, recipient) is something that computational linguists have already worked on for a long time, which means we do not have to start from scratch. Instead, we can use existing parsers to do much of the work for us. For our Dutch laws, we use the Alpino parser. Such a parser will create a parse tree of a sentence. In this parse tree, the sentence will be split up into parts. The parser can identify which part is the subject, the direct object, the indirect object, etc. Based on this information, we can determine the agent, patient, and recipient (so-called semantic roles). In a sentence with a verb in the active voice, the subject is the agent, the direct object is the patient, and the indirect object is the recipient. Furthermore, the parser will determine the relationship between words, such as an adjective that modifies a noun. This information, too, helps us to make more accurate models.
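The mapping from grammatical functions to semantic roles for an active-voice sentence can be written down in a few lines. In this sketch the input is a hand-made dictionary standing in for parser output; it is not Alpino's actual output format, and the key names are assumptions.

```python
def build_frame(sentence_class, parse):
    """Map grammatical functions to semantic roles (active voice only)."""
    if parse.get("voice") != "active":
        raise ValueError("this sketch only covers active-voice sentences")
    return {
        "type": sentence_class,
        "action": parse["verb"],
        "agent": parse.get("subject"),              # subject -> agent
        "patient": parse.get("direct_object"),      # direct object -> patient
        "recipient": parse.get("indirect_object"),  # indirect object -> recipient
    }

# Hypothetical, simplified parse of:
# "Our Minister issues a warrant to the negligent person."
parse = {
    "verb": "issue",
    "voice": "active",
    "subject": "Our Minister",
    "direct_object": "warrant",
    "indirect_object": "negligent person",
}
print(build_frame("obligation", parse))
```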

We start out with the output of these parsers, and then try to extract all terms that have some more significance. If we want an application to compute whether or not a situation is allowed, a word like “car” can be treated in a generic way, but terms like “allowed” and “not” need some special attention.

To Be Continued…

We still need to refine the method for making these models, and evaluate the results. After that, the individual models will need to be merged. But even as things stand now, we think these tools will help with getting legal text from paper into your computer systems.

[Editor’s Note: For more information about this topic, please see Dr. Adam Wyner’s post, Weaving the Legal Semantic Web with Natural Language Processing.]

Emile de Maat is a researcher at the Leibniz Center for Law (University of Amsterdam). His research focuses on the automatic extraction of metadata and meaning from legal sources.


Time to Turn the Page on Print Legal Information

authentication, digital law, Legal citation, Legal citations, Public access to legal information

Sep 15 2010

Question: Is there a good reason why judges should not be blogging their opinions?

Follow my thinking here.

I, like many librarians, love books. By that I mean I love physical books. I love the feel of paper in my hand. I love the smell of books. When I attended library school, there was no doubt in my mind that I would work in a place surrounded by shelf after shelf of beautiful books. I was confident that I would be able to transfer that love of books to a new generation.

That’s not how things turned out. Without recounting exactly how I got here, I should say that I am a technology librarian, and have been since even before I graduated library school. Technology is where I found my calling, and where libraries seem to need the most help. As I delve deeper into the world of library technology, particularly in the academic setting, I am increasingly forced to confront an uncomfortable reality: Print formats are inferior to electronic. And in some of my darker moments, I may even go so far as to echo the comments of Jeff Jarvis in his book “What Would Google Do” when he writes: “print sucks.”

On page 71, talking about the burden of physical “stuff,” Jarvis writes:

“It’s expensive to produce content for print, expensive to manufacture, and expensive to deliver. Print limits your space and your ability to give readers all they want. It restricts your timing and the ability to keep readers up-to-the-minute. Print is already stale when it’s fresh. It is one-size-fits-all and can’t be adapted to the needs of each customer. It comes with no ability to click for more. It can’t be searched or forwarded. It has no archive. It kills trees. It uses energy. And you really should recycle it, though that’s just a pain. Print sucks. Stuff sucks.”

In this paragraph, Jarvis may as well have been talking about the current state of online legal information. Although we may not have figured out the magic bullets of authenticity and preservation, the fact remains that print is a burden. In many cases, it is a burden to our governments, and our libraries.

There are good reasons to proceed cautiously towards online legal information. However, the most significant barriers to accepting new modes of publishing official legal information online, like judges’ blogging opinions, may be cultural and political. In the end, law librarians and other legal professionals can’t allow our own nostalgia and habit to stand in the way of changes that can, should, and must happen.

AALL Working Groups

As many readers may know, the American Association of Law Libraries (AALL) began forming state working groups earlier this year. The purpose of those working groups was to “help AALL ensure access to electronic legal information in your state.” This is certainly a worthwhile goal, and one I obviously support. But the PDF document online, calling for formation of these working groups, sends a mixed message.

The very first duty of each working group is to “take action to oppose any plan in your state to eliminate an official print legal resource in favor of online-only unless the electronic version is digitally authenticated and will be preserved for permanent public access, or to charge fees to access legal information electronically. This is an increasingly common problem as states respond to severe budget cuts.”

Perhaps it’s just the phrasing of the document that bothered me. Rather than even providing guidance to states planning to eliminate print legal resources, AALL has set as its default position the opposition to any such plan.

In fairness, I note that the document hints that online-only legal resources might be acceptable if states don’t charge for them, or if such resources meet the rather complex standards laid out in the Association of Reporters of Judicial Decisions’ Statement of Principles.

The Association of Reporters of Judicial Decisions (ARJD) published Statement of Principles: “Official” On-Line Documents in February 2007, revised in May 2008. Most tellingly, in Principle 3 of the Statement they write: “Print publication, because of its reliability, is the preferred medium for government documents at present.”

Later in the document we find out why print is so reliable. Talking about electronic versions, the ARJD says they should not be considered official unless they are “permanent in that they are impervious to corruption by natural disaster, technological obsolescence, and similar factors and their digitized form can be readily translated into each successive electronic medium used to publish them.”

Without question, electronic material must be able to survive a natural disaster. The practice of storing information on a single server or keeping all backups in the same facility could be problematic. But emerging trends and best practices could help safeguard against these problems. In addition, programs like LOCKSS (Lots of Copies Keep Stuff Safe) can help alleviate some of these concerns by making sure many copies of each digital item exist at multiple geographic locations.

Also, digital format obsolescence has largely been overstated. PDF documents are not going anywhere anytime soon. Even conservative estimates establish PDF as a reliable format for the foreseeable future.

HTML may be no different. Consider that the very first Web document, Links and Anchors, is almost valid HTML5. Nearly 20 years later, that document is compatible with modern Web browsers.

On the other side of the equation, is print impervious to natural disaster, or even technological obsolescence? Of course not. At Yale, with our rare books library and large historical collection, I have witnessed first hand the damage time can do to a physical book. Even more importantly, books in the last hundred years have been published so cheaply they may fall apart even sooner than books published centuries ago.

Print and Electronic Costs

The reality is that moving to online-only legal information is a good thing for everyone involved in producing and consuming such information. The burden of print is not limited to the costs forced upon states that produce it; that burden is also borne by libraries and citizens who consume it.

As mentioned above respecting the AALL working group document, many states are already looking at going online-only to cut costs, and why shouldn’t they? With current budget situations across the country being what they are, printing costs being particularly high, and electronic publishing costs being so low, of course states are looking at saving money by ending needless printing.

But libraries would also benefit from the cost savings of governments’ moving to electronic formats. Not only do libraries currently have to subsidize printing costs by paying for the “official” print copies of legal materials; libraries also have to pay for the shelf space, as well as manpower to process incoming material and place it on the shelf, and may also have to pay additional costs for preserving the physical material. Not to mention the fact that we may pay for additional services that furnish access to the exact same material in an electronic format.

The costs involved in dealing with print legal resources are well known to most librarians. So why aren’t we clamoring for governments to publish online-only legal information?

Officialness, Authenticity, Preservation, and Citeability

Of course there are genuine concerns about online-only legal information. The big sticking points seem to be (in no particular order) officialness, authenticity, preservation, and citeability. Each issue is worthy of, and has been the subject of, much discussion.

Officialness may be in some ways both the easiest and the most difficult hurdle for online-only legal information to leap. To make an online version of legal material official, an appropriate authoritative body need only declare that version “official.” The task seems simple enough.

The more difficult part may be political. With organizations like AALL and ARJD currently opposing online-only options, that action may be politically difficult. Persuading lawyers, judges, and legislatures to approve such a declaration could be even more difficult. Can you imagine a bill, regulation, or some other action making a blog the “official” outlet for a particular court’s opinions?

The question of authenticity is more difficult to deal with from a technological perspective, although there has been interesting work done with respect to PDFs, electronic signatures, and public and private keys. The Government Printing Office (GPO) has done a great job leading the way in the area of authenticity: http://www.gpoaccess.gov/authentication/. The new Legislation.gov.uk site unveiled recently has taken a different approach from the GPO’s. As John Sheridan has written in an earlier post, at the moment The U.K. National Archives are not taking any steps towards authenticating the information on the Legislation.gov.uk site, but they recognize the need to address the issue at some point. John Joergensen at Rutgers-Camden has taken yet another approach. And Claire Germain, in a recent paper about authentication practices respecting international legal information (pdf), states that those practices vary throughout the world. Thus the prickly question of authenticating online legal information is an issue that’s not going away any time soon.

AALL and ARJD have made a big deal about preservation of online legal information, an issue that’s important for librarians, too. Unfortunately, this is another area where no good answer exists to guide us. As Sarah Rhodes wrote earlier this year, “our current digital preservation strategies and systems are imperfect – and they most likely will never be perfected.”

The Library of Congress National Digital Information Infrastructure & Preservation Program (NDIIPP) has some helpful resources. The Legal Information Preservation Alliance (LIPA) also provides some good guidance in this area. However, many librarians are still reluctant to accept that digital preservation practices may enable us to end our reliance on print.

A similar reluctance can be seen in resistance to the Durham Statement, which — though directed at law reviews — also says something about other kinds of online legal information. Most notably, Margaret Leary of the University of Michigan chose not to sign the Durham Statement, and discussed her decision to continue to rely on print at a recent AALL program. In a listserv posting quoted in Richard Danner’s recent paper, Ms. Leary asserted: “I do not agree with the call to stop publishing in print, nor do I think we have now or will have in the foreseeable future the requisite ‘stable, open, digital formats’.” Similarly, Richard Leiter explains that he signed the Durham Statement with an asterisk because of the statement’s call for an end to the printing of law reviews.

What constitutes ‘stable, open, digital formats’ for the purposes of satisfying some librarians is unclear. As I mentioned earlier, a number of digital formats currently fit this description. This makes me think that there’s something else going on here, a resistance to abandoning print for other reasons.

Citeability also becomes an issue as print legal information disappears. If there is no print reporter volume in which an opinion is issued, then how would one cite to an opinion (setting aside for a moment Lexis and Westlaw citations)?

However, efforts towards implementing “medium-neutral legal citation formats” have already been made. According to Ivan Mokanov’s recent VoxPopuLII post, most citations in Canada are of a neutral format. In the United States, LegisLink.org has made an effort to improve online citations, as Joe Carmel describes in his recent post. Work on URN:LEX and other standards has resulted in some progress towards dealing with the citeability issue. Organizations like the AALL Electronic Legal Information Access & Citation Committee also deserve credit for taking this on. [Editor’s Note: Those organizations have produced universal citation standards — such as the AALL Universal Citation Guide — which have been adopted by a number of U.S. jurisdictions.] Even The Bluebook supports alternative citation formats. For example, rule 10.3.3, “Public Domain Format,” specifies how to cite to a public domain or “medium-neutral format.” The Bluebook even goes so far as to allow citation in a jurisdiction’s specified format.

But despite all this work, nothing has yet stuck.

The Next Step

One thing you’ll notice respecting all of these issues is that they are currently unsettled. While AALL and ARJD have both suggested that they would look favorably on online-only legal information if it were official, authenticated, and preserved (they do not mention citeability), there is no indication of when we will reach a level of achievement on these issues that would be satisfactory to these organizations. Can governments, libraries, and citizens afford to wait?

Asking states to continue to bear the burden of publishing material in print as they run out of funding, and libraries to bear the expense of preserving that print, is irresponsible. While we might not have all of the answers now, we certainly have enough to move forward in an intelligent manner.

The National Conference of Commissioners on Uniform State Laws (NCCUSL) has been working on an Authentication and Preservation of State Electronic Legal Materials Act. [Editor’s Note: The Chair of the Act’s Drafting Committee is Michele L. Timmons, the Revisor of Statutes for the State of Minnesota, and its Reporter is Professor Barbara Bintliff of the University of Texas School of Law.] According to the Study Committee’s Report and Recommendations for the Act’s Drafting Committee, the goal of the draft should be to “describ[e] minimum standards for the authentication and preservation of online state legal materials.” This seems like an appropriate place to start.

Rather than setting unrealistic or vague expectations, the minimum standards provided by the draft act seem to allow some flexibility for how states could address some of these issues. As opposed to working towards a “stable and open digital format,” which seems more a moving target than an attainable goal, the draft act sets forth an outline for how states can get started with publishing official and authentic online-only legal information. While far from finished, the draft act appears to be a step in the right direction.

What Is the Real Issue?

I think the real sticking point on this matter is mental or emotional. It comes from an uneasiness about how to deal with new methods of publishing legal information. For hundreds of years, legal information has been based in print. Even information available on the Lexis and Westlaw online services has its roots in print, if not full print versions of the same material. It’s as if the lack of a print or print-like version will cause librarians to lose the compass that helps us navigate the complex legal information landscape.

Of course, publishing legal information electronically brings its own challenges and costs for libraries. Electronic memory and space are not free, and setting up the IT infrastructure to consume, make available, and preserve digital materials can be costly. But in the long run, dealing with electronic material can and will be much easier and less costly for all involved, as well as giving greater access to legal information to the citizens who need it.

So Judges Blogging?

Question: Is there a good reason why judges should not be blogging their opinions?

Although he was the co-chair of the ARJD committee that produced the Statement of Principles, even Frank Wagner, the outgoing U.S. Supreme Court reporter of decisions, acknowledges that “budgetary constraints may eventually force most governmental units to abandon the printed word in favor of publishing their official materials exclusively online.” He also recognizes that the GPO’s work in this area may put an end to the printed U.S. Reports sooner than other “official publications.”

So if an appropriate authority were to make them official, some form of authentication were decided on, and methods of preservation and citation were taken into account, would you feel comfortable with judges’ blogging their opinions?

We have to get over our unease with new formats for publishing online legal information. We have to stop handcuffing governments and libraries by placing unrealistic and unattainable expectations on them for publishing online legal information. We have to prepare ourselves for a world where online is the only outlet for official legal information.

I still enjoy taking a book off the shelf and reading. I enjoy flipping through and browsing the pages. But nostalgia and habit are not valid strategies for libraries of the future.

Jason Eiseman is the Librarian for Emerging Technologies at Yale Law School. He has experience in academic and law firm libraries working with intranets, websites, and technology training.


Electronic Voting and Direct Democracy

Legislative information systems

Sep 01 2010

In this post, I’d like to connect a specific area of my expertise—electronic voting (e-voting)—to issues of interest to the legal information community. Namely, I’ll talk about how new computerized methods of voting might affect elements of direct democracy: that is, ballot questions, including referenda and recall. Since some readers may be unfamiliar with issues related to electronic voting, I’ll spend the first two parts of this post giving some background on electronic voting and internet voting. I’ll then discuss how ballot questions change the calculus of e-voting in subtle ways.

Background on E-voting

Images from 2000 of officials closely scrutinizing punchcard ballots during the U.S. presidential election tend to give the mistaken impression that if we could just fix the outdated technology we used to cast ballots, a similar dispute wouldn’t happen again. However, elections are about “people, processes, and technology”; focusing on just one of those elements disregards the fact that elections are complex systems. Since 2000, the system of election administration in the United States has seen massive reform, with a lot of attention paid to issues of voting technology.

In the years after 2000, this system that had mostly “just worked” in previous decades came to be seen as having endemic, fundamental problems. At the turn of the 20th century, frauds involving ballot box-stuffing, vote-buying, and coercion were the major policy concern and the principal focus of reform. In contrast, at the turn of the 21st century, the prevalence of close, contentious contests (see, e.g., this example of an analysis of New Jersey elections) often put the winning margin well within the “error” or “noise” level associated with ballot casting methods.

In 2002, Congress passed the Help America Vote Act (HAVA), which provided the first federal funding for election administration, created the Election Assistance Commission (EAC) and established the first federal requirements for voting systems, provisional balloting, and statewide voter registration databases. As my colleague Aaron Burstein and I argue in an article currently in preparation, in terms of advancing the state-of-the-art in voting technology, HAVA conspicuously focused on providing funds that had to be spent quickly on types of voting systems that were then available on the market or soon would be available. The systems on the market at the time were invariably of a specific type: “Direct Recording Electronic” (DRE) voting machines, in which the record of a voter’s vote is kept entirely in digital form.

In the years since the passage of HAVA, computer science, usability, and information systems researchers have highlighted a number of shortcomings with this species of voting equipment. Three principal critiques voiced by this community are:

There is no proper way to do a recount on these systems. That is, if a race is close and a candidate calls for a recount, in most cases this will mean simply rerunning the software that added up all the digital votes; the exact same number would result. DREs do not keep a record that captures the voter’s intent; rather, these systems “collapse” voter intent into a digital representation kept in digital memory. In other types of systems, such as optical scan systems—where voters fill in bubbles on paper ballots which are then scanned in for counting—the voter’s marks are directly preserved with the ballot. In a traditional recount with non-DRE systems, election staffers interpret these marks made by voters and come up with a count based on how a trained human would interpret ballots. This is not possible with DRE voting systems and lever machines, which do not preserve individual records of voter intent.

There is no way to know if the software that runs DREs is correctly recording votes, and we’ve seen numerous cases of software errors, including errors that have resulted in lost ballots. However, the addition of a “voter-verified paper record” (VVPR)—that is, an independent record that the voter can verify before casting his or her vote—alleviates not only this problem of recounting records that show voter intent, but also the myriad of problems associated with software flaws and “malware” (malicious software) in these machines. If voters check these records and agree that the records reflect how they want to vote, this renders the paper records “independent” of the system’s software, and the records can safely be audited and/or recounted if there do turn out to be software-based problems.

In a number of state-level technical reviews of voting systems, of which I have been a part in California and Ohio, we have found serious vulnerabilities in each voting system we examined. These findings leave little confidence in the equipment that was purchased by election officials in the wake of the 2000 election. Moreover, this was a clear indication that the systems for certifying this equipment at the state and federal level had serious shortcomings that have allowed sub-standard systems into the field.

Now, in 2010, many states have passed laws requiring auditable voting systems, and increasing numbers of election officials are moving from DRE-based systems to optical scan systems. Despite these reforms which have, in my opinion, moved e-voting in the right direction, the specter of internet voting looms large.

Internet Voting

During public talks I am often asked, “When will we vote over the internet?” People have an intuitive feeling that since they’re doing so much online, it makes sense to vote online, too. However, we need to recognize what kinds of activities the internet is good for, and voting is perhaps the last thing we want to happen online.

Things that we do online now that require high security, such as banking, are not anonymous processes; there is a named record associated with each transaction. Yet the secret ballot is a very important means of keeping coercion and vote-buying from corrupting the vote. (See this superb article by Allison Hayward: “Bentham & Ballots: Tradeoffs between Secrecy and Accountability in How We Vote”.)

Moreover, banks and other online establishments can purchase insurance to contain the risk of losses due to online fraud (although there are some indications that even this is becoming more difficult due to the increased sophistication and magnitude of online banking fraud). But there is still no firm that offers comparable insurance for elections against computer intrusions and attacks, or simply just errors, because it is very difficult to estimate the magnitude and likelihood of such losses. The “value” of a vote is very different from the value of currency: the value of your vote doesn’t just matter to you as a voter; it also matters to other voters. (“Vote dilution,” for example, is when processes conspire to render one voter’s vote more or less effective than another’s.) Also, it can be very hard to estimate the fitness of a given piece of software; said another way, we haven’t yet figured out how to write impervious or bug-free software.

Finally, as I mention above, the voting systems that the market has responded with in recent years leave a lot to be desired in terms of security, usability, and reliability. Internet voting essentially takes systems like those and adds the complications of sending voted electronic ballots over the public internet from users’ personal computers (neither of which is reliable or secure) with no VVPR.

We are far from the day in which highly secure processes can happen over the public internet from users’ computing devices. We will have to make significant technical advances in the security of personal computing devices and in network security before we can be sure that internet votes can be cast in a manner that approaches the privacy and security afforded by polling place voting.

Unfortunately, most designs for internet voting systems are un-auditable. Since these systems lack a paper trail, it is impossible to tell whether the voted ballot contents received at election headquarters correspond with what the voter intended to vote. The answer here would seem to be cryptographic voting systems, where the role of a paper trail is played by cryptographically secure records that can be transmitted over the network. Systems of this type have become increasingly sophisticated, easy to use, and easy to understand, and have even been used in a binding municipal election here in the U.S.

E-voting and Direct Democracy

Elections don’t just elect people in the U.S.; in many states, voters vote on elements of direct democracy, specifically ballot referenda and recall questions. However, we should be even more concerned about opportunities to game these kinds of contests — and, equivalently, about how errors introduced by ballot casting methods for ballot questions could affect how we govern — than we are about the risks of voting fraud in candidate races.

It’s difficult to compare the importance of candidate elections to that of ballot questions. Certainly, ballot questions can be as simple as asking the voters to approve of city ordinances, such as increasing the amount of square footage for single-family homes. And, of course, in years divisible by four, we elect the President of the United States, which unequivocally changes how our entire country is governed and operates. In between these two extremes are elections that many people don’t vote on, from judicial elections to highly contentious ballot propositions (like Proposition 8 in California), or transportation tax bonds that can result in hundreds of millions of dollars for local firms.

Can we compare the risks involved with candidate elections and ballot questions? In some sense, being able to bound the risk of fraud or error causing the election of the wrong candidate is similar to that resulting in “electing” the wrong decision in a ballot question; it’s equally difficult to compare the relative importance of elected contests and to decide on some level of likelihood that a contest runs a high risk of being targeted for attack or might be especially sensitive to errors in the count. Polling may help, but it’s far from perfect. However, ballot questions have one aspect that should make this process a bit easier: rather than having the considerable uncertainty of what policies a potential candidate may institute once elected, ballot measures are concrete policy proposals or actions where we know very well what will happen if they are passed. This would seem to make ballot questions more attractive to attack; the uncertainty involved with what candidates may do is not present, so the net benefit of a successful attack, all other things being equal, should be larger.

Are there special risks involved with ballot questions that we should be concerned about in the face of electronic voting methods? Certainly. First, ballot propositions are invariably at the end of the ballot; hence, they’re referred to as “down-ticket” contests. Post-election auditing, where a subset of ballot records is hand-counted as a check against the electronic results, often doesn’t include ballot questions. To be sure, states like California require post-election auditing of all contests on the ballot. But there are many states that do not do comprehensive election auditing; they either don’t do any auditing at all or focus their auditing attention on top-ticket contests on the ballot (for more, see Sections 1 and 2 of: “Implementing Risk-Limiting Post-Election Audits in California”).
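
To make the idea of a post-election audit concrete, here is a minimal sketch in Python. It is not a risk-limiting audit and not any state's actual procedure; the precinct names, contest names, and vote totals are invented for illustration. It only shows the basic check: pick a random subset of precincts, hand-count every contest on the ballot (including down-ticket questions), and flag any mismatch with the electronic results.

```python
import random


def pick_audit_sample(precincts, fraction, seed):
    """Randomly choose a share of precincts to hand-count (hypothetical rule)."""
    rng = random.Random(seed)  # a published seed makes the draw reproducible
    k = max(1, round(len(precincts) * fraction))
    return sorted(rng.sample(sorted(precincts), k))


def audit_discrepancies(electronic, hand_counts):
    """Return (key, electronic, hand) for every audited total that differs."""
    problems = []
    for key, hand in hand_counts.items():
        machine = electronic.get(key)
        if machine != hand:
            problems.append((key, machine, hand))
    return problems


# Invented totals, keyed by (precinct, contest). "Measure A" stands in for
# a down-ticket ballot question.
electronic = {("P1", "President"): 100, ("P1", "Measure A"): 90}
hand_counts = {("P1", "President"): 100, ("P1", "Measure A"): 88}

sample = pick_audit_sample(["P1", "P2", "P3", "P4"], fraction=0.5, seed=1)
flagged = audit_discrepancies(electronic, hand_counts)
print(sample, flagged)
```

An audit that covered only the top-ticket contest would never compare the "Measure A" totals, which is exactly the gap described above.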

While we have seen little evidence of fraud using newer computerized voting systems compared to the massive record of paper ballot fraud in our country’s past, this should serve as little comfort. As in finance, where “past results are no indication of future performance,” so it is in adversarial security. That we haven’t seen much evidence of computer fraud involving voting systems doesn’t mean it isn’t happening and doesn’t mean it can’t happen. Multi-million dollar ballot questions and constitutional amendments are exactly the kinds of law-making activities in which I expect to see the first evidence of outright computerized election hacking. This rings especially true if we start using the public internet for casting ballots. While foreign interests or hackers out of the reach of US law enforcement might certainly be interested in top-ticket candidate contests, the opportunities to affect state and local law as well as economic interests embodied in ballot questions would seem to be especially attractive.

Where Should We Go From Here?

To be sure, there is a lot of momentum behind moving parts of our elections processes online. In some cases, such as online voter registration, the security and reliability risks are small and the net benefits are particularly high. However, I can’t say the same about internet voting, especially in the sense that elements of direct democracy may be particularly attractive to powerful foreign interests and parties outside our collective jurisdiction. The recently passed Military and Overseas Voter Empowerment (MOVE) Act has been interpreted to allow states to experiment with online ballot casting, and the relevant agencies charged with implementing the law, namely the Department of Defense’s Federal Voting Assistance Program (FVAP), the EAC, and the National Institute of Standards and Technology (NIST), have collectively interpreted the MOVE Act as requiring them to institute standards and pilot programs for internet voting for military and overseas voters. I’m on record as disagreeing with this interpretation, but I can understand that they feel limited-scale pilot projects are appropriate. I predict that the first incontrovertible evidence of computerized vote manipulation will be associated with military and overseas internet voting efforts, and it’s not hard to imagine a down-ticket ballot question as being the focus of such an attack.

Should we re-think our forays into computerized voting? Definitely not. In my opinion, this is more a question of responsible uses of technology in elections than a black or white decision about using computerized voting systems or not. There is much good that stems from the use of computerized voting systems, including improved accessibility for the disabled and voters who don’t speak English, improved usability of ballots on-screen versus what can be accomplished on paper, and the speed and accuracy of computerized vote counts on election night. However, these voting systems must be recountable and auditable, and those audits must be conducted after each election in such a way that we limit the risk of an incorrect candidate or ballot measure being certified as the winner.

In contrast to the beginning of the past decade, when election officials were swimming in federal money for the purchase of equipment and trying to spend these funds before a looming deadline, what we really need is regular commitments of federal funding to improve local election administration. With a sustained source of federal funds to budget and plan for technology upgrades, the market will be stable, rather than going through the upheaval of mergers and dissolutions we have recently seen. Elections are perhaps the most poorly funded of all of the critical elements of democracy in the U.S., and we get what we pay for.

Joseph Lorenzo Hall is a postdoctoral researcher at the UC Berkeley School of Information and a visiting postdoctoral fellow at the Princeton Center for Information Technology Policy. His Ph.D. thesis examined electronic voting as a critical case study in the transparency of digital government systems.


Legislation.gov.uk

free access to law, Legal descriptive metadata, Legal identifiers, Legal knowledge representation, Legal metadata, Legal XML, Legislative information systems, Public access to legal information, Semantic Web and law

Aug 15, 2010

The launch of legislation.gov.uk by The [UK] National Archives marks a step change in public access to a primary source of legal information for citizens in the UK. Legislation.gov.uk is extensive, covering the four jurisdictions that make up the United Kingdom (England, Scotland, Wales and Northern Ireland) and over 800 years of history.

John Sheridan, Head of e-Services and Strategy at The National Archives, writes:

First, some background

We had two objectives with legislation.gov.uk: to deliver a high quality public service for people who need to consult, cite, and use legislation on the Web; and to expose the UK’s Statute Book as data, for people to take, use, and re-use for whatever purpose or application they wish. In particular, our aim was to show how the statute book can contribute to the growing Web of data as well as to the Web of documents.

Legislation.gov.uk replaces two predecessor services the UK government set up to provide access to legislation. The first was created by Her Majesty’s Stationery Office (HMSO), later to become the Office of Public Sector Information (OPSI), which is responsible for the official publication of legislation, and the London, Belfast and Edinburgh Gazettes. The functions of HMSO have been operating from The National Archives since 2006, including the provision of public access to legislation online. HMSO started publishing new legislation on the Web in 1996. Where HMSO and later OPSI provided access to legislation as it was enacted or made, a second service was developed, to provide access to the UK Statute Law Database. This contains revised versions of primary legislation, showing how they have changed over time.

Browsing the many different types of legislation in the UK

As in the United States, most lawyers in the UK rely on pay-for commercial legal research services. The people using the government’s online legislation service are generally not lawyers, but are drawn from a much wider group of people who need to know, cite, or use legislation as part of their job. These can range from police officers, to head teachers, to citizens defending their rights. Our users are people who need to know what a statute says, and who go looking for it using Google. They then quickly find their way to legislation.gov.uk.

What do people think they are seeing?

Before starting work on legislation.gov.uk, we did some research into the users of both the OPSI service and the UK Statute Law Database service. This research showed that they were very well used (over 1.5 million unique visitors per month to www.opsi.gov.uk), but that most of the people accessing legislation on the Web were not clear about the status of the material they were looking at. Our research showed that many people using legislation online assume that what they are looking at is both current and in force, simply because it is on the Web and available from an official source. Often users were accessing the original or as-enacted version of a statute, not knowing that they should be looking at the revised version, or that a revised version even existed.

Intuitive presentation

Our job is to present legislative material in such a way that the context and status of the information are clear. Legislation is complicated to understand; for example, an Act may have multiple sections, each with a different commencement date, or the Act may have prospective provisions. With legislation.gov.uk we have tried to develop a user interface that makes the status of each Act clear, so people know whether the statute they are viewing is current and in force. The usability challenge is to align what people think they are seeing with what they are actually looking at. We have done this by presenting both an original (see, e.g., here) and a latest-available version (see, e.g., here) of each Act, and a toggle between the two.

For more advanced users there is a timeline (see, e.g., here) which can be turned on to see how the legislation has changed and to navigate through an Act at particular points in time, including future or prospective versions.

Point in time navigation and the timeline

Open data

On the surface, legislation.gov.uk is an attractive Website, providing simple and direct access to legislation; at legislation.gov.uk people can view whole Acts, or a particular section, in either HTML (see, e.g., here) or in a print version in PDF (see, e.g., here). To achieve this, under the hood two very different sources of data have been combined. The data model for the original (or as-enacted) versions of legislation is largely driven by the typographic layout of legislative documents. For revised legislation, the data model is largely driven by version control, the management of multiple versions of different segments of a statute at different points in time. Reconciling these two different data models was a prerequisite step to developing our system.

An ‘on the fly’ created PDF

We aimed to make legislation.gov.uk a source of open data from the outset. The importance of open legal data is made powerfully by people like Carl Malamud and the Law.Gov campaign. Our desire to make the statute book available as open data motivated a number of technology choices we made. For example, the legislation.gov.uk Website is built on top of an open Application Programming Interface (API). The same API is available for others to use to access the raw data.

Using the API

The simplest way to get hold of the underlying data on legislation.gov.uk is to go to a piece of legislation on the Website, either a whole item, or a part or section, and just append /data.xml or /data.rdf to the URL. So, the data for, say, Section 1 of the Communications Act 2003, which is at http://www.legislation.gov.uk/ukpga/2003/21/section/1, is available at http://www.legislation.gov.uk/ukpga/2003/21/section/1/data.xml. We have taken a similar approach with lists, both in browse and search results. When looking at any list of legislation on legislation.gov.uk, it is easy to view the data. Simply append /data.feed to return that list in ATOM. (See, e.g., here.)
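
As a minimal sketch of the URL convention just described, the Python helper below appends the data suffix to a page URL. The function name data_url is our own invention, not part of the API. Nothing is fetched here, so the example runs without network access.

```python
def data_url(page_url: str, fmt: str = "xml") -> str:
    """Return the data URL for a legislation.gov.uk page URL.

    fmt is "xml" or "rdf" for a piece of legislation, or "feed" for a
    list (browse or search results), following the suffixes in the text.
    """
    if fmt not in ("xml", "rdf", "feed"):
        raise ValueError(f"unsupported format: {fmt}")
    # Drop any trailing slash so the suffix is appended exactly once.
    return page_url.rstrip("/") + "/data." + fmt


# Section 1 of the Communications Act 2003, as in the example above.
section = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
print(data_url(section))         # the underlying XML
print(data_url(section, "rdf"))  # the RDF for the same section
```

The resulting URL can then be passed to any HTTP client to retrieve the raw data.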

Open standards have played an important role throughout the development of legislation.gov.uk. All the data is held in XML, using a native XML database. The application logic is similarly constructed using open standards, in XSLTs and XQueries. Data and application portability were key objectives. We made considerable use of open source software like Orbeon Forms, Squid, and Apache.

The XML conforms to the Crown Legislation Markup Language (CLML) and associated schema. More general interchange formats for legislation such as CEN MetaLex lack the expressive power we need for UK legislation, but could relatively easily be wrapped around the XML we are making available. We have sought to surface richer metadata about legislation using RDF, but we would welcome feedback from users of the XML data about whether a MetaLex wrapper would be useful. (Note: We have used the MetaLex vocabulary in our RDF along with FRBR, as discussed below.) Similarly, it should be relatively easy to add a wrapper for the OAI-PMH protocol on top of the API we have built. We are not yet clear who would make use of such a service, if we built one, or whether we should leave the creation of an OAI-PMH interface to others. It is another open issue where we would welcome some feedback.

Persistent URIs

A major influence on legislation.gov.uk was a blog posting by Rick Jelliffe for O’Reilly’s XML.com. Jelliffe writes about something he calls PRESTO. He describes this as a system for legislation and public information in which “all documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs.”

Persistent URIs to pieces of legislation are very important, as they are to sources of law more generally. Initiatives like LegisLink, which Joe Carmel has written about here, attempt to retrofit a reliable naming scheme for legislation onto existing document-based systems. The URN:LEX namespace aims to facilitate the process of creating URIs for legal sources independent of a document’s online availability, location, and access mode.

We wanted to create high quality, persistent URIs for UK legislation from the outset. There are a number of different ways one might assign an unequivocal identifier to a legislative document. We have decided to use HTTP URIs and see no particular advantage in using URNs over HTTP URIs and indeed some disadvantages with URNs. Most importantly, HTTP URIs are actionable names. The advantage is that there is a built-in, ready-made, widely deployed and cost-effective resolution mechanism for resolving the identifier to a document, and a document to a representation. Having said that, we would consider supporting URN:LEX URNs in addition to our own URI Set, and would greatly welcome feedback from the community on this issue, so please do comment if you have a view.

So, it follows, there are three types of URI for legislation on legislation.gov.uk, namely, identifier URIs, document URIs and representation URIs. Identifier URIs are for the ‘concept’ of a piece of legislation, how it was, how it is, and how it will be. (See, e.g., here.) Our use of these follows the Linked Data principles: the identifier URI is for a so-called non-information resource, something which can’t be conveyed in an electronic message. In other words, the URI is for the notion of a piece of legislation, rather than a particular rendition of it in a document. These URIs have been designed following the guidelines the UK Government has created for URI Sets, which our work helped to shape.

With legislation.gov.uk we support content negotiation, and follow the httpRange-14 resolution approach, responding to a request for the ‘non-information resource’ URI with a 303 response that redirects to a document URI.
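
The toy resolver below sketches this behaviour in Python. It is a simulation under stated assumptions, not the service's actual implementation: the /id/ prefix used to mark identifier URIs, the media-type table, and the use of a Content-Location header are all assumptions made for illustration.

```python
# Assumed mapping from media type to representation suffix (illustrative).
REPRESENTATIONS = {
    "text/html": "",                     # the Web page itself
    "application/xml": "/data.xml",      # the underlying XML
    "application/rdf+xml": "/data.rdf",  # RDF metadata
}

ID_PREFIX = "http://www.legislation.gov.uk/id/"  # assumed identifier prefix
DOC_PREFIX = "http://www.legislation.gov.uk/"


def respond(uri: str, accept: str = "text/html"):
    """Return the (status, headers) a server might send for a GET on uri."""
    if uri.startswith(ID_PREFIX):
        # Identifier URI: the 'notion' of the legislation cannot be sent
        # over the wire, so answer 303 See Other and name a document URI.
        return 303, {"Location": DOC_PREFIX + uri[len(ID_PREFIX):]}
    if accept not in REPRESENTATIONS:
        return 406, {}  # no representation matches the Accept header
    # Document URI: choose the representation the client asked for.
    return 200, {
        "Content-Type": accept,
        "Content-Location": uri + REPRESENTATIONS[accept],
    }


status, headers = respond("http://www.legislation.gov.uk/id/ukpga/2003/21/section/1")
print(status, headers["Location"])
print(respond(headers["Location"], accept="application/rdf+xml"))
```

The point of the two-step exchange is that one name can stand for the legislation itself while each document and format still has an address of its own.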

Our document URIs refer to particular documents on the Web, for example the current, in-force version of a particular section of an Act. (See, e.g., here.) Crucially there are also point-in-time URIs for documents, which show how that Act stood on a particular date (/yyyy-mm-dd) (see, e.g., here), or how it was when originally made (/enacted) (see, e.g., here). For any document we can return different representations or formats: a Web page on legislation.gov.uk, the underlying XML, a PDF, an HTML snippet, or even some RDF metadata. We recommend that people cite UK legislation in HTML by pointing to the identifier URI and by using the rel="cite" attribute in the anchor tag.
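
A short Python sketch of these two conventions follows. The helper names versioned_uri and cite_link are hypothetical. The URI passed to cite_link in the example is the section URL used earlier in this post, standing in for an identifier URI, whose real form may differ.

```python
from datetime import date
from html import escape
from typing import Optional, Union


def versioned_uri(document_uri: str,
                  version: Optional[Union[date, str]] = None) -> str:
    """Append a version suffix to a document URI.

    version=None       -> the document URI as given
    version="enacted"  -> the text as originally made (/enacted)
    version=date(...)  -> the text as it stood on that date (/yyyy-mm-dd)
    """
    base = document_uri.rstrip("/")
    if version is None:
        return base
    if isinstance(version, date):
        return f"{base}/{version.isoformat()}"
    if version == "enacted":
        return f"{base}/enacted"
    raise ValueError(f"unsupported version: {version!r}")


def cite_link(identifier_uri: str, label: str) -> str:
    """Build an HTML anchor that cites legislation with rel="cite"."""
    href = escape(identifier_uri, quote=True)
    return f'<a rel="cite" href="{href}">{escape(label)}</a>'


doc = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
print(versioned_uri(doc, date(2010, 8, 15)))
print(versioned_uri(doc, "enacted"))
print(cite_link(doc, "Communications Act 2003, section 1"))
```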

Of course, we quickly discovered, it is one thing to suggest a design approach like PRESTO, and quite another to actually implement it. Jeni Tennison, who, working as a consultant to The Stationery Office, devised the URI Set for legislation (and much else about the legislation.gov.uk system), has blogged about the limitations of PRESTO and XPath-based URLs. I hope Jeni will find the time to blog some more about legislation.gov.uk, as there are many stories to be told.

One of the earliest pieces of design work we did for legislation.gov.uk was the URI Set. We wanted to follow PRESTO principles, but also account for changes over time, and for some of the peculiarities of UK legislation, in particular different geographic extents. (See, e.g., here.) PRESTO thinking is very evident on legislation.gov.uk; just look at the URLs as you move through the site.

Linked Data

We were also keen that the UK’s Statute Book make a contribution to the growing Web of Linked Data. The UK government is working hard to publish government data using Linked Data standards as part of work on data.gov.uk. The idea of the Web of Linked Data is to connect related information across the Web based on its meaning. In practice this means creating names for things (by ‘thing’ I mean anything: people, places, ideas) using HTTP, and when someone requests some information about that thing, returning data about it, ideally using RDF.

Legislation can make an important contribution to the Web of Linked Data. First, many important concepts and ideas are formally defined by statute. For example, there are 27 types of school in the UK and each one has a statutory definition. (See, e.g., here and here.) What it means to be a private limited company is again defined by statute, as are the UK’s eight data protection principles. One of our objectives with legislation.gov.uk is to enable people creating vocabularies and ontologies to exploit these definitions. This can be done, for example, by using the skos:definition property, to link terms in a vocabulary to the statute. The idea is to ease the process of rooting the Semantic Web in legally defined concepts. Part of the value of this linking is that it enables automatic checking to determine whether a part of the statute book has been repealed, in which case the related concept no longer exists. Crucially, legislation.gov.uk gives accurate information about when a section is repealed, by what piece of legislation, and when that repeal comes into force.
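
As a rough illustration of this kind of linking, the Python sketch below writes one N-Triples statement per term and then checks which terms have lost their statutory basis. The vocabulary URIs are invented, the statute URIs are placeholders rather than real definitions, and the set of repealed provisions is supplied by hand. A real application would derive repeal status from the published data.

```python
SKOS_DEFINITION = "http://www.w3.org/2004/02/skos/core#definition"


def definition_triple(term_uri: str, statute_uri: str) -> str:
    """Return one N-Triples line linking a term to its defining provision."""
    return f"<{term_uri}> <{SKOS_DEFINITION}> <{statute_uri}> ."


def terms_without_basis(definitions: dict, repealed_uris: set) -> list:
    """Return the terms whose defining provision has been repealed."""
    return sorted(term for term, statute in definitions.items()
                  if statute in repealed_uris)


# Hypothetical vocabulary: term URI -> URI of the provision defining it.
definitions = {
    "http://example.org/vocab/term-a":
        "http://www.legislation.gov.uk/ukpga/2003/21/section/1",  # placeholder
    "http://example.org/vocab/term-b":
        "http://example.org/legislation/some-repealed-section",   # placeholder
}
repealed = {"http://example.org/legislation/some-repealed-section"}

for term, statute in definitions.items():
    print(definition_triple(term, statute))
print(terms_without_basis(definitions, repealed))
```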

At the moment, the RDF from legislation.gov.uk is limited to largely bibliographic information. We have made use of the Functional Requirements for Bibliographic Records (FRBR) and the MetaLex vocabularies, primarily to relate the different types of resource we are making available. FRBR has the notion of a work, expressions of that work, manifestations of those expressions, and items. Similarly, MetaLex has the concepts of a BibliographicWork and BibliographicExpression. In the context of legislation.gov.uk, the identifier URIs relate to the work. Different versions of the legislation (current, original, different points in time, or prospective) relate to different expressions. The different formats (HTML, HTML Snippets, XML, and PDF) relate to the different manifestations. We have also made extensive use of Dublin Core Terms, for example to reflect that different versions apply to different geographic extents. This is important as, for example, the same section of a statute may have been amended in one way as it applies in Scotland and in another way for England and Wales. We think FRBR, MetaLex, and Dublin Core Terms have all worked well, individually and in combination, for relating the different types of resource that we are making available.
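
The mapping described above can be pictured with a few Python data classes. This is only a sketch of the work, expression, and manifestation layering; the class and field names are ours, the identifier URI is a placeholder, and none of this is the RDF actually published.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifestation:
    fmt: str  # "HTML", "HTML snippet", "XML" or "PDF"


@dataclass
class Expression:
    version: str  # "current", "enacted", a yyyy-mm-dd date, or "prospective"
    extent: str   # geographic extent the version applies to
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    identifier_uri: str  # the identifier URI relates to the work
    expressions: List[Expression] = field(default_factory=list)

    def versions_for(self, extent: str) -> List[str]:
        """List the versions of this work that apply in a given extent."""
        return [e.version for e in self.expressions if e.extent == extent]


formats = [Manifestation(f) for f in ("HTML", "HTML snippet", "XML", "PDF")]
section = Work(
    identifier_uri="http://example.org/id/some-act/section/1",  # placeholder
    expressions=[
        Expression("enacted", "England and Wales", formats),
        Expression("enacted", "Scotland", formats),
        # The same section, amended differently in each jurisdiction.
        Expression("current", "England and Wales", formats),
        Expression("current", "Scotland", formats),
    ],
)
print(section.versions_for("Scotland"))
```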

One challenge we have is with changes to legislation that have yet to be applied to the data by the editorial team. Since we know what these effects are, we have also tried to represent this in RDF. We have used the MetaLex vocabulary to do this, but the result is complicated to interpret, and thus we suspect difficult for users of the data. MetaLex does not aid the elegant expression of amendment information (such as: statute A is changed by statute B, but only when commencement order C brings that change into force). We will be developing our own light-weight ontology for expressing some of these relationships, with the primary focus on ease of querying our data, rather than creating an ontology with the expressive power to be a cross-jurisdictional model.

It should then be possible to align this ontology with others post hoc. Our current use of RDF — and the potential to do more — is another issue where we would welcome feedback from the community.

Early adopters

People have already started to make use of the legislation.gov.uk URIs to support their Linked Data. One example is a project by ESD Toolkit. They have created a SKOS vocabulary for all the different types of service that Local Authorities need to provide. They have linked this vocabulary to the powers and duties placed on Local Authorities in the legislation, using legislation.gov.uk identifier URIs. They have also used the API to pull back some of the text of the relevant statutes.

The future

We think there is huge potential over the next few years in the development of “accountable systems”. These are systems that are explicitly aware of statutory and other legal requirements and are able to process information explicitly in a way that complies with the (ever-changing) law. Here the legislation URIs can help enormously, either for people seeking to develop such accountable systems or any time someone wants to integrate an external system with the official source for statutory information. If the API is used in this way, we will need to consider carefully whether, and if so, how, the data is authenticated. We are not currently supplying digitally signed versions of UK legislation (unlike the GPO in the US) but we will be supporting the use of HTTPS, to provide a reasonable level of secure access to the data. However, if the data starts to be increasingly used in a new generation of accountable systems, we may need to address authenticity, with a view to increasing the guarantees we can make over the data.

There is much more we can do with legislation as data. Parts of the statute book are surprisingly well structured. For example, every year there are one or more Appropriation Acts. These typically contain a schedule with a table listing each government department, the amount allocated to it by Parliament for the year, and what that department’s objectives are (see, e.g., here). It wouldn’t take much to create an XSLT stylesheet just for these tables, working from the XML provided by the API, to extract this data from all the Appropriation Acts and publish it as Linked Data. There are many other examples of almost-structured data in legislation, waiting to be freed by developers, now that they have easy access to the underlying source.
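The kind of extraction described above can be sketched in a few lines. The post suggests XSLT; the sketch below uses Python's standard XML parser instead, purely for illustration. The element names (`row`, `department`, `amount`, `objectives`) and the sample data are invented assumptions and do not reflect the actual XML returned by the legislation.gov.uk API.

```python
import xml.etree.ElementTree as ET

# Invented sample: NOT the real legislation.gov.uk schema or real figures.
SAMPLE = """
<schedule>
  <table>
    <row>
      <department>Department X</department>
      <amount>1,250,000</amount>
      <objectives>Objective text for Department X</objectives>
    </row>
    <row>
      <department>Department Y</department>
      <amount>300,000</amount>
      <objectives>Objective text for Department Y</objectives>
    </row>
  </table>
</schedule>
"""


def extract_allocations(xml_text):
    """Turn each table row into a dict of department, amount, and objectives."""
    root = ET.fromstring(xml_text)
    allocations = []
    for row in root.iter("row"):
        allocations.append({
            "department": row.findtext("department", default="").strip(),
            # Amounts are printed with thousands separators; drop them.
            "amount": int(row.findtext("amount", default="0").replace(",", "")),
            "objectives": row.findtext("objectives", default="").strip(),
        })
    return allocations
```

Each resulting dict could then be serialised as RDF or any other Linked Data format; that step is left out here because it depends on the vocabulary chosen.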

We see this as a start. There is still much to do if we are to realise the potential of the statute book as a public source of data. We are aiming to improve the modelling and the quantity of RDF data we make available about legislation, but it’s what others will do with the data that is really interesting. Now that the UK has opened its statute book as Linked Data, we are keen to share our work with other governments, and to engage with academics in the legal informatics community and others with an interest in exploiting this rich source of information.

John Sheridan is Head of e-Services and Strategy at The [UK] National Archives, where he leads the team responsible for legislation.gov.uk. He is a specialist in official publishing on the Web, and in using Linked Data standards for government information. He also co-chairs the W3C eGovernment Interest Group.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

IT and the Access to Justice Crisis

Applications, comparative, Dissertations, free access to law, knowledge management, Legal information behavior, Legal information systems for pro se litigants, Legal information systems for self represented litigant, Legal knowledge management, Nonlawyers' legal information behavior, Nonlawyers' legal information needs, Pro se litigants, Public access to legal information, Self represented litigants

Aug 01 2010

This post explores ways in which information technology (IT) can enhance access to justice. What does it mean when we talk about “the access to justice crisis,” and how can information technology help to resolve it? The discussion that follows is based on my 2009 book, Technology for Justice: How Information Technology Can Support Judicial Reform, particularly Part 4, on the role of information and IT in access to justice.

The normative framework for access to justice

International conventions guarantee access to a court. Everyone is entitled to a fair and public hearing by an independent tribunal in the determination of their civil rights and obligations or of any criminal charge against him or her, according to The International Covenant on Civil and Political Rights (article 14) and regional conventions like The European Convention on Human Rights (article 6). In practice, however, the normative framework for access to justice does not provide us with clearly defined concepts.

The major barriers to access to justice identified in the scholarly literature are:

Distance, which can be a factor impeding access to courts. In many countries, courts are concentrated in the main urban centers or in the capital.
Language barriers, which are present when justice seekers use a language that is different from the language of the courts.
Physical challenges, like impaired sight and hearing and motor and cognitive impairments; these barriers to access are an emerging topic in the debate on technology support in courts.

These first three factors are all relatively straightforward and do not strike at the heart of the legal process.

Cost, for instance lawyers’ fees, court fees and other components of the price of access to justice, in many forms, has been identified as a factor affecting access to courts. However, cost is extremely hard to research and subject to a lot of ramifications. Because of this complexity, cost will not be discussed directly in this post.
Lack of information and knowledge, lack of familiarity with the court process, the complexity of legal and administrative systems, and lack of access to legal information are commonly identified factors (Cotterrell, The Sociology of Law p. 251; Hammergren, Envisioning Reform: Improving Judicial Performance in Latin America, p. 136). They are related because they all refer to the availability of information. They are the starting point for our discussion.

Potentially, information on the Internet can provide some form of solution for these problems, in two ways. First, access to information can support fairer administration of justice by equipping people to respond appropriately when confronted with problems with a potentially legal solution. Access to information can compensate, to some extent, for the disadvantage one-shotters experience in litigation, thereby increasing their chance of obtaining a fair decision. Second, the Internet provides a channel for legal information services, although experience with such online service provision is limited in most judiciaries. The discussion here will therefore focus on access to legal information and knowledge, and the next few paragraphs deal with lack of information and knowledge as a barrier to access to justice. The first step is to identify the barriers.

Knowledge and information barriers to access to justice

What are the information barriers individuals experience when they encounter problems with a potentially legal solution? We need empirical evidence to find an answer to this question, and fortunately some excellent research has been done, which may help us. In the U.K., Hazel Genn led a team that researched what people do and think about going to law. Their 1999 report is called Paths to Justice. A similar exercise led by Ben van Velthoven and Marijke ter Voert in The Netherlands, called Geschilbeslechtingsdelta 2003 (Dispute Resolution Delta 2003), was published in 2004. Although there are some marked differences between them, both studies looked at how people deal with “justiciable problems”: problems that are experienced as serious and have a potentially legal solution. Analysis of empirical evidence of people and their justiciable problems in England and Wales and The Netherlands produced the following findings with regard to these barriers:

Inaction in the face of a justiciable problem because of lack of information and knowledge occurs in a small percentage of cases.
Unavailability of advice negatively affects dispute resolution outcomes. It lowers the resolution rate. Cases in which people attempted to find advice were resolved with a higher rate of success than those of the self-helpers.
Respecting the inability to find advice: If people go looking for advice, the barriers to finding it have more to do with their own competencies, such as confidence, emotional fortitude, and literacy skills, than with the availability of the advice. In the United Kingdom, about 20 percent of the population is so poor at reading and writing that they cannot cope with the demands of modern life, according to data from the National Literacy Trust. In The Netherlands, the percentage of similarly low literacy is estimated at about 10 percent, according to data from the Stichting Lezen en Schrijven, the Reading and Writing Foundation.
Respecting incompetence in implementing the information received: Different competence levels will affect what can be done with information and advice. Competencies in implementing the information received include, for example, skills such as working out what the problem is, what result is wanted, and how to find help; simple case-recording skills; managing correspondence; confidence and assertiveness; and negotiating skills, according to research reported by Advicenow in 2005. Some people do not want to be empowered by having information available. They want assistance, or even someone to take over dealing with their problem. People with low levels of competence in terms of education, income, confidence, verbal skill, literacy skill, or emotional fortitude are likely to need some help in resolving justiciable problems.
Ignorance about legal rights exists across most social groups. Genn notes that people generally are not educated about their legal rights (Genn p. 102).
Respecting lack of confidence in the legal system and the courts and negative feelings about the justice system, Genn observes that people are unwilling voluntarily to become involved with the courts. People associate courts with criminal justice. People’s image of the courts is formed by media stories about high profile criminal cases (Genn p. 247). This issue is related to the public image of courts, as well as to the wider role of courts as setters of norms.

Information needs for resolving justiciable problems

After identifying knowledge and information barriers, the next step is to uncover needs for information and knowledge related to access to justice. Those needs are most strongly related to the type of problem people experience. The most frequently occurring justiciable problems are simple, easy-to-solve problems, mostly those concerning goods and services. People themselves resolve such problems, occasionally with advice from specialist organizations like the consumers’ unions (e.g., in the U.S., the National Consumers League). For more important, more complex problems, people tend to seek expert help more frequently. The most difficult to resolve are problems involving a longer-term relationship, such as labor or family problems. Any of the problems discussed in this section may lead to a court procedure. However, the problems that are the toughest to resolve are also the ones that most frequently come to court.

The first need people experience is for information on how to solve their problem. In The Netherlands, the primary sources for this type of information are specialized organizations, with legal advice providers in second place. In England and Wales, solicitors are the first port of call, followed by the Citizens’ Advice Bureaux. In both countries, the police are a significant source of information on justiciable problems. This is especially remarkable because the problems researched were not criminal justice issues.

If people require legal information, they primarily need straightforward information about rules and regulations. Next, they look for information about ways to settle and handle disputes once they arise. Information about court procedures is a separate category that becomes relevant only in the event people need to go to court.

Respecting taking their case to court: People need information on how to resolve problems, on rights and duties, and on taking a case to court. The justiciable problems that normally come to court tend to be difficult for people themselves to resolve. These problems are also experienced as serious. Many of them involve long-term relationships: family, employment, neighbors. Therefore, people will tend to go looking for advice. Some of them may need assistance. Most people seek and receive some kind of advice before they come to court.

In summary, information needs in this context are mostly problem-specific. Most problems are resolved by people themselves, sometimes with the help of information, or help in the form of advice or assistance. The help is provided by many different organizations, but mostly by specialized organizations or providers of legal aid and alternative dispute resolution (ADR).

Different dispute resolution cultures

There are, besides these general trends, interesting differences between England and Wales and The Netherlands. The results with regard to dispute outcome, for instance, show the following:

The Netherlands has fewer unresolved disputes, more disputes resolved by agreement, and a rate of resolution by adjudication half that of England and Wales. It looks as if there is more capacity for resolving justiciable problems in Dutch society than there is in society in England and Wales. Apart from the legacy of a justice system marked by the propensity to settle differences that Voltaire described in one of his letters, many factors may be at work in The Netherlands to produce a higher level of problem-solving capacity. One probable factor is the level of education and the related competence levels for dealing with problems and the legal framework. The functional illiteracy rate in The Netherlands is only half that of the United Kingdom. Another factor may be a propensity to settle differences by reducing the complexity of problems through policies and routines.

Diversion or access, empowerment or court improvement?

The debate over whether diversion or court improvement should come first as an objective of legal policy has been going on for some time. These are the options under discussion:

Preventing problems and disputes from arising;
Equipping as many members of the public as possible to solve problems when they do arise without needing recourse to legal action;
Diverting cases away from the courts into private dispute resolution forums; and
Enhancing access to legal forums for the resolution of disputes.

Genn argues that it is not an answer to say that diversion and access should be the twin objectives of policy, because they logically conflict. I would like to contribute some observations that could provide a way out of this apparent dilemma.

First, user statistics from the introduction of the online claim service Money Claim Online and the case study in Chapter 2.3 of my book suggest that changes in procedure facilitating access do not in themselves lead to higher caseloads. Changes observed in the caseloads are attributable to market forces in both instances.

The other observation is that Paths to Justice and the Dispute Resolution Delta clearly found that self-help is experienced as more satisfying and less stressful than legal proceedings. Moreover, resolutions are to a large degree problem-specific. A way out of the dilemma could be that specialist organizations that make it their business to provide specific information, advice, and assistance, should enhance their role. There is an empirical basis for this way out in the research reported in Paths to Justice and the Dispute Resolution Delta. Although goods and services problems are largely resolved through self-help, out-of-court settlement, or ADR, nonetheless a fair number of them still come to court. Devising ways to assist individuals in informal problem solving and diverting them to other dispute resolution mechanisms can keep still more of these problems out of court. Even in matters for which a court decision is compulsory, like divorce, mediation mechanisms can sort out differences before the case is filed. Clearly, information on the Internet will provide an entry point for all of these dispute resolution services. Online information can thus help to keep as many problems out of court as possible. None of this should keep us from making it less stressful to go to court when that is necessary. Information can help reduce people’s stress, even as it improves their chances of achieving justice. The Internet can be a vehicle for this kind of information service, too.

Taking up this point, the next section focuses on courts and how information technology, particularly the Internet, can support them in their role of information providers to improve access to justice. Two strains concerning the role of information in access to justice run through this theme: information to keep disputes out of court, and information on taking disputes to court.

Information to keep disputes out of court

An almost implicit understanding in the research literature is that parties with information on the “rules of thumb” of how courts deal with types of disputes will settle their differences more easily and keep them out of court. Such information supports settlement in the shadow of the law. Most of this type of settlement will be done with the support of legal or specialist organizations. In the pre-litigation stage, information about the approaches judges and courts generally take to specific types of problems can help the informal resolution of those problems. This will require that information about the way courts deal with those types of problems becomes available. Some of the ways in which courts deal with specific issues are laid down in policies. Moreover, judicial decision making is sometimes assisted by decision support systems reflecting policies. In order to help out-of-court settlement, policies and decision support systems need to be available publicly.

Information on taking disputes to court

If a dispute needs to come to court, information can reduce the disadvantage one-shotters have in dealing with the court and with legal issues. The disadvantage of one-shotters (those who come to court only occasionally) as against repeat players, who use courts as a matter of business, was enunciated by Marc Galanter in his classic 1974 article, Why the Haves Come Out Ahead: Speculations on the Limits of Legal Change. Access to information for individual, self-represented litigants increases their chances of obtaining just and fair decisions. Litigants need information on how to take their case to court. This information needs to be legally correct, as well as effective. By “effective,” I mean that the general public can understand the information, and that someone after reading it will (1) know what to do next, and (2) be confident that this action will yield the desired result. In a case study, I have rated several court-related Web sites in the U.K. and in The Netherlands on those points, and found most of them wanting. My test was done in 2008, and most of the sites have since changed or been replaced. And although the U.K. Court Service leaflet D 184 on how to get a divorce got the best score, my favorite Web site is Advicenow.

Such an information service requires a proactive, demand-oriented attitude from courts and judiciaries. Multi-channel information services, such as a letter from the court with reference to information on the court’s or judiciary’s Web site, can meet people’s information needs.

Beyond information push

Other forms of IT, increasingly interactive, can provide access to court. [Editor’s note: Document assembly systems for self-represented litigants are a notable example.] Not all of them require full-scale implementation of electronic case management and electronic files. In order to be effective for everyone, the information services discussed will require human help backup. There are also technologies to provide this, but they may still not be sufficient for everyone. The information services discussed here, in order to be effective, will need to be provided by a central agency for the entire legal system. A final finding is the importance of public trust in the courts in order for individuals to achieve access to justice. Judiciaries can actively contribute to improved access to justice in this field by ensuring that correct information about their processes is furnished to the public.

In summary, access to justice can be effectively improved with IT services. Such services can help to ameliorate the access-to-justice crisis by keeping disputes out of court. The information services identified here should serve the purpose of getting justice done. They should not keep people from getting the justice they deserve by preventing them from taking a justified concern to court. If people need to go to court, information services can help them deal with the courts more effectively.

[Editor’s Note: A very useful list of resources about applying technology to access to justice appears at the technola blog.]

Dory Reiling, mag. iur. Ph.D., is a judge in the first instance court in Amsterdam, The Netherlands. She was the first information manager for The Netherlands’ Judiciary, and a senior judicial reform expert at The World Bank. She is currently on the editorial board of The Hague Journal on the Rule of Law and on the Board of Governors of The Netherlands’ Judiciary’s Web site Rechtspraak.nl. She has a Weblog in Dutch, and an occasional Weblog in English, and can be followed on Twitter at @doryontour.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

LegisLink.Org: Simplified Human-Readable URLs for Legislative Citations

information retrieval, Legal identifiers, Legislative information systems

Jul 15 2010

The Problem: URLs and Internal Links for Legislative Documents

Legislative documents reside at various government Websites in various formats (TXT, HTML, XML, PDF, WordPerfect). URLs for these documents are often too long or difficult to construct. For example, here is the URL for the HTML format version of bill H.R. 3200 of the 111th U.S. Congress:

http://www.gpo.gov/fdsys/pkg/BILLS-111hr3200IH/html/BILLS-111hr3200IH.htm

More importantly, “deep” links to internal locations (often called “subdivisions” or “segments”) within a legislative document (the citations within the law, such as section 246 of bill H.R. 3200) are often not supported, or are non-intuitive for users to create or use. For most legislative Websites, users must click through or fill out forms and then scroll or search for the specific location in the text of legislation. This makes it difficult if not impossible to create and share links to official citations. Enabling internal links to subdivisions of legislative documents is crucial, because in most situations, users of legal information need access only to a subdivision of a legal document, not to the entire document.

A Solution: LegisLink

LegisLink.org is a URL Redirection Service with the goal of enabling Internet access to legislative material using citation-based URLs rather than requiring users to repeatedly click and scroll through documents to arrive at a destination. Let’s say you’re reading an article at CNN.com and the article references section 246 in H.R. 3200. If you want to read the section, you can search for H.R. 3200 and more than likely you will find the bill and then scroll to find the desired section. On the other hand, you can use something like LegisLink by typing the correct URL. For example: http://legislink.org/us/hr-3200-ih-246.
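To illustrate how a citation-based URL of this kind can be read by software, here is a small Python sketch that splits a LegisLink-style URL into its parts. The grammar is inferred only from the example URLs given in this post (`/us/hr-3200`, `/us/hr-3200-ih-246`, and a variant ending in `-pdf`); the set of format suffixes and the field names are assumptions, and the real LegisLink scripts (written in Perl) may parse citations differently.

```python
from urllib.parse import urlparse

# Assumed format suffixes; only "pdf" appears in the post's examples.
FORMATS = {"pdf", "html", "xml", "txt"}


def parse_legislink(url):
    """Split a LegisLink-style URL into jurisdiction and citation parts."""
    path = urlparse(url).path.strip("/")
    jurisdiction, _, citation = path.partition("/")
    tokens = citation.split("-")
    if len(tokens) < 2 or not tokens[1].isdigit():
        raise ValueError(f"unrecognised citation: {citation!r}")

    parsed = {
        "jurisdiction": jurisdiction,   # e.g. "us"
        "doc_type": tokens[0],          # e.g. "hr"
        "number": int(tokens[1]),       # e.g. 3200
        "version": None,                # e.g. "ih"
        "section": None,                # e.g. "246"
        "format": None,                 # e.g. "pdf"
    }
    for token in tokens[2:]:
        if token in FORMATS:
            parsed["format"] = token
        elif token.isdigit():
            parsed["section"] = token
        else:
            parsed["version"] = token
    return parsed
```

For `http://legislink.org/us/hr-3200-ih-246`, the function returns jurisdiction "us", document type "hr", number 3200, version "ih", and section "246", with no format specified.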

Benefits

There are several advantages of having a Web service that resolves legislative and legal citations.

(1) LegisLink provides links to citations that are otherwise not easy for users to create. In order to create a hyperlink to a location in an HTML or XML file, the publisher must include unique anchor or id attributes within their files. Even if these attributes are included, they are often not exposed as links for Internet users to re-use. On the other hand, Web-based software can easily scan a file’s text to find a requested citation and then redirect the user to the requested location. For PDF files, it is possible to create hyperlinks to specific pages and locations when using the Acrobat plug-in from Adobe. In these cases, hyperlinks can direct the user to the document location at the official Website.

For example, here is the LegisLink URL that links directly to section 246 within the PDF version of H.R. 3200: http://legislink.org/us/hr-3200-ih-246-pdf

In cases where governments have not included ids in HTML, XML or TXT files, LegisLink can replicate a government document on the LegisLink site, insert an anchor, and then redirect the user to the requested location.
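The replicate-and-insert step can be sketched as follows. This is a minimal Python illustration, not the LegisLink implementation: it assumes that section headings appear in the text in the form "SEC. 246." and that an anchor named `SEC-246` is acceptable, both of which are assumptions made for the example.

```python
import re


def insert_section_anchor(html, section):
    """Insert a named anchor before the heading of the requested section.

    Returns the modified text and the URL fragment to redirect to,
    or the original text and None if the section is not found.
    """
    # Assumed heading form: "SEC. 246." (the trailing period prevents
    # a request for section 24 from matching section 246).
    pattern = re.compile(r"SEC\.\s*" + re.escape(section) + r"\.")
    match = pattern.search(html)
    if match is None:
        return html, None
    anchor = f'<a name="SEC-{section}"></a>'
    modified = html[:match.start()] + anchor + html[match.start():]
    return modified, f"#SEC-{section}"
```

The redirect would then point at the replicated copy with the returned fragment appended, so the browser lands directly on the requested section.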

(2) LegisLink makes it easy to get to a specific location in a document, which saves time. Law students and presumably all law professionals are relying on online resources to a greater extent than ever before. In 2004, Stanford Law School published the results of their survey that found that 93% of first year law students used online resources for legal research at least 80% of the time.

(3) Creating and maintaining a .org site that acts as an umbrella for all jurisdictions makes it easier to locate documents and citations, especially when they have been issued by a jurisdiction with which one is unfamiliar. Legislation and other legal documents tend to reside at multiple Websites within a jurisdiction. For example, while U.S. federal legislation (i.e., bills and slip laws) is stored at thomas.loc.gov (HTML and XML) and gpo.gov (at FDsys and GPO Access) (TXT and PDF), the United States Code is available at uscode.house.gov and at gpo.gov (FDsys and GPO Access), while roll call votes are at clerk.house.gov and www.senate.gov. Governments tend to compartmentalize activities, and their Websites reflect much of that compartmentalization. LegisLink.org or something like it could, at a minimum, provide a resource that helps casual and new users find where official documents are stored at various locations or among various jurisdictions.

(4) LegisLinks won’t break over time. Governments sometimes change the URL locations for their documents. This often breaks previously relied-upon URLs (a result that is sometimes called “link rot”). A URL Redirection Service lessens these eventual annoyances to users because the syntax for the LegisLink-type service remains the same. To “fix” the broken links, the LegisLink software is simply updated to link to the government’s new URLs. This means that previously published LegisLinks won’t break over time.
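The reason published links survive is that the citation syntax and the destination URL are kept apart. A minimal Python sketch, assuming one URL template per jurisdiction: the first template generalises the single GPO URL quoted earlier in this post, which is an assumption about how other bills are addressed, and the replacement template uses a made-up `example.gov` address.

```python
# One template per jurisdiction. The "us" pattern is inferred from the
# single FDsys URL quoted in the post; treat it as an assumption.
URL_TEMPLATES = {
    "us": ("http://www.gpo.gov/fdsys/pkg/"
           "BILLS-{congress}{doc_type}{number}{version}/html/"
           "BILLS-{congress}{doc_type}{number}{version}.htm"),
}


def resolve(jurisdiction, congress, doc_type, number, version):
    """Build the current government URL for a bill citation."""
    template = URL_TEMPLATES[jurisdiction]
    return template.format(
        congress=congress,
        doc_type=doc_type.lower(),
        number=number,
        version=version.upper(),
    )
```

If the government later moves its documents, only the template changes, for example `URL_TEMPLATES["us"] = "https://example.gov/bills/{congress}/{doc_type}{number}{version}.htm"` (a hypothetical address), and every previously published citation URL resolves to the new location without being rewritten.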

(5) A LegisLink-type service does not require governments to expend resources. The goal of LegisLink is to point to government or government-designated resources. If those resources contain anchors or id attributes, they can be used to link to the official government site. If the documents are in PDF (non-scanned), they can also be used to link to the official government site. In other cases, the files can be replicated temporarily and slightly manipulated (e.g., the tag <a name=SEC-#> can be added at the appropriate location) in order to achieve the desired results.

Alternatives

While some Websites have implemented Permalinks and handle systems (e.g., the Library of Congress’s THOMAS system), these systems tend to link users to the document level only. They also generally only work within a single Internet domain, and casual users tend not to be aware of their existence.

Other technologies at the forefront of this space include recent efforts to create a URN-based syntax for legal documents (URN:LEX). To quote from the draft specification, “In an on-line environment with resources distributed among different Web publishers, uniform resource names allow simplified global interconnection of legal documents by means of automated hypertext linking.”

The syntax for URN:LEX is a bit lengthy, but because of its specificity, it needs to be included in any universal legal citation redirection service. The inclusion of URN:LEX syntax does not, however, mitigate the need for additional simpler syntaxes. This distinction is important for the users who just want to quickly access a particular legislative document, such as a bill that is mentioned in a news article. For example, if LegisLink were widely adopted, users would come to know that the URL http://legislink.org/us/hr-3200 will link to the current Congress’s H.R. 3200; the LegisLink URL is therefore readily usable by humans. And use of LegisLink for a particular piece of legislation is to some extent consistent with the use of URN:LEX for the same legislation: for example, a URN:LEX-based address such as http://legislink.org/urn:lex/us/federal:legislation:2009;111.hr.3200@official;thomas.loc.gov:en$text-html could also lead to the current Congress’s H.R. 3200. A LegisLink-type service can include the URN:LEX syntax, but the URN:LEX syntax cannot subsume the simplified syntax being proposed for LegisLink.org.

The goals of Citability.org, another effort to address these issues, call for the replication of all government documents for point-in-time access. In addition, Citability.org envisions including date and time information as part of the URL syntax in order to provide access to the citable content that was available at the specified date and time. LegisLink has more modest goals: it focuses on linking to currently provided government documents and locations within those documents. Since legislation is typically stored as separate, un-revisable documents for a given legislative term (lasting 2 years in many U.S. jurisdictions), the use of date and time information is redundant with legislative session information.

The primary goal of a legislative URL Redirection Service such as LegisLink.org is to expedite the delivery of needed information to the Internet user. In addition, the LegisLink tools used to link to legislative citations in one jurisdiction can be re-used for other jurisdictions; this reduces developers’ labor as more jurisdictions are added.

Next Steps

The LegisLink.org site is organized by jurisdiction: each jurisdiction has its own script, and all scripts can re-use common functions. The prototype is currently being built to handle the United States (us), Colorado (us-co), and New Zealand (nz). The LegisLink source code is available as text files at http://legislink.org/code.html.
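The organisation described above, one script per jurisdiction sharing common functions, can be pictured as a small dispatch table. The following Python sketch is only an illustration of that layout: the handler names and the shared `normalise` helper are hypothetical, the handlers do no real resolving, and the actual LegisLink code is written in Perl.

```python
def normalise(citation):
    """Shared helper: tidy a citation before a handler interprets it."""
    return citation.strip().lower()


def handle_us(citation):
    # A real handler would resolve the citation to a federal document URL.
    return ("us", normalise(citation))


def handle_us_co(citation):
    return ("us-co", normalise(citation))


def handle_nz(citation):
    return ("nz", normalise(citation))


# Jurisdiction codes taken from the post; adding a jurisdiction means
# adding one handler and one entry here.
HANDLERS = {"us": handle_us, "us-co": handle_us_co, "nz": handle_nz}


def dispatch(path):
    """Route a path such as '/us/hr-3200' to its jurisdiction handler."""
    jurisdiction, _, citation = path.strip("/").partition("/")
    handler = HANDLERS.get(jurisdiction)
    if handler is None:
        raise KeyError(f"no handler for jurisdiction {jurisdiction!r}")
    return handler(citation)
```

Because each handler only has to understand its own jurisdiction's citation forms, the shared helpers carry over unchanged as jurisdictions are added.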

The challenges of a service like LegisLink.org are: (1) determining whether the legal community is interested in this sort of solution, (2) finding legislative experts to define the needed syntax and results for jurisdictions of interest, and (3) finding software developers interested in helping to work on the project.

This project cannot be accomplished by one or two people. Your help is needed, whether you are an interested user or a software developer. At this point, the code for LegisLink is written in Perl. Please join the LegisLink wiki site at http://legislink.wikispaces.org to add your ideas, to discuss related information, or just to stay informed about what’s going on with LegisLink.

Joe Carmel is a part-time consultant and software developer hobbyist. He was previously Chief of the Legislative Computer Systems at the U.S. House of Representatives (2001-2005) and spearheaded the use of XML for the drafting of legislation, the publication of roll call votes, and the creation and maintenance of the U.S. Congressional Biographical Directory.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Rule-Based Legal Information Systems

Legal decision support systems, Legal expert systems, Legal inference engines, Legal knowledge representation, Legal reasoning, Legal rule based systems, Legal rule engines, Modeling legal logic, Modeling legal reasoning, Modeling legal rules, Rule based legal information systems

Jul 01 2010

There has been much discussion on this blog about law-related information retrieval systems, ontologies, and metadata. Today, I’d like to take you into another corner of legal informatics: rule-based legal information systems. I’ll tell you what they are, what their strengths and limitations are, and how they’re made. I’ll also explain why I’m optimistic about their potential to expand public access to law and to improve the way legal expertise is deployed and consumed.

First, what are they?

A rule-based expert system represents knowledge of a particular domain — such as medicine, finance, or law — in the form of “if-then” rules. Here’s an example of a rule:

the employee is entitled to standard FMLA leave IF the employee is an eligible employee AND the reason for the leave is enumerated in 29 U.S.C. § 2612

A rule consists of a bunch of variables (here, three Boolean statements) together with some logical operators (if, then, and, or, not, mathematical operators, etc.). Rules are chained together to form a rulebase, which is basically a database of rules. “Chained together” means that the rules connect to each other: a condition in one rule is the consequent or conclusion in another rule. For example, here’s a rule that links to our first rule:

the reason for the leave is enumerated in 29 U.S.C. § 2612 IF the employee needs to care for a newborn child OR the employee is becoming an adoptive or foster parent OR the employee’s relative has a serious health condition OR the employee cannot perform their job due to a serious health condition

Each of the conditions in this new rule can be defined by yet more rules. And other rules can sprout off of the main rule tree to form a complex web of inference. If we were to visualize such a network of rules, it might begin to look something like this:

The rulebase inputs are shown in blue and the outputs – or “goals” – are highlighted in orange. The core function of the inference engine (or rule engine) is to figure out what conclusions can be drawn from the input facts. Also, given incomplete information, an inference engine will figure out what additional facts are needed in order to reach one of the goals.
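To make the two jobs of an inference engine concrete, here is a minimal sketch in Python that evaluates the two FMLA rules quoted earlier. It is not how OPM or any other rule engine is implemented; the `RULES` dictionary, the `evaluate` function, and the shortened fact names are all assumptions made for illustration. Given some input facts, it either reaches a conclusion or reports which facts are still needed.

```python
# Each goal maps to (operator, conditions). The encoding is illustrative only.
RULES = {
    "entitled to standard FMLA leave": ("AND", [
        "eligible employee",
        "reason enumerated in 29 U.S.C. 2612",
    ]),
    "reason enumerated in 29 U.S.C. 2612": ("OR", [
        "needs to care for a newborn child",
        "becoming an adoptive or foster parent",
        "relative has a serious health condition",
        "cannot perform job due to a serious health condition",
    ]),
}


def evaluate(goal, facts):
    """Return (value, missing).

    value is True, False, or None (not yet determinable);
    missing lists the input facts that would have to be asked next.
    """
    if goal in facts:                      # input fact already supplied
        return facts[goal], []
    if goal not in RULES:                  # input fact nobody has supplied
        return None, [goal]
    operator, conditions = RULES[goal]
    results = [evaluate(c, facts) for c in conditions]
    values = [value for value, _ in results]
    # One False decides an AND; one True decides an OR.
    decisive = False if operator == "AND" else True
    if decisive in values:
        return decisive, []
    if None in values:
        return None, [m for _, needed in results for m in needed]
    return (not decisive), []


value, missing = evaluate(
    "entitled to standard FMLA leave", {"eligible employee": True}
)
print(value, missing)
```

With only the eligibility fact known, the result is undetermined and `missing` holds the four possible reasons for leave, which is exactly the list of questions an interview would ask next.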

Rule-based systems in context

From this extremely simple example we can start to get a sense of the strengths and limitations of rule-based representations of legal knowledge. Let’s start with the strengths. First, the law, to a significant degree, seems to consist of rules, and representing them in a constrained, logical language is fairly straightforward and natural. As a result, rule-based systems are transparent: the system code looks a lot like the text that’s being represented. This “isomorphism” means that you can trace the system logic back to the original source material, easily spot errors, and quickly adapt to changes in the law. Furthermore, rule-based systems can justify their determinations by explaining how they arrived at a particular conclusion and by providing audit trails. It’s also fairly easy for people to interact with rule-based systems, as they integrate well with interviews. In short, it’s relatively easy to put legal knowledge into rule-based systems, easy to maintain it, and easy to get it out.

But all this simplicity comes with a price: the sophistication of the knowledge that can be represented. For one thing, common sense knowledge does not lend itself to simple rule-based representations, as the decades-long Cyc project illustrates. A significant portion of my own rule-authoring effort is spent representing mundane concepts, like figuring out whether a given date falls on a legal holiday or counting the number of weeks in which a given condition is true. Second, there’s the problem of how to model vague or “open-textured” concepts. For instance, if a liability determination turns upon whether a person’s conduct was “reasonable”, the uncertainty and fuzziness of that term can’t be modeled in a way analogous to human thinking. A third limitation facing rule-based systems is the “knowledge acquisition bottleneck.” This is the effort required to codify, test, and validate expert domain knowledge. Part of the challenge derives from the reasons I’ve already mentioned, and part results from the need to capture the knowledge of human subject matter experts who don’t always think in complete and precise “if-then” constructs. Another criticism often leveled at legal expert systems is that law is in essence not rule-based but is instead a fray of competing textual interpretations which cannot be accurately modeled.
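As a small illustration of how much machinery even a "mundane" concept needs, here is a hedged Python sketch of counting the weeks in which a condition is true. The function name `weeks_where_true`, the choice of ISO calendar weeks, and the reading "true on at least one day of the week" are assumptions; a real statute might define a week quite differently.

```python
from datetime import date, timedelta


def weeks_where_true(start, end, condition):
    """Count ISO calendar weeks between start and end (inclusive) that
    contain at least one day on which condition(day) is true."""
    weeks = set()
    day = start
    while day <= end:
        if condition(day):
            iso = day.isocalendar()
            weeks.add((iso[0], iso[1]))  # (ISO year, ISO week number)
        day += timedelta(days=1)
    return len(weeks)


# Hypothetical input: the days on which an employee worked.
days_worked = {date(2010, 7, 1), date(2010, 7, 2), date(2010, 7, 12)}
count = weeks_where_true(
    date(2010, 7, 1), date(2010, 7, 31), lambda d: d in days_worked
)
print(count)
```

Even this toy version has to settle questions the everyday phrase leaves open, such as where a week starts and whether one qualifying day is enough.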

My view is that, even given these limitations, there are still many problems that can be solved by rule-based systems. No one is asking them to solve all legal automation problems, or claiming that all legal knowledge can be represented in the form of rules. (Part of why little attention is paid to these systems today is that they were over-hyped during the artificial intelligence boom of the 1970s and 80s.) But there is a place for them, and that place is quite large even given the semantic confines that I just described. Rule-based systems are ideal for encoding legal principles found in statutes, regulations, and agency decisions — that is, law that’s explicit and knowable, but logically complicated. And there are millions of pages of such law, across thousands of jurisdictions around the world, just waiting to be embedded in rule-based systems.

Let me give you a few examples of what rule-based information systems can do, although chances are that you’ve already encountered one. Perhaps, like millions of American taxpayers, you used TurboTax tax preparation software to file your taxes this year. This and other tax preparation programs interview you about your income and finances, perform a multitude of behind-the-scenes calculations, and then fill out the relevant tax forms for you. I don’t actually know how this software was constructed, but if I were doing it I would absolutely take a rule-based approach. In fact, my team did use a rule engine when tasked to build a tax law advisory system for the IRS. That system, the Interactive Tax Assistant, answers seven common tax questions, is driven by about 1,300 rules, and contains around 200 question screens. Rule-based design can also produce systems like the Australian Visa Wizard, DirectLaw, and The Benefit Bank. Other rule-driven systems work behind the scenes at government agencies and corporations to process claims by making fast, consistent, and transparent decisions.

Available tools

In my view, the premier tool for engineering rule-based legal information systems is Oracle Policy Modeling (OPM, formerly known as Haley Office Rules, RuleBurst, and Softlaw). (Full disclosure: I used to work for Oracle.) OPM lets you write natural language rules that capture statutory text, calculations, date and time-based reasoning, and basic ontological relationships. It has decent debugging and rulebase visualization features (that’s how I created the rule network diagram above), and an excellent regression testing facility. OPM lets you deploy rulebases as Web interviews and integrate them into other computer systems. The major downside to OPM is its cost: I understand the list price to be in the ballpark of $100K per license.

You can also model legal rules using other business rule engines, such as ILOG, Blaze Advisor, JBoss Drools (free), and Jess (free). JBoss Drools has a promising feature that lets you create Domain Specific Languages by mapping natural language expressions to the underlying programming code. You could also use traditional logic programming / expert system languages like Prolog or CLIPS, which are extremely powerful but which do not allow for isomorphic representation of the law. OWL-centric ontology editors such as Protege are also beginning to support rule-based knowledge representation.
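The general idea of mapping natural language expressions to underlying code can be sketched in a few lines of Python. This is not Drools syntax and does not use any Drools API; the `PHRASES` table, the `compile_condition` function, the fact keys, and the example sentences are hypothetical and only show the principle of a phrase-to-code mapping.

```python
import re

# Each entry pairs a sentence pattern with the code it stands for.
PHRASES = [
    (re.compile(r"the employee has worked at least (\d+) months"),
     lambda facts, n: facts["months_worked"] >= int(n)),
    (re.compile(r"the employee has worked at least (\d+) hours"),
     lambda facts, n: facts["hours_worked"] >= int(n)),
]


def compile_condition(sentence):
    """Turn a natural-language condition into a function of the facts."""
    for pattern, predicate in PHRASES:
        match = pattern.fullmatch(sentence)
        if match:
            args = match.groups()
            return lambda facts: predicate(facts, *args)
    raise ValueError(f"no mapping for: {sentence!r}")


condition = compile_condition("the employee has worked at least 12 months")
print(condition({"months_worked": 14}))
```

The rule author writes only the sentence; the mapping table, maintained separately, decides what code that sentence runs.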

To address the lack of freely-available, practical legal modeling tools, I’ve been working on Jureeka.org, a project affiliated with Stanford’s CodeX Center for Computers and Law. Jureeka is an open, Web-based rule authoring platform that lets lawyers, law students, and other subject matter experts represent their knowledge as “if-then” rules. Jureeka then uses the rules to generate jurisdiction-specific interviews, which present the relevant topic in a digestible manner. Its strengths are that it’s completely Web-based, it makes navigation of the rules easy, and it lets rule authors work collaboratively to rapidly develop knowledge bases in a wiki-like fashion. The motivating vision is to provide a way for legal knowledge engineers to build topical rulebases, and then connect these modules together to form an information backbone that drives other IT systems and helps the general public get answers to their legal questions.

Jureeka is very much a work in progress, and I’ll be the first to admit that its main weakness is the oversimplicity of its rule syntax. (For example, I’m currently working on an ontology layer and a way to reason across multiple instances of an object or variable.) But this is the type of knowledge-generating project that I’d like to see a developer community coalesce around.

Future potential

Rule-based programming is not the be-all and end-all of legal informatics, but it does have significant untapped potential. Government agencies are beginning to adopt rule-based legal information systems as a way to better serve the public. I think there are also lucrative opportunities available for law firms to seize the first mover advantage by automating slices of the law of interest to consumers. Rule-based systems can help nonprofit organizations advance their missions by guiding constituents through labyrinthine legal processes. And these systems are of obvious benefit to corporations, which need to comply with a variety of regulations across numerous jurisdictions.

Rule-based systems can also benefit the legislative drafting process. For example, an early incarnation of the OPM software helped the Australian Taxation Office simplify that country’s tax code. In addition to this kind of legislative refactoring (which entails clarifying and reorganizing Rube Goldberg-like legal texts), legislatures could also promulgate law in an “inference-ready” machine readable form. That is, portions of the law could be written in a syntax that both humans and machines can read, making the law not only accessible but executable. I’m not merely referring to high-level metadata; I’m talking about code that is intended to be run in an inference engine and that can be deployed as is into society’s computing infrastructure. [See, e.g., Professor Monica Palmirani’s example of legal rules coded in the Legal Knowledge Interchange Format (LKIF) (at slides 48 through 50); please note that this is a 4.5M download.]

Some people have raised the objection that rule-based systems and their creators engage in the unauthorized practice of law by dispensing “legal advice.” I think this concern is overblown and founded upon a lack of understanding of how these systems work. Legal advice entails applying the law to the facts of a particular case or, conversely, interpreting facts in light of the applicable law. Rule-based systems don’t do that. Instead, they break up complicated legal provisions into atomic pieces and ask users to determine how each atom applies to them. Conceptually, it’s no different than reading a plain language description of legal rules and applying those rules to your own situation.

My goal in this post has been to introduce you to something that you may not have heard about and to convince you that it is a viable and worthwhile activity. Rule-based legal information systems have been around for a few decades, but we still have a long way to go until our rule-based legal modeling tools are as sophisticated as the Mathematica software is in the domain of mathematical computation. As we move in that direction, and as our legal knowledge engineering proficiency grows, we can advance toward the day when all people can take equal advantage of their legal rights. Knowing that they have them is the first step.

Michael Poulshock is a consultant specializing in legal knowledge engineering and a Fellow at Stanford University’s CodeX Center for Computers and Law. He is the creator of Jureeka.org and the Jureeka legal research browser add-on for Firefox and Chrome. He was previously a human rights lawyer.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

The Impact of Metadata Standards on Traditional Legal Online Services in Germany

Legal metadata, Public access to legal information, Standards

Jun 15, 2010

It is widely accepted within the information industry that revenue will shift from print to online (see Ulrich Hermann, CEO of Wolters Kluwer Germany, in FAZ on April 7th 2010, p. 15).

But what is the impact of current technical trends in metadata standards on legal online services in Germany? Is there any impact? If you take the real market penetration of metadata standards into account, one could say that the concept of metadata standards has failed. At least, standards such as the “Saarbrücker Standard” (a standard, created in 2000, for court decisions in German jurisdictions) have never been used widely. Nevertheless we kicked off jurMeta — a proposed new metadata standard for German-language legal resources — at the EDV-Gerichtstag 2009, a major German conference on judicial information systems.

Market trends in Germany

In order to understand the impact of metadata standards, we need to consider such standards in the context of legal online services in Germany, and current trends.

What are the market trends in online services in Germany? The content coverage is rapidly growing, particularly from an end-user’s perspective. State authorities and publishing houses are publishing primary content free of charge for various reasons; law firms are running blogs with comments on the latest court cases (JuraBlogs); lawyers and clients are using services that publish advice, even when related to concrete disputes (frag-einen-anwalt.de). European legislation (in particular the Public Sector Information Directive) is encouraging member states and public sector bodies to take proactive measures to promote reuse of public sector information in order to exploit its business potential.

At the same time the technical infrastructure for sharing information is getting better. Interested persons do not need any technical skills to instantly set up a vertical search engine with Google custom search. Open source tools even cover the high-end demands of private online publishers. For every aspect of a private publisher’s needs there is an open source tool or project, e.g. performance (CouchDB), machine learning (Mahout), or relevance ranking (Open Relevance Project). And in most cases it is not merely a solution but a high-end solution that can be adopted free of (licensing) cost.

Of course, social networks must be mentioned as a trend in law online. Services such as Marktplatz-Recht.de, JUSMEUM, and justanswer are creating communities of legal professionals. These services combine the benefits of online communities with expertise in the area of law and thus generate new revenue streams. Social networks organize value-adding processes efficiently since they allow one to exploit cognition [i.e., the people having the most knowledge] at the source. (I once called this the paradigm of the “cheapest value adder”; see Gütezeichen als Qualitätsaussage im digitalen Informationsmarkt: dargestellt am Beispiel elektronischer Rechtsdatenbanken (2000), p. 25.) Chances are high that vertical social networks will repeat the success of their generic predecessors, at least on a smaller scale.

Do legal online services exhibit these features only in Germany? The answer is No. These market trends are most likely valid throughout the world. I will come to the specifics of legal online services in Germany shortly.

First let me sum up what I consider to be the major driver in legal online services for the next couple of years. This is what I would like to label “Digital Convergence”: not the above-mentioned individual trends in the areas of content, technical infrastructure, and users/community, but a synergetic combination of these three trends that will drive the future of legal online services.

What are the characteristics of legal online services in Germany in 2010?

First, the fact that Germany is a code-based country makes legal online services in Germany an ideal target for any kind of invention in the area of text retrieval. For example, at the federal level there are more than 5,000 pieces of legislation with more than 100 provisions each. If you think of one specific term in paragraph 4 of § 97 GWB as a very precise reference to a specific legal issue, consider that

§ 97 GWB is one out of more than 500,000 provisions,
these provisions follow a semantic structure, and
nearly all existing legal documents refer to specific provisions at the European, federal, state, or municipal level.

You thus get a sense of the challenge of legal online services in Germany. But you also get a sense of their potential. One could put it this way: The semantic web of legal content in Germany already exists; we “only” have to apply a common syntactical representation in order to create a very rich business resource by linking millions of individual and heterogeneous documents. The projects that share a common syntactical representation (e.g., jurMeta) will be rich sources for additional value-adding processes such as data mining.
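A rough Python sketch shows what a "common syntactical representation" of provision references could buy. The regular expression, the dictionary layout, and the document identifiers below are assumptions for illustration; they are not the jurMeta format, and the pattern covers only simple citations of the form "§ 97 Abs. 4 GWB".

```python
import re
from collections import defaultdict

CITATION = re.compile(
    r"§\s*(?P<section>\d+[a-z]?)"          # section number, e.g. 97 or 312a
    r"(?:\s+Abs\.\s*(?P<paragraph>\d+))?"  # optional paragraph ("Absatz")
    r"\s+(?P<code>[A-Z][A-Za-z]+)"         # abbreviation of the code, e.g. GWB
)


def extract_citations(text):
    """Return each cited provision as a small structured record."""
    return [
        {"code": m["code"], "section": m["section"], "paragraph": m["paragraph"]}
        for m in CITATION.finditer(text)
    ]


def build_index(documents):
    """Map (code, section) to the set of documents that cite it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for citation in extract_citations(text):
            index[(citation["code"], citation["section"])].add(doc_id)
    return index


# Hypothetical documents from two unrelated sources.
documents = {
    "court-decision-1": "Der Anspruch folgt aus § 97 Abs. 4 GWB.",
    "blog-post-7": "Zur Auslegung von § 97 GWB und § 823 BGB.",
}
index = build_index(documents)
print(sorted(index[("GWB", "97")]))
```

Once heterogeneous documents expose their references in one shared form, linking a court decision to a blog post about the same provision becomes a dictionary lookup.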

Second, end users of legal documents are by nature very conscious of retrieval quality. Whether or not they are aware of the parameters of retrieval quality, such as scope, recall, and precision, end users (e.g., lawyers) know that the one make-or-break case could exist; therefore the goal of any information retrieval effort is to retrieve this one case but not others. Measuring the relevance of retrieval results is very difficult in general, but determining relevance in law is far easier than in other domains. The importance of retrieval quality to end users of legal information systems is of course not specific to Germany, but it is still an important driver for resource allocation in online retrieval projects. Therefore, legal research is the ideal test area for technical innovations in search and retrieval. Taking into account that lawyers are an attractive target group, investors will to a larger degree focus on legal research as a business opportunity. Such investment should speed progress.
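For readers who have not met the two measures, the standard definitions can be written as a short Python sketch: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four results returned, two decisions actually relevant.
precision, recall = precision_recall(
    retrieved=["case-a", "case-b", "case-c", "case-d"],
    relevant=["case-a", "case-e"],
)
print(precision, recall)
```

Here only one of four results is relevant (precision 0.25) and one of the two relevant cases was missed (recall 0.5). For the lawyer looking for the one make-or-break case, the missed case is what matters.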

Third, an important parameter for legal online services in Germany is the availability of primary legal content (legislation and case law). Public authorities in Germany claim copyright, at least to some extent, in official documents, and their efforts at publishing primary legal content online have been rudimentary. The existing offerings of legal content set up by state authorities are end-user oriented and very heterogeneous. The service providers (such as this firm) that technically publish the legal documents on behalf of the public authorities are private companies with business interests relating to legal information. Thus, allowing these companies to publish primary legal content on behalf of the public authorities is like letting the fox guard the henhouse. After all, official documents are the result of the tax-funded work of public authorities.

A modern publication infrastructure for primary legal content should functionally separate data collection from dissemination. Official documents should, if required, be anonymized, and stored on servers in a well-structured format for anybody to download either free-of-charge, or at cost of dissemination. This would allow non-commercial projects as well as commercial users to focus on value-adding processes, rather than crawling and re-engineering data that already exists as part of proprietary collections. In economic terms, this would lead to improved resource allocation, strengthen electronic media as tools for democratic processes, and support the goals of the Public Sector Information Directive.

Legal online research in Germany in 2020

In order to analyze the impact of metadata standards on legal publishers, I will be so bold as to predict how legal online services could look in 2020:

a) Public authorities will be obliged to publish any official document electronically in a well-structured format on servers accessible for anybody, for commercial or non-commercial use, at cost of dissemination.
b) There will be a wide range of legal online information services free of charge serving all sorts of information needs and target groups. These will range from easy-to-use systems for one-time users to expert systems for professionals.
c) The motivations for setting up legal online information services will vary to a larger extent than they do today. Successful online projects will have to support this growing range of motivations. (See Felix Zimmermann, JurMeta: A New Metadata Initiative for Legal Documents.) Since setting up legal online services will even be easier in 2020 than in 2010, more people will set up such services.
d) If one measures retrieval quality by recall and precision, I bet nearly all free services in 2020 will be better than the existing commercial legal online services in Germany in 2010. I invite anybody to come up with a proposal for a concrete experimental design with which to test this thesis, and I am curious to see the results of the experiment that proves this thesis wrong. But in any case:
e) The beneficiary of these trends will be the end user of legal information.

Impact on fee-based services and traditional publishing houses

Assuming that legal information services will radically improve in the future, is there a business opportunity for traditional publishing houses in legal online services at all? If the answer is yes, what will be the parameters for (business) success?

Applying a SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis will identify the threats and opportunities for publishing houses. An important threat is the increasing speed of activity and change in the market. The technical head start of open-source-based start-ups might be too large for traditional publishers to make up. An acquisition could help a traditional publisher catch up, but only if the acquirer is willing to adjust its processes and business principles. All traditional publishing houses I know claim to be customer oriented. I am convinced that market pressure (i.e., competition) will increase because of the numerous new legal online services being launched over the next couple of years. A substantial opportunity for traditional publishing houses lies in increasing the benefits that users derive from legal information, rather than trying to convert book selling to online media.

The biggest advantage of established legal publishing houses, as opposed to new start-ups, is their brand recognition, and the chance to position themselves as gatekeepers for high quality information. The corresponding threat is, again, based on new technologies: Services such as UserVoice allow new content providers to gain user feedback more effectively than through traditional market research, and at very low cost. Online social networks are analyzing the interactions among users, and are trying to exploit users’ behavior and ratings on the fly for business purposes. We will see whether traditional publishing houses will make timely use of such technologies, or whether such publishers will be able to catch up if they miss the chance to apply these new tools.

Conclusion

Assuming that, as mentioned above, a combination of trends will drive the future of legal online services, the publisher that identifies the synergies of the “digital convergence” and first applies them in customer products has the best chance to benefit from the current trends. Those that move too late might lose even more revenue than currently expected due to the change from print to online information.

So what about the impact of metadata standards such as jurMeta on legal online services in Germany? Metadata standards are an important component of the concept of the above-mentioned “digital convergence.” They are the syntactical bridge from the raw materials of German legal documents to the Semantic Web of legal information in Germany. Metadata standards for legal information services support decentralized value-adding processes, and thus are the key parameter for exploiting the synergies enabled by “digital convergence.” The metadata standards that best serve the various needs of the greatest number of legal online projects and services will succeed.

Dr. Andreas Bock is CEO of kjur.de. After writing his Ph.D. thesis on the quality of digital information services (Gütezeichen als Qualitätsaussage im digitalen Informationsmarkt: dargestellt am Beispiel elektronischer Rechtsdatenbanken (2000)) at the Institute for Legal Informatics of the University of Hannover with Prof. Dr. Wolfgang Kilian, he worked for Westlaw (2000 to 2003) and LexisNexis (2003 to 2007), heading product development of their respective international legal online projects in Germany. Dr. Bock currently practices law at Kehr-Ritz. Together with Felix Zimmermann, he launched kjur.de, an online service for free legal content, in 2009.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.
