
THE JUDICIAL CONTEXT: WHY INNOVATE?

The progressive deployment of information and communication technologies (ICT) in the courtroom (audio and video recording, document scanning, courtroom management systems), together with the requirement for paperless judicial folders pushed by e-justice plans (Council of the European Union, 2009), is quickly transforming the traditional judicial folder into an integrated multimedia folder in which documents, audio recordings, and video recordings can be accessed, usually via a Web-based platform. This trend is leading to a continuous increase in the number and volume of case-related digital judicial libraries, where the full content of each hearing is available for online consultation. A typical trial folder contains audio hearing recordings, audio/video hearing recordings, transcriptions of hearing recordings, hearing reports, and attached documents (scanned text documents, photos, evidence, etc.). The ICT container is typically a dedicated judicial content management system (a court management system), usually physically separate from and independent of the case management system used in the investigative phase, but interacting with it.

Most ICT deployment to date has focused on case management systems and ICT equipment in the courtrooms, with content management systems at different organisational levels (court or district). ICT deployment in the judiciary has reached different levels in the various EU countries, but the trend toward full e-justice is clearly in progress. Accessibility of judicial information, both case registries (more widely deployed) and case e-folders, has been strongly enhanced by state-of-the-art ICT. The usability of electronic judicial folders, however, is still constrained by a traditional support toolset: information search is limited to text search, transcription of audio recordings (indispensable for text search) is still a slow and fully manual process, template filling is a manual activity, and so on. Part of the information available in the trial folder is not yet directly usable and requires a time-consuming manual search. Information embedded in audio and video recordings, describing not only what was said in the courtroom but also the specific trial context and the way in which it was said, still needs to be exploited. While the information is there, information extraction and semantically empowered judicial information retrieval still await proper exploitation tools. The growing amount of digital judicial information calls for the development of novel knowledge management techniques and their integration into case and court management systems. In this challenging context, a novel case and court management system has recently been proposed.

The JUMAS project (JUdicial MAnagement by digital libraries Semantics) started in February 2008 with the support of the Polish and Italian Ministries of Justice. JUMAS seeks to make multimedia judicial folders more usable, through transcription, information extraction, and semantic search, in order to give users a powerful toolset for exploiting the knowledge embedded in the multimedia judicial folder.

The JUMAS project has several objectives:

  • (1) direct searching of audio and video sources without a verbatim transcription of the proceedings;
  • (2) exploitation of the hidden semantics in audiovisual digital libraries in order to facilitate search and retrieval, intelligent processing, and effective presentation of multimedia information;
  • (3) fusing information from multimodal sources in order to improve accuracy during the automatic transcription and the annotation phases;
  • (4) optimizing the document workflow to allow the analysis of (un)structured information for document search and evidence-based assessment; and
  • (5) supporting a large scale, scalable, and interoperable audio/video retrieval system.

JUMAS is currently under validation in the Court of Wroclaw (Poland) and in the Court of Naples (Italy).

THE DIMENSIONS OF THE PROBLEM

In order to explain the relevance of the JUMAS objectives, we report some volume data related to the judicial domain. Consider, for instance, the Italian context, where there are 167 courts, grouped into 29 districts, with about 1,400 courtrooms. In a law court of medium size (10 courtrooms), about 150 hearings per courtroom are held during a single legal year, with an average duration of four hours. Considering that in approximately 40% of them only audio is recorded, in 20% both audio and video, and in the remaining 40% nothing is recorded, the multimedia recording volume in question is about 2,400 hours of audio and 1,200 hours of audio/video per year. The sizing of the audio and audio/video documentation starts from the assumption that multimedia sources must be acquired at high quality in order to obtain good results in audio transcription and video annotation, which in turn affect the performance of the retrieval functionalities. Under these requirements, one arrives at a storage space of about 8.7 megabytes per minute (MB/min) for audio and 39 MB/min for audio/video. This means that during a legal year a court of medium size needs to allocate about 4 terabytes (TB) for audio and audio/video material. Under these hypotheses, the overall volume generated by all the courts in the justice system — for Italy alone — is about 800 TB per year. This shows how the justice sector is a major contributor to the data deluge (The Economist, 2010).
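The arithmetic behind these estimates is easy to reproduce. The short sketch below simply restates the figures given above (150 hearings per courtroom per year for a ten-courtroom court, four-hour hearings, 8.7 MB/min for audio, 39 MB/min for audio/video); it is only a back-of-the-envelope check, not part of the JUMAS system.

# Back-of-the-envelope storage estimate for a medium-sized court,
# using only the figures quoted in the text above.
HEARINGS_PER_COURTROOM = 150      # per legal year
COURTROOMS = 10
HOURS_PER_HEARING = 4
AUDIO_ONLY_SHARE = 0.40           # hearings with audio recording only
AUDIO_VIDEO_SHARE = 0.20          # hearings with audio and video
MB_PER_MIN_AUDIO = 8.7
MB_PER_MIN_AV = 39.0

hearings = HEARINGS_PER_COURTROOM * COURTROOMS
audio_hours = hearings * AUDIO_ONLY_SHARE * HOURS_PER_HEARING    # ~2,400 h
av_hours = hearings * AUDIO_VIDEO_SHARE * HOURS_PER_HEARING      # ~1,200 h

audio_tb = audio_hours * 60 * MB_PER_MIN_AUDIO / 1_000_000
av_tb = av_hours * 60 * MB_PER_MIN_AV / 1_000_000
print(f"{audio_hours:.0f} h audio, {av_hours:.0f} h audio/video")
print(f"about {audio_tb + av_tb:.1f} TB per legal year")         # roughly 4 TB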

In order to manage such quantities of complex data, JUMAS aims to:

  • Optimize the workflow of information through search, consultation, and archiving procedures;
  • Introduce a higher degree of knowledge through the aggregation of different heterogeneous sources;
  • Speed up and improve decision processes by enabling discovery and exploitation of knowledge embedded in multimedia documents, thereby reducing unnecessary costs;
  • Model audio-video proceedings in order to compare different instances; and
  • Allow traceability of proceedings during their evolution.

THE JUMAS SYSTEM

To achieve the above-mentioned goals, the JUMAS project has delivered the JUMAS system, whose main functionalities (depicted in Figure 1) are: automatic speech transcription, emotion recognition, human behavior annotation, scene analysis, multimedia summarization, template filling, and deception detection.

 

Figure 1: Overview of the JUMAS functionalities

The architecture of JUMAS, depicted in Figure 2, is based on a set of key components: a central database, a user interface on a Web portal, a set of media analysis modules, and an orchestration module that allows the coordination of all system functionalities.

Figure 2: Overview of the JUMAS architecture

The media stream recorded in the courtroom includes both audio and video that are analyzed to extract semantic information used to populate the multimedia object database. The outputs of these processes are annotations: i.e., tags attached to media streams and stored in the database (Oracle 11g). The integration among modules is performed through a workflow engine and a module called JEX (JUMAS EXchange library). While the workflow engine is a service application that manages all the modules for audio and video analysis, JEX provides a set of services to upload and retrieve annotations to and from the JUMAS database.
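The JEX interface itself is not documented here. Purely as an illustration of what such an annotation amounts to, one can picture it as a tag tied to a time interval of a media stream, along the following lines (all names and fields are hypothetical, not the actual JUMAS schema):

from dataclasses import dataclass

@dataclass
class Annotation:
    """Hypothetical shape of a tag attached to a courtroom media stream."""
    media_id: str       # identifier of the audio/video object in the database
    start_s: float      # start of the annotated segment, in seconds
    end_s: float        # end of the annotated segment, in seconds
    producer: str       # analysis module that produced it, e.g. "asr" or "emotion"
    label: str          # the tag itself, e.g. a transcribed word or "stress"
    confidence: float   # score assigned by the module

# Example: an emotion tag covering ten seconds of a hearing recording.
tag = Annotation("hearing-2009-03-12-a.wav", 842.0, 852.0, "emotion", "stress", 0.71)
print(tag)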

JUMAS: THE ICT COMPONENTS

KNOWLEDGE EXTRACTION

Automatic Speech Transcription. For courtroom users, the primary sources of information are audio recordings of hearings and proceedings. In light of this, JUMAS provides an automatic speech recognition (ASR) system (Falavigna et al., 2009; Rybach et al., 2009) trained on real judicial data coming from courtrooms. Two ASR systems have been developed so far: one provided by Fondazione Bruno Kessler for the Italian language, and one delivered by RWTH Aachen University for the Polish language. The ASR modules in the JUMAS system currently reach 61% accuracy on the automatic transcriptions they generate, and they represent the first contribution for populating the digital libraries with judicial trial information. The resulting transcriptions are the main information resource that is then enriched by the other modules and consulted by end users through the information retrieval system.

Emotion Recognition. Emotional states represent an aspect of knowledge embedded in courtroom media streams that may be used to enrich the content available in multimedia digital libraries. Enabling end users to consult transcriptions together with the associated semantics is an important achievement: it allows them to retrieve an enriched written sentence instead of a “flat” one. Even though there is an open ethical discussion about the use of this kind of information, it radically changes the consultation process, since sentences can assume different meanings according to the affective state of the speaker. To this end, an emotion recognition module (Archetti et al., 2008), developed by Consorzio Milano Ricerche jointly with the University of Milano-Bicocca, is part of the JUMAS system. A set of real-world human emotions obtained from courtroom audio recordings has been gathered to train the underlying supervised learning model.

Human Behavior Annotation. A further fundamental information resource is the video stream. In addition to the identification of emotional states, the recognition of relevant events that characterize judicial proceedings can be valuable for end users. Relevant events occurring during proceedings trigger meaningful gestures, which emphasize and anchor the words of witnesses and signal that an important concept has been explained. For this reason, human behavior recognition modules (Briassouli et al., 2009; Kovacs et al., 2009), developed by CERTH-ITI and by the MTA SZTAKI Research Institute, have been included in the JUMAS system. The video analysis captures relevant events that occur during the course of a trial in order to create semantic annotations that can be retrieved by judicial end users. The annotations mainly concern events related to the witness: change of posture, change of witness, hand gestures, and gestures indicating conflict or disagreement.

Deception Detection. Discriminating between truthful and deceptive assertions is one of the most important activities performed by judges, lawyers, and prosecutors. To support their reasoning activities, whether corroborating or contradicting declarations (in the case of lawyers and prosecutors) or judging the accused (in the case of judges), a deception recognition module has been developed as a support tool. The deception detection module, developed by the Heidelberg Institute for Theoretical Studies, is based on the automatic classification of sentences produced by the ASR systems (Ganter and Strube, 2009). In particular, in order to train the deception detection module, the output of the ASR module was manually annotated with the help of the minutes of the transcribed sessions. The knowledge extracted for training the classification module deals with lies, contradictory statements, quotations, and expressions of vagueness.

Information Extraction. The amount of unstructured textual data available in the judicial domain, especially transcriptions of proceedings, highlights the need to automatically extract structured data from unstructured material in order to facilitate efficient consultation. To address the problem of structuring data coming from the automatic speech transcription system, Consorzio Milano Ricerche has defined an environment that combines regular expressions, probabilistic models, and background information available in each court database system. Thanks to this functionality, judicial actors can view each hearing as a structured summary, in which the main information extracted consists of the names of the judge, lawyers, defendant, victim, and witnesses; the names of the subjects cited during a deposition; the dates cited during a deposition; and data about the verdict.
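The extraction environment itself is not publicly documented. The minimal sketch below, with invented patterns and an invented transcript fragment, only illustrates the regular-expression layer of such an approach; the probabilistic models and court registry data mentioned above are left out.

import re

# Toy rules for pulling role/name pairs out of a transcription.
PATTERNS = {
    "judge":     re.compile(r"presiding judge\s+([A-Z][a-z]+ [A-Z][a-z]+)"),
    "defendant": re.compile(r"defendant\s+([A-Z][a-z]+ [A-Z][a-z]+)"),
    "witness":   re.compile(r"witness\s+([A-Z][a-z]+ [A-Z][a-z]+)"),
}

def extract_roles(transcript):
    found = {}
    for role, pattern in PATTERNS.items():
        match = pattern.search(transcript)
        if match:
            found[role] = match.group(1)
    return found

text = ("Hearing of 12 March 2009, presiding judge Maria Rossi. "
        "The defendant Paolo Bianchi is present; witness Anna Verdi is called.")
print(extract_roles(text))
# {'judge': 'Maria Rossi', 'defendant': 'Paolo Bianchi', 'witness': 'Anna Verdi'}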

KNOWLEDGE MANAGEMENT

Information Retrieval. Currently, to retrieve audio/video materials acquired during a trial, the end user must manually consult all of the multimedia tracks. Identifying a particular position or segment of a multimedia stream, in order to view or listen to specific declarations, is possible only by remembering the time stamp at which the events occurred or by watching or listening to the whole recording. The amalgamation of automatic transcriptions, semantic annotations, and ontology representations allows us to build a flexible retrieval environment based not only on simple textual queries but also on broad and complex concepts. In order to define an integrated platform for cross-modal access to audio and video recordings and their automatic transcriptions, a retrieval module able to perform semantic multimedia indexing and retrieval has been developed by the Information Retrieval group at MTA SZTAKI (Daróczy et al., 2009).

Ontology as Support to Information Retrieval. An ontology is a formal representation of the knowledge that characterizes a given domain, expressed through a set of concepts and the relationships that hold among them. In the judicial domain, an ontology is a key element in supporting the retrieval process performed by end users, because text-based retrieval alone is not sufficient for finding and consulting transcriptions (and other documents) related to a given trial. A first contribution of the ontology component developed by the University of Milano-Bicocca (CSAI Research Center) for the JUMAS system is query expansion. Query expansion extends the original query specified by end users with additional related terms, and the whole set of keywords is then automatically submitted to the retrieval engine. The main objective is to narrow the search focus or to increase recall.
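As a minimal sketch of the idea, with a hand-written concept map standing in for the ontology (the terms are purely illustrative and not taken from the JUMAS ontology):

# Query expansion in miniature: related concepts are added to the user's
# keywords before the whole set is submitted to the retrieval engine.
RELATED_TERMS = {
    "witness": ["testimony", "deposition", "declaration"],
    "weapon":  ["knife", "firearm", "gun"],
}

def expand_query(query):
    terms = set(query.lower().split())
    for term in list(terms):
        terms.update(RELATED_TERMS.get(term, []))
    return terms

print(expand_query("witness weapon"))
# -> the two original keywords plus their six related terms (set order varies)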

User Generated Semantic Annotations. Judicial users usually manually tag some documents for purposes of highlighting (and then remembering) significant portions of the proceedings. An important functionality, developed by the European Media Laboratory and offered by the JUMAS system, relates to the possibility of digitally annotating relevant arguments discussed during a proceeding. In this context, the user-generated annotations may aid judicial users in future retrieval and reasoning processes. The user-generated annotations module included in the JUMAS system allows end users to assign free tags to multimedia content in order to organize the trials according to their personal preferences. It also enables judges, prosecutors, lawyers, and court clerks to work collaboratively on a trial; e.g., a prosecutor who is taking over a trial can build on the notes of his or her predecessor.

KNOWLEDGE VISUALIZATION

Hyper Proceeding Views. The user interface of JUMAS, developed by ESA Projekt and Consorzio Milano Ricerche, is a Web portal in which the contents of the database are presented in different views. The basic view allows browsing of the trial archive, as in a typical court management system, to see general information (dates of hearings, names of the people involved) and the documents attached to each trial. JUMAS's distinguishing features include the automatic creation of a summary of the trial, the presentation of user-generated annotations, and the Hyper Proceeding View: an advanced presentation of media contents and annotations that allows the user to perform queries on the contents and jump directly to relevant parts of the media files.

 

Multimedia Summarization. Digital videos represent a fundamental information resource about the events that occur during a trial: such videos can be stored, organized, and retrieved in a short time and at low cost. However, considering the dimensions that a video resource can assume during the recording of a trial, judicial actors have specified several requirements for digital trial videos: fast navigation of the stream, efficient access to data within the stream, and effective representation of relevant contents. One possible solution to these requirements lies in multimedia summarization, which derives a synthetic representation of audio/video contents with a minimal loss of meaningful information. In order to address the problem of defining a short and meaningful representation of a proceeding, a multimedia summarization environment based on an unsupervised learning approach has been developed (Fersini et al., 2010) by Consorzio Milano Ricerche jointly with University of Milano-Bicocca.

CONCLUSION

The JUMAS project demonstrates the feasibility of enriching a court management system with an advanced toolset for extracting and using the knowledge embedded in a multimedia judicial folder. Automatic transcription, template filling, and semantic enrichment help judicial actors not only to save time, but also to enhance the quality of their judicial decisions and performance. These improvements are mainly due to the ability to search not only text, but also events that occur in the courtroom. The initial results of the JUMAS project indicate that automatic transcription and audio/video annotations can provide additional information in an affordable way.

Elisabetta Fersini has a post-doctoral research fellow position at the University of Milano-Bicocca. She received her PhD with a thesis on “Probabilistic Classification and Clustering using Relational Models.” Her research interest is mainly focused on (Relational) Machine Learning in several domains, including Justice, Web, Multimedia, and Bioinformatics.

VoxPopuLII is edited by Judith Pratt.

Editor-in-Chief is Robert Richards, to whom queries should be directed.

The Problem: URLs and Internal Links for Legislative Documents

Legislative documents reside at various government Websites in various formats (TXT, HTML, XML, PDF, WordPerfect). URLs for these documents are often too long or difficult to construct. For example, here is the URL for the HTML format version of bill H.R. 3200 of the 111th U.S. Congress:

http://www.gpo.gov/fdsys/pkg/BILLS-111hr3200IH/html/BILLS-111hr3200IH.htm

More importantly, “deep” links to internal locations (often called “subdivisions” or “segments”) within a legislative document (the citations within the law, such as section 246 of bill H.R. 3200) are often not supported, or are non-intuitive for users to create or use. For most legislative Websites, users must click through or fill out forms and then scroll or search for the specific location in the text of legislation. This makes it difficult if not impossible to create and share links to official citations. Enabling internal links to subdivisions of legislative documents is crucial, because in most situations, users of legal information need access only to a subdivision of a legal document, not to the entire document.

A Solution: LegisLink

LegisLink.org is a URL Redirection Service with the goal of enabling Internet access to legislative material using citation-based URLs rather than requiring users to repeatedly click and scroll through documents to arrive at a destination.  Let’s say you’re reading an article at CNN.com and the article references section 246 in H.R. 3200.  If you want to read the section, you can search for H.R. 3200 and more than likely you will find the bill and then scroll to find the desired section.  On the other hand, you can use something like LegisLink by typing the correct URL.  For example: http://legislink.org/us/hr-3200-ih-246.
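LegisLink itself is written in Perl (see the Next Steps section below); purely as an illustration of the redirection idea, a hypothetical resolver might map a citation-style path onto the GPO URL pattern shown earlier:

import re

# Illustrative only: not the actual LegisLink code.
GPO_HTML = ("http://www.gpo.gov/fdsys/pkg/BILLS-{congress}{bill}{version}"
            "/html/BILLS-{congress}{bill}{version}.htm")

def resolve(path, congress=111):
    # e.g. "us/hr-3200-ih-246" -> GPO URL for hr3200 (IH version), section 246
    m = re.fullmatch(r"us/([a-z]+)-(\d+)-([a-z]+)(?:-(\d+))?", path)
    if not m:
        raise ValueError(f"unrecognised citation: {path}")
    chamber, number, version, section = m.groups()
    url = GPO_HTML.format(congress=congress, bill=f"{chamber}{number}",
                          version=version.upper())
    return url, section   # the section locates an anchor within the document

print(resolve("us/hr-3200-ih-246"))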

LegisLink Screen Shot

 

Benefits

There are several advantages of having a Web service that resolves legislative and legal citations.

(1)   LegisLink provides links to citations that are otherwise not easy for users to create.  In order to create a hyperlink to a location in an HTML or XML file, the publisher must include unique anchor or id attributes within their files.  Even if these attributes are included, they are often not exposed as links for Internet users to re-use.   On the other hand, Web-based software can easily scan a file’s text to find a requested citation and then redirect the user to the requested location.  For PDF files, it is possible to create hyperlinks to specific pages and locations when using the Acrobat plug-in from Adobe.  In these cases, hyperlinks can direct the user to the document location at the official Website.

For example, here is the LegisLink URL that links directly to section 246 within the PDF version of H.R. 3200: http://legislink.org/us/hr-3200-ih-246-pdf

In cases where governments have not included ids in HTML, XML or TXT files, LegisLink can replicate a government document on the LegisLink site, insert an anchor, and then redirect the user to the requested location.
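As a rough sketch of that replicate-and-insert step, assuming (hypothetically) that section headings look like “SEC. 246.”; a real service would need rules per document type and jurisdiction:

import re

def add_section_anchor(html, section):
    # Insert an <a name=...> anchor immediately before the requested heading.
    heading = re.compile(rf"SEC\.\s*{section}\.")
    return heading.sub(lambda m: f'<a name="SEC-{section}"></a>{m.group(0)}',
                       html, count=1)

page = "<html><body> ... SEC. 246. LIMITATION ON ... </body></html>"
print(add_section_anchor(page, "246"))
# ... <a name="SEC-246"></a>SEC. 246. LIMITATION ON ...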

(2)   LegisLink makes it easy to get to a specific location in a document, which saves time.  Law students and presumably all law professionals are relying on online resources to a greater extent than ever before.  In 2004, Stanford Law School published the results of their survey that found that 93% of first year law students used online resources for legal research at least 80% of the time.

(3)   Creating and maintaining a .org site that acts as an umbrella for all jurisdictions makes it easier to locate documents and citations, especially when they have been issued by a jurisdiction with which one is unfamiliar.  Legislation and other legal documents tend to reside at multiple Websites within a jurisdiction.  For example, while U.S. federal legislation (i.e., bills and slip laws) is stored at thomas.loc.gov (HTML and XML) and gpo.gov (at FDsys and GPO Access) (TXT and PDF), the United States Code is available at uscode.house.gov and at gpo.gov (FDsys and GPO Access), while roll call votes are at clerk.house.gov and www.senate.gov.  Governments tend to compartmentalize activities, and their Websites reflect much of that compartmentalization.  LegisLink.org or something like it could, at a minimum, provide a resource that helps casual and new users find where official documents are stored at various locations or among various jurisdictions.

(4)   LegisLinks won’t break over time.  Governments sometimes change the URL locations for their documents, which breaks previously relied-upon URLs (a result sometimes called “link rot”).  A URL Redirection Service lessens these annoyances because the syntax for the LegisLink-type service remains the same: to “fix” the broken links, the LegisLink software is simply updated to point to the government’s new URLs, so previously published LegisLinks continue to work.

(5)   A LegisLink-type service does not require governments to expend resources.  The goal of LegisLink is to point to government or government-designated resources.  If those resources contain anchors or id attributes, they can be used to link to the official government site.  If the documents are in PDF (non-scanned), they can also be used to link to the official government site.  In other cases, the files can be replicated temporarily and slightly manipulated (e.g., the tag <a name=SEC-#> can be added at the appropriate location) in order to achieve the desired results.

Alternatives

While some Websites have implemented Permalinks and handle systems (e.g., the Library of Congress’s THOMAS system), these systems tend to link users to the document level only. They also generally only work within a single Internet domain, and casual users tend not to be aware of their existence.

Other technologies at the forefront of this space include recent efforts to create a URN-based syntax for legal documents (URN:LEX). To quote from the draft specification, “In an on-line environment with resources distributed among different Web publishers, uniform resource names allow simplified global interconnection of legal documents by means of automated hypertext linking.”

The syntax for URN:LEX is a bit lengthy, but because of its specificity, it needs to be included in any universal legal citation redirection service. The inclusion of URN:LEX syntax does not, however, eliminate the need for additional, simpler syntaxes.  This distinction is important for users who just want to quickly access a particular legislative document, such as a bill that is mentioned in a news article.  For example, if LegisLink were widely adopted, users would come to know that the URL http://legislink.org/us/hr-3200 links to the current Congress’s H.R. 3200; the LegisLink URL is therefore readily usable by humans. And use of LegisLink for a particular piece of legislation is to some extent consistent with the use of URN:LEX for the same legislation: for example, a URN:LEX-based address such as http://legislink.org/urn:lex/us/federal:legislation:2009;111.hr.3200@official;thomas.loc.gov:en$text-html could also lead to the current Congress’s H.R. 3200. A LegisLink-type service can include the URN:LEX syntax, but the URN:LEX syntax cannot subsume the simplified syntax being proposed for LegisLink.org.

Citability.org, another effort to address these issues, calls for the replication of all government documents for point-in-time access. In addition, Citability.org envisions including date and time information as part of the URL syntax in order to provide access to the citable content that was available at the specified date and time. LegisLink has more modest goals: it focuses on linking to currently provided government documents and locations within those documents. Since legislation is typically stored as separate, unrevisable documents for a given legislative term (lasting two years in many U.S. jurisdictions), date and time information is redundant with legislative session information.

The primary goal of a legislative URL Redirection Service such as LegisLink.org is to expedite the delivery of needed information to the Internet user. In addition, the LegisLink tools used to link to legislative citations in one jurisdiction can be re-used for other jurisdictions; this reduces developers’ labor as more jurisdictions are added.

Next Steps

The LegisLink.org site is organized by jurisdiction: each jurisdiction has its own script, and all scripts can re-use common functions. The prototype is currently being built to handle the United States (us), Colorado (us-co), and New Zealand (nz). The LegisLink source code is available as text files at http://legislink.org/code.html.

The challenges of a service like LegisLink.org are: (1) determining whether the legal community is interested in this sort of solution, (2) finding legislative experts to define the needed syntax and results for jurisdictions of interest, and (3) finding software developers interested in helping to work on the project.

This project cannot be accomplished by one or two people. Your help is needed, whether you are an interested user or a software developer. At this point, the code for LegisLink is written in Perl. Please join the LegisLink wiki site at http://legislink.wikispaces.org to add your ideas, to discuss related information, or just to stay informed about what’s going on with LegisLink.

Joe Carmel is a part-time consultant and software developer hobbyist. He was previously Chief of the Legislative Computer Systems at the U.S. House of Representatives (2001-2005) and spearheaded the use of XML for the drafting of legislation, the publication of roll call votes, and the creation and maintenance of the U.S. Congressional Biographical Directory.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these is a large, complex, and heterogeneous topic of research; in this short post we can only hope to touch on a fragment, and one heavily biased toward our own interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appears to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet there is no similarly widespread system for converting the characters into higher levels of structure that correlate with our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels, where ontologies and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT, so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.
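As a small, concrete illustration of the query side, the standard-library sketch below reads a lightly simplified version of the fragment above and answers the attorney question; a real Semantic Web setting would use a proper query or rule language such as those just mentioned rather than ad hoc code.

import xml.etree.ElementTree as ET

doc = """<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname> <lastname>Hill</lastname></fullname>
    </plaintiff>
    <attorney>
      <fullname><firstname>Jane</firstname> <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>"""

root = ET.fromstring(doc)

def full_name(role):
    # Collect the <firstname> and <lastname> text under the given role tag.
    element = root.find(f".//{role}/fullname")
    return " ".join(part.text for part in element)

print(full_name("attorney"))    # -> Jane Smith
# A rule like "an attorney for a client has a privileged relationship with that
# client" could now be applied to the attorney/plaintiff pair extracted here.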

Though it may seem like too much technology for such a small and obvious task, it is essential when we scale up our queries and inferences on large corpora of legal texts — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff in every legal case. Where our tagging structure is very rich, our queries and inferences can also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at some time in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com .

When we search for information, a range of linguistic structures or relationships may be relevant to our query. People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. In addition, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it reaches ones it can begin to address. Having found solutions to the smaller problems, NLP can then address other problems or problems of larger scope. Some of the subtopics in NLP are:

  • Generation – converting information in a database into natural language.
  • Understanding – converting natural language into a machine-readable form.
  • Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
  • Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
  • Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
  • Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There is a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge on the part of the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine-learning approaches apply learning algorithms to annotated material in order to extend the results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are something of a black box, in that we cannot really inspect the rules that are learned or reuse them elsewhere.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

  • Sentence segmentation – dividing text into sentences.
  • Tokenisation – words identified by spaces between them.
  • Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
  • Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
  • Named entity recognition – the entities in the text such as organisations, people, and places.
  • Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

  • Case citation.
  • Names of parties.
  • Roles of parties, meaning plaintiff or defendant.
  • Type of court.
  • Names of judges.
  • Names of attorneys.
  • Roles of attorneys, meaning the side they represent.
  • Final decision.
  • Cases cited.
  • Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above:

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.
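GATE's own rules are written in its JAPE language over gazetteer lists; the toy sketch below, with invented lists and patterns, is not GATE code but conveys the list-plus-rule flavour of this kind of annotation.

import re

COURTS = ["Crown Court", "Court of Appeal", "High Court"]   # gazetteer list
ROLE_RULE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) \((?P<role>plaintiff|defendant)\)")

def annotate(text):
    annotations = []
    for court in COURTS:                      # list lookup
        for m in re.finditer(re.escape(court), text):
            annotations.append(("court", court, m.span()))
    for m in ROLE_RULE.finditer(text):        # pattern rule: "Name (role)"
        annotations.append((m.group("role"), m.group("name"), m.span()))
    return annotations

header = "Before the Crown Court. John Doe (defendant) appeals; Ann Lee (plaintiff) responds."
for annotation in annotate(header):
    print(annotation)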

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

The organization and formalization of legal information for computer processing, in order to support decision-making or to enhance information search, retrieval, and knowledge management, is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, despite the first ideas about computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 1980s, have not yet been left behind.

Similar claims may be found today. On the one hand, the amount of unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. On the other hand, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information and knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.), has renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

Nowadays, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EUROVOC), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”


Thus, the Semantic Web (including Linked Data efforts or the Web of Data) is envisaged as an extension of the current Web, which now also comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

From Web 2.0 to Web 3.0

To support that shift, new languages and tools (ontologies) were needed to allow semantics to be added to the current Web, since the development of the Semantic Web is based on the formal representation of meaning, so that computers can share some of the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, in which concepts formalized as classes (e.g., “Actor”) are defined with axioms, enriched with descriptions of attributes or constraints (for example, “cardinality”), and linked to other classes through properties (e.g., “possesses” or “is_possessed_by”).

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies are arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake), in the sense that higher layers depend on lower layers (and the lowest are inherited from the original Web). The languages include XML (eXtensible Markup Language), usually used to add structure to documents, and the so-called ontology languages: RDF (Resource Description Framework), OWL, and OWL2 (Ontology Web Language). Recently, a specification to support the conversion of existing thesauri, taxonomies, or subject headings into RDF has been released (SKOS, the Simple Knowledge Organization System standard).
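As a small illustration of what such a formalization looks like, the sketch below uses the Python rdflib library (a third-party package) to state a tiny class-and-property structure in RDF; the “Actor” and “possesses” names merely echo the example above and are not drawn from any published legal ontology.

from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/legal#")   # hypothetical vocabulary
g = Graph()

# A minimal class hierarchy with one property, in the style described above.
g.add((EX.Actor, RDF.type, RDFS.Class))
g.add((EX.LegalPerson, RDF.type, RDFS.Class))
g.add((EX.LegalPerson, RDFS.subClassOf, EX.Actor))
g.add((EX.possesses, RDF.type, RDF.Property))
g.add((EX.possesses, RDFS.domain, EX.Actor))
g.add((EX.possesses, RDFS.label, Literal("possesses")))

print(g.serialize(format="turtle"))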

Although there are different views in the literature regarding the scope of the definition or the main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts (the basis for the LKIF-Core Ontology). Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, cases, judicial proceedings, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of legal text mining and statistical analysis, in which ontologies are built by means of machine learning from legal texts; while others concentrate on the analysis of legal theories and related materials. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology validation.

In this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judiciary to enhance the search and retrieval capabilities of a Web-based frequently-asked-questions system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Current research is focused on overcoming these problems through the establishment of gold standards in concept extraction and ontology learning from texts, and the idea of collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently an important topic of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, more particularized discussions are taking place in the legal domain: for example, the discussion of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and of the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, and the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor’s Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use everyday, so it’s in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Dr. Núria Casellas is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School. She has participated in several national and European-funded research projects regarding the acquisition of knowledge in judicial settings (IURISERVICE), improving access to multimedia judicial content (E-Sentencias), Drafting Legislation with Ontology-Based Support (DALOS), and the Legal Case Study of the Semantically Enabled Knowledge Technologies project (SEKT, VI Framework), among others. Her lines of investigation include: legal knowledge representation, legal ontologies, artificial intelligence and law, the legal Semantic Web, law and technology, and bioethics.
She holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD in Public Law and Legal Philosophy (UAB). Her PhD thesis is entitled “Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

In an extraordinary story, Jorge Luis Borges writes of a “Total Library”, organized into ‘hexagons’ that supposedly contained all books:

When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. . . . At that time a great deal was said about the Vindications: books of apology and prophecy which . . . [contained] prodigious arcana for [the] future. Thousands of the greedy abandoned their sweet native hexagons and rushed up the stairways, urged on by the vain intention of finding their Vindication. These pilgrims disputed in the narrow corridors . . . strangled each other on the divine stairways . . . . Others went mad. . . . The Vindications exist . . . but the searchers did not remember that the possibility of a man’s finding his Vindication, or some treacherous variation thereof, can be computed as zero.  As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible, seemed almost intolerable.

About three years ago I spent almost an entire sleepless month coding OpenJudis – my rather cool, “first-of-its-kind” free online database of Indian Supreme Court cases. The database hosts the full texts of about 25,000 cases decided since 1950. In this post I embark on a somewhat personal reflection on the process of creating OpenJudis – what I learnt about access to law (in India), and about “legal informatics,” along with some meditations on future pathways.

Having, by now, attended my share of FLOSS events, I know it is the invariable tendency of anyone who’s written two lines of free code to consider themselves qualified to pronounce on lofty themes – the nature of freedom and liberty, the commodity, scarcity, etc. With OpenJudis, likewise, I feel like I’ve acquired the necessary license to inflict my theory of the world on hapless readers – such as those at VoxPopuLII!

I begin this post by describing the circumstances under which I began coding OpenJudis. This is followed by some of my reflections on how “legal informatics” relates to and could relate to law.

Online Access to Law in India
India is privileged to have quite a robust ICT architecture. Internet access is relatively inexpensive, and the ubiquity of “cyber cafes” has resulted in extensive Internet penetration, even in the absence of individual subscriptions.

Government bodies at all levels are statutorily obliged to publish, on the Internet, vital information regarding their structure and functioning. The National Informatics Centre (NIC), a public sector corporation, is responsible for hosting, maintaining and updating the websites of government bodies across the country. These include, inter alia, the websites of the Union (federal) Government, the various state governments, union and state ministries, constitutional bodies such as the Election Commission and the Planning Commission, and regulatory bodies such as the Securities and Exchange Board of India (SEBI). These websites typically host a wealth of useful information including, illustratively, the full texts of applicable legislation, subordinate legislation, administrative rulings, reports, census data, application forms, etc.

The NIC has also been commissioned by the judiciary to develop websites for courts at various levels and to publish decisions online. As a result, beginning in around the year 2000, the Supreme Court and various high courts have been publishing their decisions on their websites. The full texts of all Supreme Court decisions rendered since 1950 have been made available, which is an invaluable free resource for the public. Most High Court websites, however, have not yet made archival material available online, so at present access remains limited to decisions from the year 2000 onwards. More recently, the NIC has begun setting up websites for subordinate courts, although this process is still at a very embryonic stage.

Apart from free government websites, a handful of commercial enterprises have been providing online access to legal materials. Among them, two deserve special mention. SCCOnline – a product of one of the leading law report publishers in India – provides access to the full texts of decisions of the Indian Supreme Court. The CD version of SCCOnline sells for about INR 70,000 (about US$1,500), which is around the same price the company charges for a full set of print volumes of its reporter. For an additional charge, the company offers updates to the database. The other major commercial venture in the field is Manupatra, which offers access to the full text of decisions of various courts and tribunals as well as the texts of legislation. Access is provided for a basic charge of about US$100, plus a charge of about US$1 per document downloaded. While seemingly modest by international standards, these charges are unaffordable by large sections of the legal profession and the lay public.

OpenJudis
In December 2006, I began coding OpenJudis. My reasons were purely selfish. While the full texts of the decisions of the Supreme Court were already available online for free, the search engine on the government website was unreliable and inadequate for (my) advanced research needs. The formatting of the text of cases themselves was untidy, and it was cumbersome to extract passages from them. Frequently, the website appeared overloaded with users, and alternate free sources were unavailable. I couldn’t afford any of the commercial databases. My own private dissatisfaction with the quality of service, coupled with (in retrospect) my completely naive optimism, led me to attempt OpenJudis. A third crucial factor on the input side was time, and a “room of my own,” which I could afford only because of a generous fellowship I had from the Open Society Institute.

I began rashly, by serially downloading the full texts of the 25,000 decisions on the Supreme Court website. Once that was done (it took about a week), I really had no notion of how to proceed. I remember being quite exhilarated by the sheer fact of being in possession of twenty five thousand Supreme Court decisions. I don’t think I can articulate the feeling very well. (I have some hope, however, that readers of this blog and my fellow LII-ers will intuitively understand this feeling.) Here I was, an average Joe poking around on the Internet, and just-like-that I now had an archive of 25,000 key documents of our republic, cumulatively representing the articulations of some of the finest (and some not-so-fine) legal minds of the previous half-century, sitting on my laptop. And I could do anything with them.

The word “archive,” incidentally, as Derrida informs us, derives from the Greek arkheion, the residence of the superior magistrates, the archons – those who commanded. The archons both “held and signified political power,” and were considered to possess the right to both “make and represent the law.” “Entrusted to such archons, these documents in effect speak the law: they recall the law and call on or impose the law”. Surely, or I am much mistaken, a very significant transformation has occurred when ordinary citizens become capable of housing archives – when citizens can assume the role of archons at will.

Giddy with power, I had an immediate impulse to find a way to transmit this feeling, to make it portable, to dissipate it – an impulse that will forever mystify economists wedded to “rational” incentive-based models of human behavior. I wasn’t a computer engineer, I didn’t have the foggiest idea how I’d go about it, but I was somehow going to host my own online free database of Indian Supreme Court cases. The audacity of this optimism bears out one of Yochai Benkler‘s insights about the changes wrought by the new “networked information economy” we inhabit. According to Benkler,

The belief that it is possible to make something valuable happen in the world, and the practice of actually acting on that belief, represent a qualitative improvement in the condition of individual freedom [because of NIE]. They mark the emergence of new practices of self-directed agency as a lived experience, going beyond mere formal permissibility and theoretical possibility.

Without my intending it, the archive itself suggested my next task. I had to clean up the text and extract metadata. This process occupied me for the longest time during the development of OpenJudis. I was very new to programming and had only just discovered the joys of Regular Expressions. More than my inexperience with programming techniques, however, it was the utter heterogeneity of reporting styles that took me a while to accustom myself to. Both opinion-writing and reporting styles had changed dramatically in the course of the fifty years my database covered, and this made it difficult to find patterns when extracting, say, the names of judges involved. Eventually, I had cleaned up the texts of the decisions and extracted an impressive (I thought) set of metadata, including the names of parties, the names of the judges, and the date the case was decided. To compensate for the absence of headnotes, I extracted the names of statutes cited in the cases as a rough indicator of what each case might relate to. I did all this programming in PHP with the data housed in a MySQL database.
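
To give a flavour of that extraction step, here is a minimal sketch, in Python rather than my original PHP, and with a deliberately simplified, invented header format (the function name, sample text, and patterns are illustrative only; the real judgments needed many pattern variants per field):

```python
import re

# Deliberately simplified, invented header; real judgment texts were far messier.
sample = """PETITIONER: A. EXAMPLE TRADING CO.
RESPONDENT: STATE OF EXAMPLE
DATE OF JUDGMENT: 25/01/1965
BENCH: FIRST J. (CJ), SECOND J., THIRD J.
"""

def extract_metadata(text):
    """Pull party names, the decision date and judge names out of a raw judgment."""
    def first(pattern):
        m = re.search(pattern, text, re.IGNORECASE)
        return m.group(1).strip() if m else None

    return {
        "petitioner": first(r"PETITIONER:\s*(.+)"),
        "respondent": first(r"RESPONDENT:\s*(.+)"),
        "decided_on": first(r"DATE OF JUDGMENT:\s*([\d/]+)"),
        # Judges appear as a comma-separated list after BENCH:
        "judges": [j.strip() for j in (first(r"BENCH:\s*(.+)") or "").split(",") if j.strip()],
    }

print(extract_metadata(sample))
```

Multiply that by fifty years of shifting reporting styles and you have a fair picture of how I spent those months.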

And then I encountered my first major roadblock that threatened to jeopardize the whole operation: I ran my first full-text Boolean search on the MySQL database and the results took a staggering 20 minutes to display. I was devastated! More elaborate searches took longer. Clearly, this was not a model I could host online. Or do anything useful with. Nobody in their right mind would want to wait 20 minutes for the results of their search. I had to look for a quicker database, or, as I eventually discovered, a super fast, lightweight indexing search engine. After a number of failed attempts with numerous free search engine software programs, none of which offered either the desired speed or the search capability I wanted, I was getting quite desperate. Fortunately, I discovered Swish-e, a lightweight, Perl-based Boolean search engine which was extremely fast and, most importantly, free – exactly what I needed. The final stage of creating the interface, uploading the database, and activating the search engine happened very quickly, and sometime in the early hours of December 22nd, 2006, OpenJudis went live. I sent announcement emails out to several e-groups and waited for the millions to show up at my doorstep.

They never did. After a week, I had maybe a hundred users. In a month, a few hundred. I received some very complimentary emails, which was nice, but it didn’t compensate for the failure of “millions” to show up. Over the next year, I added some improvements:
1) First, I built an automatic update feature that would periodically check the Supreme Court website for new cases and update the database on its own.
2) In October 2007, I coded a standalone MS Windows application of the database that could be installed on any system running Windows XP. This made sense in a country where PC penetration is higher than Internet penetration. The Windows application became quite popular and I received numerous requests for CDs from different corners of the country.
3) Around the same time, I also coded a similar application for decisions of the Central Information Commission – the apex statutory tribunal for adjudicating disputes under the Right to Information Act.
4) In February 2008, both applications were included in the DVD of Digit Magazine – a popular IT magazine in India.

Unfortunately, in August 2008, the Supreme Court website changed its design so that decisions could no longer be downloaded serially in the manner I had been accustomed to. One can only speculate about what prompted this change, since no improvements were made to the actual presentation of the cases; the only thing that changed was that serial downloading was blocked. The new format was far more difficult for me to “hack”, and my work left me with no time to attempt to circumvent it, so I abandoned the effort.

Fortunately at the same time, an exciting new project called IndianKanoon was started by Sushant Sinha, an Indian computer science graduate at Michigan. In addition to decisions of the Supreme Court, his site covers several high courts and links up to the text of legislation of various kinds. Although I have not abandoned plans to develop OpenJudis, the presence of IndianKanoon has allowed me to step back entirely from this domain – secure in the knowledge that it is being taken forward by abler hands than mine.

Predictions, Observations, Conclusions
I’d like to end this already-too-long post with some reflections, randomly ordered, about legal information online.
1) I think one crucial area commonly neglected by most LIIs is client-side software that enables users to store local copies of entire databases. The urgency of this need is highlighted in the following hypothetical about digital libraries by Siva Vaidhyanathan (from The Anarchist in the Library):

So imagine this: An electronic journal is streamed into a library. A library never has it on its shelf, never owns a paper copy, can’t archive it for posterity. Its patrons can access the material and maybe print it, maybe not. But if the subscription runs out, if the library loses funding and has to cancel that subscription, or if the company itself goes out of business, all the material is gone. The library has no trace of what it bought: no record, no archive. It’s lost entirely.

It may be true that the Internet will be around for some time, but it might be worthwhile for LIIs to stop emulating the commercial database models of restricting control while enabling access. Only then can we begin to take seriously the task of empowering users into archons.

2) My second observation pertains to interface and usability. I have for long been planning to incorporate a set of features including tagging, highlighting, annotating, and bookmarking that I myself would most like to use. Additionally, I have been musing about using Web 2.0 to enable user-participation in maintenance and value-add operations – allowing users to proofread the text of judgments and to compose headnotes. At its most ambitious, in these “visions” of mine, OpenJudis looks like a combination of LII + social networking + Wikipedia.

A common objection to this model is that it would upset the authority of legal texts. In his brilliant essay A Brief History of the Internet from the 15th to the 18th century, the philosopher Lawrence Liang reminds us that the authority of knowledge that we today ascribe to printed text was contested for the longest period in modern history.

Far from ensuring fixity or authority, this early history of Printing was marked by uncertainty, and the constant refrain for a long time was that you could not rely on the book; a French scholar Adrien Baillet warned in 1685 that “the multitude of books which grows every day” would cast Europe into “a state as barbarous as that of the centuries that followed the fall of the Roman Empire.”

Europe’s non-descent into barbarism offers us a degree of comfort in dealing with Adrien Baillet-type arguments made in the context of legal information. The stability that we ascribe to law reports today is a relatively recent historical innovation that began in the mid-19th century. “Modern” law has longer roots than that.

3) While OpenJudis may look like quite a mammoth endeavor for one person, I was at all times intensely aware that this was by no means a solitary undertaking, and that I was “standing on the shoulders of giants.” They included the nameless thousands at the NIC who continue to design websites, scan and upload cases on the court websites – a Sisyphean task – and the thousands whose labor collectively produced the free software I used: Fedora Core 4, PHP, MySQL, Swish-e. And lastly, the nameless millions who toil to make the physical infrastructure of the Internet itself possible. Like the ground beneath our feet, we take it for granted, even as the recent tragic events in Haiti remind us to be more attentive. (For a truly Herculean endeavor, however, see Sushant Sinha’s IndianKanoon website, about which many ballads may be composed in the decades to come.)

It might be worthwhile for the custodians of LIIs to enable users to become derivative producers themselves, to engage in “practices of self-directed agency” as Benkler suggests. Without sounding immodest, I think the real story of OpenJudis is how the Internet makes it plausible and thinkable for average Joes like me (and better-than-average people like Sushant Sinha) to think of waging unilateral wars against publishing empires.

4) So, what is the impact that all this ubiquitous, instant, free electronic access to legal information is likely to have on the world of law? In a series of lectures titled “Archive Fever,” the philosopher Derrida posed a similar question in a somewhat different context: What would the discipline of psychoanalysis have looked like, he asked, if Sigmund Freud and his contemporaries had had access to computers, televisions, and email? In brief, his answer was that the discipline of psychoanalysis itself would not have been the same – it would have been transformed “from the bottom up” and its very events would have been altered. This is because, in Derrida’s view:

The archive . . . in general is not only the place for stocking and for conserving an archivable content of the past. . . .  No, the technical structure of the archiving archive also determines the structure of the archivable content even in its coming into existence and in its relationship to the future. The archivization produces as much as it records the event.

The implication, following Derrida, is that in the past, law would not have been what it currently is if electronic archives had been possible. And the obverse is true as well: in the future, because of the Internet, “rule of law” will no longer observe the logic of the stable trajectories suggested by its classical “analog” commentators. New trajectories will have to be charted.

5) In the same book, Derrida describes a condition he calls “Archive fever”:

It is to burn with a passion. It is never to rest, interminably, from searching for the archive right where it slips away. It is to run after the archive even if there’s too much of it. It is to have a compulsive, repetitive and nostalgic desire for the archive, an irrepressible desire to return to the origin, a homesickness, a nostalgia for the return to the most archaic place of absolute commencement.

I don’t know about other readers of VoxPopuLII (if indeed you’ve managed to continue reading this far!), but for the longest time during and after OpenJudis, I suffered distinctly from this malady. I downloaded indiscriminately whole sets of data that still sit unused on my computer, not having made it into OpenJudis. For those in a similar predicament, I offer Borges’s quote, with which I began this text, as a reminder of the foolishness of the notion of “Total Libraries.”

Prashant Iyengar is a lawyer affiliated with the Alternative Law Forum, Bangalore, India. He is currently pursuing his graduate studies at Columbia University in New York. He runs OpenJudis, a free database of Indian Supreme Court cases.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

It’s been a rocky year for West’s relationship with law librarians.

First, the company declined to participate in this year’s American Association of Law Libraries Price Index for Legal Publications. This led AALL to return West’s sponsorship check for the 2009 AALL Annual Meeting. For attendees, this decision was somewhat academic, as West still occupied a large space in the Exhibitor Hall and once again hosted a well-attended Customer Appreciation Party.

Shortly after the conference, West issued an email promotion to customers that asked:

Are you on a first name basis with the librarian? If so, chances are, you’re spending too much time at the library. What you need is fast, reliable research you can access right in your office.

Many law librarians felt publicly insulted by West, expressing their outrage on listservs, blogs, Twitter, Facebook and anywhere legal information professionals could be found that week.

Most recently, West released a video of University of California, Berkeley professor and law librarian Bob Berring explaining the advantages of “free market” premium legal databases over free legal information websites run by “volunteers”:

It’s not like legal information is going to the Safeway or to buy food. You’re not buying a packaged thing. If you say I need to find statutes about this, or what’s the administrative regulations on that, or have the courts spoken about this, you have to go find it. And just saying it’s all out there — I mean, the ocean is all out there, but you need a map, and you need a compass, and… you need a GPS system now. You need someone to tell you how to get there. That’s why librarians are even more important now, because they’ve got the GPS system. But you have to be working with organized information. The value added by folks like West, where the information is edited as it goes in, and it’s classified, and the hooks are put in — easy hooks for the people who I think are sloppy researchers just playing around on the tops, really sophisticated hooks for the people who take the time to learn how to really use the system and understand it. You just can’t say enough about those kind of things, because to say to the average person, “Well, it’s all out there, the law is all out there,” well, it’s a big bunch of goo.

Adding value to the goo

Unfortunately, the West/Lexis duopoly doesn’t provide consumers with the expected advantages of a free market economy. Neither vendor uses price as a marketing strategy, and both negotiate electronic database contracts with customers rather than charge a flat rate. Considering that West has increased its own annual profit margin to 30% or higher in recent years, while raising the cost of supplements at a rate far exceeding inflation, prices are hardly being driven by free market trends, making a price war seem unlikely. (This doesn’t mean consumers aren’t hopping mad about the price of legal information. They are.)

Instead, at least in the database market, both companies rely on content and features to market their products. Each July at the AALL Annual Meeting, both Lexis and Westlaw use their exhibitor space to educate attendees about whatever new databases and customer conveniences will be rolled out in the coming months.

I often compare these annual feature introductions to the evolution of automobile engines, thanks to a childhood spent watching my father work on the family cars. At first Dad knew every nook and cranny of our vehicles, and there was little he couldn’t repair himself over the course of a few nights. As we traded in cars for newer models, his job became more difficult as engines became more complex. None of the automakers seemed to consider ease of access when adding new parts to an automobile engine. They were simply slapped on top of the existing ones, making it harder to perform simple tasks, like replacing belts or spark plugs.

Lexis and Westlaw also add new components on top of the old ones. To generalize, Lexis tends to add new features in the form of tabs (think “Total Litigator”) while Westlaw adds them in sidebars (think “Results Plus”), to the point where once clean interfaces are now littered with disparate elements sharing adjacent screen real estate.

Finding fault with filters

In a talk at last year’s Web 2.0 Expo in New York, author Clay Shirky stated that the fundamental information problem is not “information overload,” but “filter failure.” Shirky summarized this position in a recent interview with Yale Law School’s Jason Eiseman:

As I’ve often said, there’s no such thing as information overload. It’s filter failure, right? From the minute we had more books to read than the average literate person could read in a lifetime, which depending on the region you’re talking about happened someplace between the 16th and 19th century, from that moment on we’ve always had information overload. That’s the modern condition. What’s happening, I think, to our sense that we’re suffering acutely from information overload now is that the old professional filters have broken. They’re simply not adequate to contain a world in which anyone can put material out in the public.

Whether or not you agree with Shirky’s assessment, it provides an interesting framework with which to view the Lexis/Westlaw information problem. If the primary legal information within these systems is “a big bunch of goo,” then secondary resources, headnotes, subject-specific organization, and other finding aids are the filters necessary to cope with information overload.

For West’s “Are you on a first name basis with the librarian?” promotion to work, Westlaw has to provide the “fast, reliable research you can access right in your office” that it advertises. Assuming for purposes of this essay that the presence of relevant content isn’t an issue (an assumption with which many will quibble), this means the system’s filters need to provide reliable information quickly.

There’s no question that both West and Lexis provide an abundance of subject-specific organization, particularly for case law. Headnotes, topics, digests, tables of authority, citators and cross-references to secondary resources all go above and beyond what researchers find in most freely available resources. But these add-ons, or filters, are only effective if presented in a usable manner.

For an assignment in one of my legal research classes this semester, I provided a fact pattern and asked students to perform a Natural Language search in Westlaw of American Law Reports to find a relevant annotation. In a class of only 19 students, six of them answered with citations to resources other than ALR, including articles from American Jurisprudence, Am. Jur. Proof of Facts, and Shepard’s Causes of Action. The problem, it turned out, wasn’t that they had searched the wrong database. Every one of them searched ALR correctly, but those six students mistook Westlaw’s Results Plus, placed at the top of a sidebar on the results page, for their actual search results. Filter failure, indeed.

On another assignment, students were expected to find a particular statutory code section using a secondary resource, view the code section, then navigate to the code’s table of contents to browse related sections codified nearby. This proved nearly impossible for most of them, as the code section they accessed loaded in a pop-up window with no sidebar, thus providing no visible link to the table of contents. The problems didn’t stop there. Even after I told them to click the “Maximize” button at the bottom of the pop-up window, which reloads the code section into the main window with a sidebar, anyone using Firefox for Windows who then clicked the TOC link got a blank page. (To resolve this error, you have to right-click on the frame where the TOC should have loaded and select “This Frame -> Reload This Frame.”)

While completing another portion of the statutory code assignment in Lexis, nearly half the students in the class became confused because numerous clickable links throughout the system display as plain black text and appear as links only when the user hovers over them. Also, within statutory code sections, the navigation links provided in the case annotation index routinely loaded an error page rather than navigating to the proper section further down the page.

This doesn’t even address basic usability issues such as broken back button functionality, heavy usage of frames, lack of permanent document URLs (Lexis and Westlaw each have external workarounds for this), and reliance on pop-up windows (something blocked by default on most browsers). In addition, Lexis still doesn’t support users accessing the system with Firefox for Mac.

The wide availability of secondary resources, annotated codes, and numerous other value-added content provides a clear advantage for Lexis and Westlaw over free and mid-level legal information services, and that’s why everyone continues to pay their steep prices. But so long as the systems themselves don’t provide usable access, each still suffers from filter failure.

Is there an incentive to improve?

There is evidence that the companies have the expertise to provide a better user experience. West has two electronic versions (one for desktop computers and one for the iPhone) of Black’s Law Dictionary available that offer more intuitive functionality than what’s provided for the same text in Westlaw. Don’t expect a price break, however. The desktop version of Black’s has a list price of $99, while the iPhone version costs $49.99. By comparison, the print version of Black’s Standard Ninth Edition, which likely has substantially higher production costs than the electronic equivalents, carries a list price of $75, meaning iPhone users receive a slightly lower price while desktop users pay even more. Worse still, both electronic versions as well as the content in Westlaw contain the text of the outdated 8th Edition.

Lexis also has an iPhone app, and it’s a free download that requires an existing Lexis password to function. Substantially simplified from its traditional web interface, the user experience is clean and easy to understand. Yet while one can retrieve both primary and secondary documents, as well as Shepardize documents, none of the documents in this interface contain links, only plain citations that must be copied and pasted into the search form to be retrieved.

Of course, the bigger problem with these progressive moves is that they don’t address any of the existing problems in the web interfaces for either product. No one is redesigning the engine, so to speak. These are simply variations of the now traditional roll-out of new features and functionality on top of existing ones that still have the same significant issues.

This is the problem with a duopoly. There aren’t enough producers in the economy to assert significant pressure on either to improve usability. Consumer power is also limited because multi-year contracts prevent easy product substitution, and there’s only one true product substitute available. The producers dictate the competition, and thus far they have dictated a content competition (“The Tabs and Sidebars War”), rather than a usability one — or even a price one.

There are events on the horizon that could impact this stalemate. Bloomberg continues to develop its own legal research product, allegedly designed to be a Westlaw/Lexis competitor. Perhaps this third producer will see value in using price or usability to gain market share. Lewis & Clark law student (and VoxPopuLII author) Robb Shecter recently introduced OregonLaws.org, a free repository of Oregon law that currently features the entire Oregon Revised Statutes and a legal glossary. The site’s simple, logical navigation reflects current web usability norms more accurately than either Lexis or Westlaw, and for a “micro-fee” users can bookmark code sections for quick access and save unlimited “human readable” research trails. And, of course, Google Scholar just added “Legal opinions and journals.” It’s far too early to know if it will become a true player in legal information, but Google always has the potential to be a game changer with anything it does.

What can legal research instructors DO?

Despite the presence of these interesting new projects, consumers can’t expect a quick usability turnaround from Lexis and Westlaw, nor the sudden presence of a competitor with the same depth and breadth of content. History doesn’t support such an expectation, leaving legal research instructors in a precarious position.

Many schools leave Lexis/Westlaw training solely in the hands of the companies’ representatives. While a company rep will be knowledgeable about the system, he will also paint the product in the best possible light for the company, glossing over usability issues and emphasizing new features. After all, law students are future customers, so this instruction is part of a long-term sales pitch.

In order to provide a balanced picture of these systems, legal research instructors need to provide their own Lexis and Westlaw training. This can either be in place of or in addition to what’s provided by company reps, but students need to hear the voice of an experienced researcher who doesn’t rely on either company for a paycheck. Some may see this as an implied institutional endorsement of the high-priced systems, but the reality is most students will end up working with one or both of these systems on a daily basis after graduation. Ignoring this would be an educational disservice. Any sense of endorsement can be addressed through thorough coverage of the usability limitations and a short education on the price realities. Instructors can also discuss the availability of lower priced databases for lawyers who simply want access to primary legal materials.

If the market is going to change, it won’t be because Lexis and Westlaw spontaneously decide to improve products that generate significant profits already. Until then, legal researchers need to be better educated on the limitations of these systems so that their work product isn’t compromised by over-reliance on a duopoly disguised as a free market.

Tom Boone is a reference librarian and adjunct professor at Loyola Law School in Los Angeles. He’s also webmaster and a contributing editor for Henderson Valley Eggs, a “themed information collective” website covering law library issues.

VoxPopuLII is edited by Judith Pratt


To take the words of Walt Whitman, when it comes to improving legal information retrieval (IR), lawyers, legal librarians and informaticians are all to some extent, “Both in and out of the game, and watching and wondering at it“. The reason is that each group holds only a piece of the solution to the puzzle, and as pointed out in an earlier post, they’re talking past each other.

In addition, there appears to be a conservative contingent in each group who actively hinder the kind of cooperation that could generate real progress: lawyers do not always take up technical advances when they are made available, thus discouraging further research; legal librarians cling to indexing when all modern search technologies use free-text search; and informaticians are frequently impatient with, and misunderstand, the needs of legal professionals.

What’s holding progress back?

At root, what has held legal IR back may be the lack of cross-training of professionals in law and informatics, although I’m impressed with the open-mindedness I observe at law and artificial intelligence conferences, and there are some who are breaking out of their comfort zone and neat disciplinary boundaries to address issues in legal informatics.

I recently came back from a visit to the National Institute of Informatics in Japan where I met Ken Satoh, a logician who, late in his professional career, has just graduated from law school. This is not just hard work. I believe it takes a great deal of character for a seasoned academic to maintain students’ respect when they are his seniors in a secondary degree. But the end result is worth it: a lab with an exemplary balance of lawyers and computer scientists, experience and enthusiasm, pulling side-by-side.

Still, I occasionally get the feeling we’re all hoping for some sort of miracle to deliver us from the current predicament posed by the explosion of legal information. Legal professionals hope to be saved by technical wizardry, and informaticians like myself are waiting for data provision, methodical legal feedback on system requirements and performance, and in some cases research funding. In other words, we all want someone other than ourselves to get the ball rolling.


The need to evaluate

Take, for example, the lack of large corpora for study, which is one of the biggest stumbling blocks in informatics. Both IR and natural language processing (NLP) currently thrive on experimentation with vast amounts of data, which is used in statistical processing. More data means better statistical estimates and fewer ‘guesses’ at the relevant probabilities. Even commercial legal case retrieval systems, which give the appearance of being Boolean, use statistics and have done so for around 15 years. (They are based on inference networks that simulate Boolean retrieval with weighted indexing by relaxing the rigid conditional probability estimates associated with the Boolean operators ‘and’, ‘or’ and ‘not’. In this way, document ranking increasingly depends on the number of query constraints met.)
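
As a toy illustration of that softening (my own simplification, with invented numbers, not the estimates any commercial system actually uses): if each query term contributes a per-document belief between 0 and 1, a strict probabilistic AND multiplies the beliefs, so a single missing term zeroes the score, while a relaxed operator lets partial matches count, so that documents meeting more of the query constraints rank higher:

```python
# Per-document "belief" that each query term describes the document
# (e.g., a normalised term weight). All values are invented.
doc_beliefs = {
    "doc_a": {"negligence": 0.9, "damages": 0.8, "foreseeability": 0.7},
    "doc_b": {"negligence": 0.9, "damages": 0.0, "foreseeability": 0.8},
}

query = ["negligence", "damages", "foreseeability"]

def strict_and(beliefs):
    # Classic probabilistic AND: one missing term zeroes the whole score.
    score = 1.0
    for p in beliefs:
        score *= p
    return score

def weighted_sum(beliefs):
    # Relaxed operator: the score grows with the number of constraints met.
    return sum(beliefs) / len(beliefs)

for doc, beliefs in doc_beliefs.items():
    ps = [beliefs.get(term, 0.0) for term in query]
    print(doc, "strict AND:", round(strict_and(ps), 3),
          "| relaxed:", round(weighted_sum(ps), 3))
```

Under the strict operator doc_b scores zero despite satisfying two of the three constraints; under the relaxed operator it is still ranked, but below doc_a, which satisfies all three.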

The problem is that to evaluate new techniques in IR (and thus improve the field), you need not only a corpus of documents to search but also a sample of legal queries and a list of all the relevant documents in response to those queries that exist in your corpus, perhaps even with some indication of how relevant they are. This is not easy to come by. In theory a lot of case data is publicly available, but accumulating and cleaning legal text downloaded from the internet, making it amenable to search, is nothing short of tortuous. Then relevance judgments must be given by legal professionals, which is difficult given that we are talking about a community of people who charge their time by the hour.

Of course, the cooperation of commercial search providers, who own both the data and training sets with relevance judgments, would make everyone’s life much easier, but for obvious commercial reasons they keep their data to themselves.

To see the productive effects of a good data set we need only look at the research boom now occurring in e-discovery (discovery of electronic evidence, or DESI). In 2006 the TREC Legal Track, including a large evaluation corpus, was established in response to the number of trials requiring e-discovery: 75% of Fortune 500 company trials, with more than 90% of company information now stored electronically. This has generated so much interest that an annual DESI workshop has been established since 2007.

Qualitative evaluation of IR performance by legal professionals is an alternative to the quantitative evaluation usually applied in informatics. The development of new ways to visualize and browse results seems particularly well suited to this approach, where we want to know whether users perceive new interfaces to be genuine improvements. Considering the history of legal IR, qualitative evaluation may be as important as traditional IR evaluation metrics of precision and recall. (Precision is the number of relevant items retrieved out of the total number of items retrieved, and recall is the number of relevant items retrieved out of the total number of relevant items in a collection). However, it should not be the sole basis for evaluation.
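
To make those two metrics concrete with invented numbers: if a search returns 40 documents of which 10 are relevant, and the collection contains 50 relevant documents in total, precision is 10/40 = 0.25 and recall is 10/50 = 0.20. In code:

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision and recall for one query, given sets of document ids."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: documents 1-40 are retrieved; documents 31-80 are relevant,
# so 10 of the 40 retrieved documents are relevant.
print(precision_recall(range(1, 41), range(31, 81)))   # -> (0.25, 0.2)
```

Simple as these quantities are, they are exactly what qualitative impressions tend to misjudge.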

A well-known study by Blair and Maron makes this point plain. The authors showed that expert legal researchers retrieve less than 20% of relevant documents when they believe they have found over 75%. In other words, even experts can be very poor judges of retrieval performance.

Context in legal retrieval

ParadigmShift

Setting this aside, where do we go from here? Dan Dabney argued at the American Association of Law Libraries (AALL) 2005 Annual Meeting that free-text search decontextualizes information, and he is right. One thing to notice about current methods in open-domain IR, including vector space models, probabilistic models and language models, is that the only context they take into account is proximate terms (phrases). At heart, they treat all terms as independent.

However, it’s risky to accept the conclusion reported from the same meeting: “Using indexes improves accuracy, eliminates false positive results, and leads to completion in ways that full-text searching simply cannot.” I would be interested to know whether this continues to be a general perception amongst legal librarians, despite a more recent emphasis on innovating with technologies that don’t encroach upon the sacred ground of indexing. Perhaps there’s a misconception that capitalizing on full-text search methods would necessarily replace the use of index terms. This isn’t the case; inference networks used in commercial legal IR are not applied in the open domain, and one of their advantages is that they can incorporate any number of types of evidence.

Specifically, index numbers, terms, phrases, citations, topics and any other desired information are treated as representation nodes in a directed acyclic graph (the network). This graph is used to estimate the probability of a user’s information need being met given a document.
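
Sketched very crudely (with a weighted combination standing in for the network’s probabilistic machinery, and invented numbers and weights throughout), the practical point is simply that several kinds of evidence can feed a single relevance estimate:

```python
# Invented beliefs contributed by three kinds of evidence for one document.
evidence = {
    "free_text_terms": 0.6,   # query terms found in the opinion text
    "index_numbers": 0.9,     # a matching editorial classification
    "citations": 0.3,         # citation overlap with cases already found
}

# Invented weights reflecting how much each evidence source is trusted.
weights = {"free_text_terms": 0.5, "index_numbers": 0.3, "citations": 0.2}

def combined_belief(evidence, weights):
    """Combine heterogeneous evidence into one estimate of relevance."""
    return sum(weights[kind] * belief for kind, belief in evidence.items())

print(round(combined_belief(evidence, weights), 3))   # -> 0.63
```

In other words, editorial index terms and free-text evidence need not be rivals: the architecture already accommodates both.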

For the time being, lawyers, unaware of the technology under the hood, default to using inference networks in a way that is familiar, via a search interface that easily incorporates index terms and looks like a Boolean search. (Inference nets are not Boolean, but they can be made to behave in the same way.) While Boolean search does tend to be more precise than other methods, the more data there is to search, the less well the system performs. Further, it’s not all about precision. Recall of relevant documents is also important, and this can be a weak point for Boolean retrieval. Eliminating false positives is no accolade when true positives are eliminated at the same time.

Since the current predicament is an explosion of data, arguing for indexing by contrasting it with full-text retrieval without considering how they might work together seems counterproductive.

Perhaps instead we should be looking at revamping legal reliance on a Boolean-style interface so that we can make better use of full-text search. This will be difficult. Lawyers, who are charged, and charge, per search, must be able to demonstrate the value of each search to clients; they can’t afford the iterative browsing that characterizes open-domain search. Further, if the intelligence is inside the retrieval system, rather than held by legal researchers in the form of knowledge about how to put complex queries together, how are search costs justified? Although Boolean queries are no longer well adapted to the scale of the data, at least their value is easy to demonstrate. A push towards free-text search by either legal professionals or commercial search providers will demand a rethink of billing structures.

Given our current systems, are there incremental ways we can improve results from full-text search? Query expansion is a natural consideration and incidentally overlaps with much of the technology underlying graphical means of data exploration such as word clouds and wonderwheels; the difference is that query expansion goes on behind the scenes, whereas in graphical methods the user is allowed to control the process. Query expansion helps the user find terms they hadn’t thought of, but this doesn’t help with the decontextualization problem identified by Dabney; it simply adds more independent terms or phrases.
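
To make “behind the scenes” concrete, here is a toy expansion step in Python, with a hand-made synonym table standing in for the thesauri, co-occurrence statistics, or relevance feedback a real system would use (all names and terms below are invented):

```python
# Hypothetical expansion table; real systems derive candidates from thesauri,
# co-occurrence statistics or relevance feedback rather than a hand-made dict.
expansions = {
    "dismissal": ["termination", "discharge"],
    "employee": ["worker", "servant"],
}

def expand_query(terms, expansions):
    """Add related terms to the user's query, keeping the originals first."""
    expanded = list(terms)
    for term in terms:
        for candidate in expansions.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["wrongful", "dismissal", "employee"], expansions))
# -> ['wrongful', 'dismissal', 'employee', 'termination', 'discharge', 'worker', 'servant']
```

Note that the expanded terms are still treated as independent of one another, which is precisely the limitation at issue.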

In order to contextualize information we can marry free-text search terms with index numbers, as is already done. Even better would be to do some linguistic analysis of a query to really narrow down the situations in which we want terms to appear. In this way we might get at questions such as “What happened in a case?” or “Why did it happen?” rather than just “What is this document about?”.

Language processing and IR

Use of linguistic information in IR isn’t a novel idea. In the 1980s, IR researchers started to think about incorporating NLP as an intrinsic part of retrieval. Many of the early approaches attempted to use syntactic information for improving term indexing or weighting. For example, Fagan improved performance by applying syntactic rules to extract similar phrases from queries and documents and then using them for direct matching, but it was held that this was comparable to a less complex, and therefore preferable, statistical approach to language analysis. In fact, Fagan’s work demonstrated early on what is now generally accepted: statistical methods that do not assume any knowledge of word meaning or linguistic role are surprisingly (some would say depressingly) hard to beat for retrieval performance.

Since then there have been a number of attempts to incorporate NLP in IR, but depending on the task involved, there can be a lack of highly accurate methods for automatic linguistic analysis of text that are also robust enough to handle unexpected and complex language constructions. (There are exceptions, for example, part-of-speech tagging is highly accurate.) The result is that improved retrieval performance is often offset by negative effects, resulting in a minimal positive, or even a negative impact on overall performance. This makes NLP techniques not worth the price of additional computational overheads in time and data storage.

However, just because the task isn’t easy doesn’t mean we should give up. Researchers, including myself, are looking afresh at ways to incorporate NLP into IR. This is being encouraged by organizations such as the NII Test Collection for IR Systems Project (NTCIR), which from 2003 to 2006 amassed excellent test and training data for patent retrieval, with corpora in Japanese and English and queries in five languages. Their focus has recently shifted towards NLP tasks associated with retrieval, such as patent data mining and classification. Further, their corpora enable the study of cross-language retrieval issues that become important in e-discovery, since only a fraction of a global corporation’s electronic information will be in English.

We stand on the threshold of what will be a period of rapid innovation in legal search driven by the integration of existing knowledge bases with emerging full-text processing technologies. Let’s explore the options.

K. Tamsin Maxwell is a PhD candidate in informatics at the University of Edinburgh, focusing on information retrieval in law. She has an MSc in cognitive science and natural language engineering, and has given guest talks at the University of Bologna, UMass Amherst, and NAIST in Japan. Her areas of interest include text processing, machine learning, information retrieval, data mining and information extraction. Her passion outside academia is Arabic dance.

VoxPopuLII is edited by Judith Pratt.