VoxPopuLII

Context and Legal Informatics Research

Applications, Crowdsourcing the writing of secondary legal resources, information retrieval, Legal citations, Legal knowledge representation, Legal semantic web, Legal social media, Legal social networks, Legal text processing, natural language processing, Nonlawyers' use of legal information, privacy, Pro se litigants, Self represented litigants 2 Responses »

Jun 012010

[Editor’s Note: A slighly different version of this post was published on Slaw in May 2010. We thank Slaw‘s editor, Professor Simon Fodden, for granting permission to repost.]

The relationship of legal information to context is a key dimension of recent developments in legal informatics scholarship and innovation. These developments range from investigations in law and psychology to political and moral theory, from explorations in artificial intelligence and law to legal information theory, and from research on the legal Semantic Web to the creation of new applications that help nonlawyers contextualize legal information.

Professor Helen Nissenbaum has foregrounded the notion of context in the debate over privacy respecting court records. In her new book, Privacy in Context, and in her presentation at the 2010 Princeton Open Government Workshop, Nissenbaum defines the right to privacy as the right to what she calls “contextual integrity,” meaning the use or “flow” of information consistent with the norm for information transfer applicable in the particular context — such as home, school, workplace, court, etc. — where the information was first transferred. For Nissenbaum, key characteristics of an information context — the sender of the information, the intended recipient of the information, the subject of the information, the type of information that is shared, and the purposes and goals served by the information context — determine which norm governs information flow in that context. In Nissenbaum’s view, when information about individuals flows in a manner consistent with its contextual norm, “contextual integrity” — and those individuals’ privacy rights — are considered to have been preserved, but if that information flows in a manner inconsistent with that norm, contextual integrity — and thus the privacy rights of those individuals — are considered to have been breached. Nissenbaum argues that when a new technology — such as a publicly available online database of court records — arises that arguably violates contextual integrity, a presumption should arise in favor of preserving contextual integrity and rejecting the new technology. On Nissenbaum’s account, that presumption may be rebutted if the new technology can be shown more effectively to serve both general social values such as social welfare and national security, and also the particular purposes and goals of the original information context. Nissenbaum allows that the new technology might also prevail if it, or some aspect of the information available through it, could be modified to render it superior in vindicating both general and contextual values. Applying Nissenbaum’s model to court records would thus entail a careful consideration of a variety of contexts in which those records are generated, the rejection of certain new information technologies, and also negotiations with technologists to determine whether certain new technologies, or the data they process, could be modified so as to render those technologies superior to the original systems in vindicating general and contextual values.

Professor Guido Boella, Dr. Guido Governatori, and colleagues are exploring ways to model legal contexts to aid automated legal reasoning. In their recent paper these scholars show how defeasible logic can be employed to represent the policy context of legal rules. Their approach could improve computers’ capacity to assess legal compliance, and could contribute to the automation of the interpretation of legal language.

In legal information retrieval, K. Tamsin Maxwell — in a recent post, as well as in a recent conference paper co-authored by Burkhard Schafer — is exploring the use of Natural Language Processing techniques to contextualize queries, automate discovery of factually-similar cases, and achieve “near perfect search recall within the context of precision.”

Recent research in law and psychology — particularly the research highlighted by The Situationist blog published by the Project on Law and Mind Sciences at Harvard Law School — emphasizes how context affects people’s understanding and use of legal information. For example, Professor Adam Benforado‘s new paper explores how spatial situations affect the law-related behavior and thinking of various participants in criminal cases, while another of his recent articles argues that the context of the videotape evidence at issue in Scott v. Harris had a profound and unacknowledged influence on the way the U.S. Supreme Court interpreted that evidence.

In her recent post and her dissertation in progress, Christine Kirchberger explores the importance of context for making legal information usable by nonlawyers. Kirchberger highlights legal Semantic Web technology — such as that discussed in Dr. Núria Casellas’s recent post on legal ontologies — and government eportals — like Austria’s HELP service — as promising means of offering valuable context to nonlawyers using legal information. Kirchberger quotes Tom Bruce‘s 2001 paper on the need to build flexible systems that can present legal information in a range of different contexts, suited to the needs of different users of those systems.

Some free access to law services are providing this kind of contextual information by building secondary sources into their systems. For example, the Cornell Legal Information Institute‘s Wex legal encyclopedia — written collaboratively by volunteers — explains key legal concepts and terms that appear in the Institute’s primary collections. As Staffan Malmgren explains in this recent post, his lagen.nu free access to law service for Sweden includes commentaries, written by means of an innovative crowdsourcing method.

Automatic linking is another method of furnishing context to users of legal information, as hotlinked citations enable quick retrieval of full text sources that make up legal context. Free access to law services such as CanLII provide such technology for linking to primary legal sources — as Ivan Mokanov explains in this recent post. In his recent thesis, conference paper, and post, Olivier Charbonneau proposes several ideas for delivering contextual information to users of free legal systems. These include personalized user interfaces; automatic display of citing sources when a cursor is placed over a passage of a primary legal document; automatic display of relevant commentaries below or alongside a primary legal text; and user ratings of user-contributed commentary, to help nonlawyers assess the quality of content.

Dr. Floris Bex in his recent dissertation and post explains how argument- and narrative-mapping technology can provide valuable context for prosecutors conducting criminal investigations. Dr. Bex describes his and his colleagues’ research — and particularly the work of Dr. Susan van den Braak — respecting a variety of applications that provide visual displays of investigators’ legal and factual arguments and narrative accounts of alleged crimes. These tools allow investigators to contextualize each relevant fact and point of law within a conceptual framework for their case.

Two current U.S. federal court technology efforts aim to help nonlawyers put legal information in context. JERS, the Jury Evidence Recording System, enables jurors in four U.S. federal district courts to view digital representations of trial evidence and exhibits in the jury deliberation room, and navigate through the information — including via zooming and scrolling — by means of video touchscreen. This access to evidentiary information not only helps each individual juror attain a contextual understanding of the applicable law and facts of the case; it also increases the likelihood that all members of a jury will share the same understanding of the context of the case. On April 27, 2010, the Administrative Office of the U.S. Courts announced that funding for JERS had been renewed.

Whereas JERS helps jurors place legal information in context, the E Pro Se document assembly application assists self-represented litigants in contextualizing such information. A customization of the A2J Author document assembly program created for use in law school clinics by CALI, the Center for Computer Assisted Legal Instruction, and the Chicago-Kent College of Law’s Center for Access to Justice and Technology, E Pro Se conducts automated interactive interviews with pro se litigants in U.S. federal district court — for purposes of gathering contextual information from the litigants — and then processes that information to assemble pleadings and other court papers for the litigants. E Pro Se is now available online from the U.S. District Court for the Eastern District of Missouri. According to The Third Branch, the federal district court in Minnesota has begun a pilot E Pro Se project, and such a project will shortly begin in the Massachusetts federal district court. (Thanks to CALI’s Executive Director John P. Mayer for information on this topic.)

Nonlawyers who participate in policy discussions about proposed laws also need context to understand those laws. Researchers participating in the EU’s IMPACT Project are creating tools to provide this context. These tools include argument mapping applications and Semantic Web technology — described in new papers by Professor Tom van Engers and Dr. Adam Wyner — for organizing policy discussions into subject-related threads, with visual displays of the reasoning underlying the arguments that make up the discussion, translation of policy arguments into the preferred language of each user, and Web 2.0 services facilitating users’ participation in the discussions.

Context thus appears to be a focal point for legal informatics research, at the levels of theory, policy, and systems development. Research activity in this area appears to be vigorous, and embraces many disciplines in addition to law, including computer science, philosophy, political science, psychology, linguistics, sociology, anthropology, and information science. As the disintermediation of legal information professionals, the unbundling of legal services, and the participation of citizens in policy- and lawmaking proceed, the need can only grow for greater knowledge of how context affects individuals’ understanding and use of legal information, and for systems that effectively provide nonlawyers with relevant legal contextual information.

Robert Richards, JD, MSLIS, MA, is an information and communications researcher, specializing in legal information and communication systems. He is based in Philadelphia. He is editor in chief of VoxPopuLII, writes the Legal Informatics Blog, and created and maintains the online bibliography Legal Information Systems & Legal Informatics Resources. He is the founder and administrator of the Legal Informatics Research Network, an online community for those studying or developing legal information systems. His most recent writings include What Is Legal Information?, a paper delivered at the 2009 University of Colorado at Boulder Conference on Legal Information, and Cost-Effective Research in U.S. Bankruptcy Law (2009). As of September 2010 he will be a Ph.D. student in the University of Washington Department of Communication.

VoxPopuLII is edited by Judith Pratt.

Weaving the Legal Semantic Web with Natural Language Processing

Annotation of legal texts, information retrieval, Legal knowledge representation, Legal semantic web, Legal text mining, Legal text processing, Legal XML, natural language processing, Semantic annotation of legal texts, Semantic Web and law 4 Responses »

May 172010

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these are large, complex, and heterogeneous topics of research; in this short post, we can only hope to touch on a fragment and that heavily biased to our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appear to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.

Tower of Babel Though it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at sometime in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com .

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as:

grammatical constructions (passive or active sentence forms, quotation, reference to other individuals, and so on),
grammatical relations among terms (e.g., whether an individual is the agent or target of some action),
ontological relations (e.g., classes and subclasses of experts, or the relationships among courts in the judicial hierarchy),
relationships among elements (e.g., who works for what organization), or
high-level patterns such as legal arguments (e.g., expert testimony) and fact patterns (e.g., culpable intent).

People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Finally, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

Generation – converting information in a database into natural language.
Understanding – converting natural language into a machine-readable form.
Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily (here) interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There are a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are somewhat of a black box in that we cannot really know the rules that are learned and use them further.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

Sentence segmentation – dividing text into sentences.
Tokenisation – words identified by spaces between them.
Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
Named entity recognition – the entities in the text such as organisations, people, and places.
Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

Case citation.
Names of parties.
Roles of parties, meaning plaintiff or defendant.
Type of court.
Names of judges.
Names of attorneys.
Roles of attorneys, meaning the side they represent.
Final decision.
Cases cited.
Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above (to see a larger version of the image, right click on the image, then click on “View Image” or a similar phrase; when finished viewing the image, use the browser’s back button to return to the text):

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

jurMeta - New Metadata Initiative for Legal Documents

Dublin Core and legal informatics, information retrieval, Legal identifiers, Legal metadata, Legal semantic web 4 Responses »

Apr 302010

This article is about jurMeta, a new metadata initiative for legal texts that I initiated with two colleagues in Germany. Our vision for jurMeta goes beyond a standard set of metadata tags and predefined legal terms. JurMeta is a whole concept allowing the automatic annotation, exchange and linking of legal documents on the Internet for a wide community. Technically, this will be achieved by easy-to-use plugins for blog systems and content management systems. Using this system, citations to a court decision in a blog post could be linked with the court decision itself on another website by simply installing a plugin. On http://jurmeta.de, I have already started a platform for the standard itself and for tools that can be built on top of it in future.

We need a practical standard

After having Googled metadata standards for legal texts for hours, having screened dozens of projects, and having seen a lot of ontologies and legal topic maps, I suggested to my colleague Andreas Bock: “Let’s go and create our own metadata standard. Everything I’ve seen so far is too complex and does not fit our practical needs.”

In August 2009, we had just released kjur.de – “Recht einfach finden” — “Law easily found” — a small, Google-like search engine for legal documents in the German language. The concept is to offer better-structured access to freely available legal content by avoiding all that noise that bothers you when using one of the big search engines.

Some other vertical search engines for legal content (such as these and Google Scholar) have been developed in recent years. But most are built on a Google custom search, and so they offer only limited possibilities for analyzing content. We chose a flexible but more difficult method, and designed our back-end on top of an existing crawler. Legal acts, blogs, Wikipedia articles, and a bibliography were the first content types in kjur.de. Unfortunately, a few weeks after the release, the nice beep of a hard disc head crash and a badly configured RAID sent a key part of our document processing into digital nirvana.

However, this created a golden opportunity to address the increased demands on our system. We decided to shift from a less- to a more-structured data processing and storage approach. Our new, flexible, and scalable process mined the documents for particular legal information, and automatically annotated the documents with metadata, marked up in XML. In homage to Yahoo pipes, we called our process “kjur pipe.” The first goal was to offer more detailed access to legal information and, with query expansion, to provide an extended search. In order to be able do this, we needed a document and metadata standard.

jurMeta at the EDV-Gerichtstag 2009

Part two of the history of jurMeta concerns a session at the EDV-Gerichtstag in Saarbrucken, a conference — held annually in Germany — on technical aspects of legal documents. Ralf Zosel, who organized the session about free legal online projects, invited us to talk about our experiences with our project. Ralf created http://jurawiki.de, the first and best-known legal online community in Germany.

We spoke about the accessibility of law statutes and court decisions on governmental websites. The German landscape of official websites with legal content is split up into a large number of very heterogeneous portals. For example, at http://www.gesetze-im-internet.de, one can find most of the consolidated federal acts. Court decisions are available from five different sites: the Federal Constitutional Court, the Federal Civil Court, the Federal Employment Court, the Federal Patent Court, and the Federal Finance Court. Additionally, most of the 16 federal states have their own databases with court decisions and the most important statutes.

An interesting detail is that many of these free databases provided by the government authorities are hosted by a publisher who sells the same documents on his own web portal. That these sites are protected by robots.txt files and meta tags clearly demonstrates that search spiders are not welcome. Some of the official sites allow crawling, but use complicated JavaScripts that can cause infinite crawling loops. Therefore, except for the database of the Federal Constitutional Court, the German Yahoo and Google indices contain not a single court decision indexed from the federal courts in Germany.

Moreover, the automatic extraction of metadata from online German legal documents poses a big problem. Let me demonstrate this with a metadata field that basically exists in every .doc, .pdf, or .html file — the title. The users of our search engine like to see the search results with titles that tell something about what they can find behind the link; for example, the name of the court, the file number, and the date. It is a pity that the document titles of the crawled documents rarely contain any useful information. Most of the titles of court decisions on public websites contain the file name of the MS-Word template, the name of the secretary of the court who wrote the decision, or simply “New Document.doc.” So we had to program a script for metadata extraction in order to build nice titles on our own.

Therefore, in the session at the EDV-Gerichtstag 2009, my colleague suggested to the government that they publish official legal texts on a centralized website — and that those texts should be structured and should contain descriptive metadata, if those metadata appear in the original documents. If the government took these steps, my colleague contended, the publishers of legal information would be free to compete with respect to creating the best information system and the best added value for the user, and would not spend too much time or money adding technical structure to primary legal texts.

How to give birth to a vital metadata baby

My role in the session was to offer some ideas about what a metadata standard could look like. When I prepared my slides, I was very skeptical. I was afraid of presenting a stillborn child, because nobody needs a new metadata standard. Who should use it — and why? I had had my very own experiences with XML metadata-annotated documents when, in 2001, I created a system called VERIS, a database containing public procurement case law. It was the first database — and I think the only one — that allowed the user to export court decisions in the format of the XML Standard of Court Decisions of Saarbrücken. Nobody was interested in this functionality except for the person who copied the whole database piece by piece and created his own in order to sell it. Although a court ordered him to stop, he demonstrated the interoperability of legal XML documents.

Focusing on Web 2.0

I assume that because big publishers have their own standards that depend on their business processes, they are not very interested in a new metadata standard. I am convinced that the largest audience for a new metadata standard can be found in the Web 2.0 community.

A lot of incredibly good and interesting legal web projects are driven by hundreds, perhaps thousands of people, mostly technically talented lawyers. Here are some examples from the German websphere:

Two students created an open platform for court decisions and law texts;
A group of students writes articles for other students who are preparing for academic exams;
Two law students have created a very interesting project that allows others to create and share legal mindmaps; and
In the directory http://fjip.de, we can find approximately 400 known German law projects (with more being added each week), all accessible free of cost.

Connecting contents

What might happen if these Web 2.0 projects were able to share information about their content by automatically linking to each other? Consider this hypothetical use case: On http://juraexamen.info, you read an interesting legal article. The article talks about a court decision that is very important for your subject. The reference number of the court decision is automatically generated as a link that brings you directly to the court decision itself on http://openjur.de. There, a preview of a mindmap from http://www.juralib.de/ is shown, that uses a visualization to put the court decision in a global context.

Legal information is eminently suitable for being identified automatically at the document or subdivision level, by an identifier, such as paragraph number, a citation to a journal article, or a reference number of a court decision. These identifiers can take the form of primary keys, and can automatically be referenced from and to other documents. For big publishers of online databases, such functionalities have been common for many years.

These functionalities — configurable, automatically generated, and updated by implemented jurMeta tools — would dramatically increase the value of each legal content site. And a positive side effect will be a higher page-rank for these sites, because of deep links. The links between the different sites and the exchange of metadata can be supported by modules written for the different content management systems. A community could develop such modules. First, however, many questions have to be solved, some of them very important structure-related decisions. For example: will there be a central jurMeta repository on http://jurmeta.de with metadata links, or will the system be completely decentralized? I’m looking forward to a lot of discussion about this.

Considering business models

Everybody who shares legal knowledge for free on the Web is strongly motivated to do so, and most such organizations — which are often run on a non-profit basis — utilize business models. For example, typically, a blogger’s business model is to design a personal brand and to gain a reputation as being knowledgeable in his or her field. In my opinion, supporting free access to law services’ business models with standards, and the tools necessary to implement those standards, offers the best chance for a project like jurMeta to become well established.

The following two short examples demonstrate how linking documents via jurMeta can benefit free access to law services, irrespective of their business models:

If the business model of a website is to earn money with advertising, linking metadata must bring the users to the website’s pages. If metadata fail to do that, nobody will see the ads and the goals of the website creator will remain unachieved.
If the business model is to gain reputation, metadata could be linked — in an AJAX popup, for example — on the condition that the author is named or the author’s photo is shown.

I am absolutely sure that one can identify many ways in which jurMeta can support a certain set of business models commonly employed by free access to law services. I believe that such support is a prerequisite for the broad acceptance of jurMeta.

Technical specifications

Because jurMeta is still in development, we can’t yet provide detailed technical specifications for it. Our highest priority for a jurMeta metadata standard is to allow others to build tools on top of it. Therefore, jurMeta will be a simple standard, following the Pareto principle: gaining 80% of the possible effect by only 20% of the necessary investment.

So far I have developed the tagset and labels that I need for the internal document processing of our search engine, kjur.de. The actual tagset can be found on http://jurmeta.de. In general, I propose to follow most of the rules and principles of Dublin Core, but with law-specific labels. In my opinion, it is very important that no tags or terms of jurMeta be mandatory. Everything is optional. The tools of jurMeta have to prove that people really benefit from annotating metadata.

Here is an example with some labels of the current jurMeta syntax. In HTML, jurMeta tags look like this:

<meta name=”jm.doctype” content=”court decision”>
<meta name=”jm.courtname” content=”Bundesgerichtshof”>
<meta name=”jm.filenumber” content=”VIII ZR 316/06”>
<meta name=”jm.fieldoflaw” content=”Mietrecht”>
<meta name=”jm.keyword” content=”Endrenovierung”>

Here these labels appear in XML encoding with an embedded related document:

<jm:document>

<jm:doctype>court decision</jm:doctype>
<jm:courtname>Bundesgerichtshof</jm:courtname>
<jm:filenumber>VIII ZR 316/06</jm:filenumber>
<jm:fieldoflaw>Mietrecht</jm:fieldoflaw>
<jm:keyword>Endrenovierung</jm:keyword>
<jm:related>

<jm:document>
… another XML-document with jurMeta XML …
</jm:document>

</jm:related>

</jm:document>

I also think of variations of jurMeta in microformat (using the “rel” tag) and RDF. The technical details have to be worked out during the next few months so that we can have a first beta release of jurMeta this year, perhaps at the next EDV-Gerichtstag in September 2010.

I invite everyone to further discussion of this subject here in this blog or on http://jurmeta.de!

[Editor’s Note: Other current applications of Dublin Core metadata to legal materials include:

The LII‘s OAI4Courts and Suggested Metadata Practices for Legislation and Regulations;
John Joergensen‘s metadata for legislative documents and law journal articles; and
Prof. Dr. Maarten Marx‘s metadata for parliamentary documents.

Further, Olivier Charbonneau discusses an automatic link generating system for free access to law services in this post, and Ivan Mokanov discusses CanLII‘s partial implementation of such a system in this post.]

Dipl.-Jur. Felix Zimmermann is CEO and lead developer of http://kjur.de, a legal search engine. He is legal assistant at the law firm Kehr – Ritz & Kollegen in Hannover, Germany. In the past he has worked as research associate at the Institute of Legal Informatics of the Leibniz University of Hannover (Prof. Dr. Wolfgang Kilian) and at the Institute of Security and Security Law of the University of Passau (Prof. Dr. Dirk Heckmann).

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Crime investigation is a difficult and laborious process. In a large case, investigators, judges and jurors are faced with a mass of unstructured evidence of which they have to make sense. They are expected, often without any prior formal training, to map out complex scenarios and assess the potential relevance of a vast amount of evidence to each of these hypothetical scenarios. Humans can only process a limited amount of information at once and various cognitive and social biases such as tunnel vision, groupthink and confirmation bias may lead to unwanted situations and mistakes. Such mistakes, which seem almost unavoidable given the difficult nature of the task, can have a large impact on those involved in the case and in the past there have been a number of miscarriages of justice.

Reasoning with criminal evidence requires one to structure the individual pieces of incoming information. In addition to conventional database and spreadsheet programs, a number of programs, such as those produced by CaseSoft and i2, have been designed specifically for intelligence analysis. However, these tools have one major drawback, in that they do not allow analysts to express their reasoning in the case: the creation and evaluation of scenarios using evidence still take place in the heads of the analysts. At a time when knowledge- and argument mapping is taking off as a field that has to be taken seriously, this seems like a missed opportunity.

The project Making Sense of Evidence, which ran from 2005 to 2009, set out to develop a specialist support tool, in which not only the evidence and scenarios or stories can be structured in a simple way but in which it is also possible to express one’s reasoning about the evidence and stories using a sound underlying theory. Using insights from such diverse fields as legal theory, legal psychology, philosophy, argumentation theory, cognitive modelling and artificial intelligence (AI), a broad theory that both describes and prescribes how crime investigation and criminal legal decision making (should) take place was developed by me in conjunction with Henry Prakken, Bart Verheij and Peter van Koppen. At the same time, Susan van den Braak (together with Gerard Vreeswijk) developed a support tool for crime investigation based on this theory and extensively tested this tool with police analysts (together with Herre van Oostendorp).

Crime investigation, legal decision making and the process of proof

Crime investigation and legal decision making both fall under what Wigmore calls the process of proof, an iterative process of discovering, testing and justifying various hypotheses in the case. Pirolli and Card have proposed an insightful model of intelligence analysis. In Pirolli and Card’s model, the process consists of two main phases, namely foraging and sense-making. In the foraging phase, basic structure is given to a mass of evidence by schematizing the raw evidence into categories, time lines or relation schemes. In the sense-making phase, complex hypotheses consisting of scenarios and evidence are built and evaluated and these results are then presented. It is this last phase in which we are particularly interested: the existing tools for evidence analysis already support the foraging phase.

Reasoning with evidence: stories or arguments?

In the research on reasoning with criminal evidence, two main trends can be distinguished: the argumentative approach and the narrative approach. Arguments are constructed by taking items of evidence and reasoning towards a conclusion respecting facts at issue in the case. This approach has its roots in Toulmin’s argument structure and Wigmore’s evidence charts and has been adapted by influential legal theorists. It has been characterized as evidential reasoning because of the relations underlying each reasoning step: ‘a witness testifying to some event is evidence for the occurrence of the event’. Argumentative reasoning has also been called atomistic because the various elements of a case (i.e. hypotheses, evidential data) are considered separately and the case is not considered ‘as a whole’.

Hypothetical stories based on the evidence can be constructed, telling us what (might have) happened in a case. Alternative stories about what happened before, during and after the crime should then be compared according to their plausibility and the amount of evidence they explain. This approach has been advocated by people from the field of cognitive psychology as being the most natural approach to evidential reasoning. It has been characterized as causal reasoning because of the relations between the events in a story: ‘Because the suspect did not want to get caught by the police, he got in his car and drove off’. The story-based approach has also been called holistic (as opposed to atomistic), because the events are considered as a whole and the individual elements receive less attention.

Both the argument-based and the story-based approaches have their advantages. The argument-based approach, which builds on a significant academic tradition of research, is well suited for a thorough analysis of the individual pieces of evidence, whilst the empirically tested story-based approach is appreciated for its natural account of crime scenarios and causal reasoning. Therefore, in my thesis I have proposed a hybrid theory that combines stories and arguments into one theory. In this hybrid theory, hypothetical stories about what (might have) happened in a case can be anchored in evidence using evidential arguments. Furthermore, arguments can be used to reason about the plausibility of a story.

Sense-making using argument mapping

In recent times, the interest in so-called sense-making systems tools has increased exponentially. In contrast to classic knowledge based systems from Artificial Intelligence, these sense-making systems do not contain a knowledge base and do not reason automatically. Instead, they are intended to help the user make sense out of a certain problem by allowing him or her to logically structure his or her knowledge and reasoning. Thus, they help users make sense out of a certain problem by allowing them to store, share and search knowledge in a structured and intuitive way. The techniques used in sense-making systems include mind maps, concept maps, issue maps and argument maps. Whilst each of these techniques has its own merits, the technique of argument mapping is of particular interest to the current discussion.

Argument mapping, or argument visualization, traces its origins back to Wigmore, who carefully defined a complex visual language for reasoning with a mass of evidence. In the 1990’s, the advent of faster and more GUI-equipped computer programs stimulated the interest in argument mapping and specific software tools for performing these visualizations of argument. For example, in 1998 Robert Horn released a series of complex maps about one of the main debates in AI: can computers think? Software tools for argument visualization have since been used for a variety of purposes. For example, Araucaria is used in legal education, making students familiar with legal argument, and in legal practice, aiding judges in handling simple cases by providing checklists in the form of critical questions to an argument. Rationale is used in university courses to teach critical thinking and in a variety of consultancy tasks, such as producing a report for the army on whether or not to buy a new tank. Debategraph is a wiki debate visualization tool which aims to increase the transparency and rigor of public debate on the internet. The program has made it into mainstream media, as it will be used by CNN’s Christiane Amanpour. Cohere has similar aims, allowing for the visualization of ideas and debates on the web. The Online Visualisation of Argument (OVA) suite of argument mapping tools, while similar to Debategraph and Cohere in that it is built to support the idea of a global World Wide Argument Web, has its own niche appeal in that it deals specifically with structured arguments and is explicitly based on rigorous academic theories of (computational) argumentation.

Tools for argument visualization work because they force one to make explicit the various elements of one’s reasoning, such as the premises and conclusions of an argument or the claims made by the participants in a discussion. Thus, certain ambiguities can be avoided. For example, (evidential) relations between the various elements in an argument can be clearly represented as arrows, whereas in natural language arguments clues that point to possible inferences are often left implicit or phrased ambiguously. Another example of argument visualization’s aiding complex reasoning is when there is more than one reason for a conclusion. As natural language text by its very nature imposes a sequential structure, visualizing the argument can help a great deal.

Story-mapping?

Tools for argument mapping can be worthwhile additions to the existing support software for crime investigation, because these tools enable the structuring, not only of the evidence itself, but also of the reasoning based on this evidence. However, as was argued above, reasoning in the process of proof does not just involve argumentation; stories or narratives play an equally important role.

The existing support software, such as Analyst’s Notebook, makes it possible to incorporate skeleton stories by drawing timelines. However, it is not just the events or sequence of events which makes a story. A proper story also needs to be coherent; that is, its (causal) structure needs to be believable. Because the plausibility of a story depends on the prior beliefs someone has, it is very subjective and therefore open to argument. The existing argument mapping software does not allow for the visualization of stories. Arguments in this software mainly focus on one or two main claims, whereas a story is usually about the greater whole. Although arguments for individual events in a story can be visualized in the current tools, those tools do not allow for the explicit representation of a story’s structure and the relations between the events in a story.

Our project developed a tool, AVERS, which allows for the visualization of causally connected scenarios as well as the arguments supporting or attacking these scenarios. Thus, AVERS allows one to show how a scenario is contradicted by evidence, and to reason about the stories themselves. Arguments can be directly linked to source documents and the type of evidence used in those arguments can be indicated.

Looking at the future

The AVERS tool and the hybrid theory on which it is based are important first steps to developing powerful support and visualization tools tailored to a specific task such as crime investigation or legal decision making. On the theoretical side, further interdisciplinary research is necessary to achieve a truly integrated “science of evidence.” On the practical side, further testing and development of support tools are needed. While visualization can ease the interpretation of complex arguments, complex argument visualization can quickly become “boxes-and-arrow-spaghetti.” Depending on the context, a visual or textual representation may be preferred, and any sense-making tool for argumentation should allow for a combination of the two modes of representation.

Floris Bex is a research assistant at the Argumentation Research Group of the University of Dundee, working on the Dialectical Argumentation Machines (DAM) project. He has an M.Sc. in Cognitive Artificial Intelligence from Utrecht University. In 2009, he was awarded his Ph.D., for his thesis entitled “Evidence for a Good Story: A Hybrid Theory of Stories, Arguments and Criminal Evidence”, from the University of Groningen (Centre for Law and ICT). His thesis outlines a hybrid theory of reasoning with stories and arguments in the context of criminal evidence.

VoxPopuLII is edited by Judith Pratt. Managing editor Rob Richards.

Crowdsourcing Legal Commentary

Adding legal commentary to free access to law services, Applications, Crowdsourcing and free access to law, Crowdsourcing and legal information systems, Crowdsourcing the writing of secondary legal resources, free access to law, Legal commentary, Legal metadata, Legal social media, Legal social networks, Legal treatises, open source software, Public access to legal information, Secondary legal resources, Web 2.0 and law, Wikis and law 5 Responses »

Mar 312010

Background: The Need for Free Legal Commentaries

A legal commentary — also known as a legal treatise — is an unofficial text, intended to complement a particular source of law, often consisting of one or more statutes. A commentary on a statute provides information on how to interpret terms in the statutory text, summarizes examples of the statute’s application, references other relevant parts of the statutory law, and explains the legislative history and policy background of the statute. As statutory law is typically written in an open-ended way, setting forth norms in general language and usually without examples of how the law should be applied, a newcomer may have difficulty understanding it. Legal commentaries help with this.

In many jurisdictions, the texts of statutes are published by the legislature, usually without claim to copyright, and thus are made available to all, including to free-access-to-law services. Legal commentaries, by contrast, are written by private parties, who have a copyright on the resulting text, copies of which they typically sell at price levels that prevent most persons other than legal professionals from accessing them. Thus, most citizens may have free access to law, but not to the texts necessary for understanding it.

My background is that of a software developer, but in 2005 I started law school in Sweden. At that time, I couldn’t find any good freely accessible web service containing Swedish statutory law, so I built one called lagen.nu (which, translated, means “the law, now”). When the site debuted it contained around 4,000 pages of statutory law, and another 10,000 pages with headnotes on legal cases, with hyperlinks and cross-references. Over the next few years the site was gradually improved, with better hyperlinking of references, the addition of the full text of case law, and an improved graphical interface. The purpose of the site was and is to make law accessible to the common person. But making available official data such as statutory and case law can only get you so far, as there are many aspects of the law which are not apparent from the face of primary legal documents.

The Swedish legal system, like most civil law systems, is based more on statutory law than are most common law systems. Jurists in Sweden therefore spend much time interpreting statutory law. There are several publishers — including Thomson Fakta and Norstedts Juridik — that provide legal commentaries on Swedish central acts. The commentaries are written for legal professionals, which is evident in both the extent of the commentaries and their price. (As an example, the standard commentary on the Swedish code of judicial procedure fills four loose-leaf binders and costs 4435 SEK [$611 US].) As such, they are not accessible in any realistic sense to laypeople.

Therefore, I started thinking about how one could create a free commentary on Swedish law. Traditional commentaries require enormous resources, the most critical and expensive being the time of professors or other experts. Commentaries written for legal professionals generally fall into either of two categories: (1) in-depth works, or (2) practice tools. In-depth works — such as Fitger’s Rättegångsbalken — provide extensive treatment of the subject, delve deeply into the historical and teleological background of the regulation in question, and examine every conceivable exceptions-from-the-exception-to-the-rule detail. Those commentaries can be ten times the length of the actual statutory text, and are typically published in book form. In particular, they are written for readers who already understand the basic concepts and structure of the regulation, and who have time to dig deeper. For ordinary people, this level of detail is neither needed nor wanted.

In addition, there are practice tools, shorter commentaries more suited for the practicing lawyer who needs to quickly understand a particular regulation. These are still written for professionals, and are typically accessible as part of an electronic database subscription. Such subscriptions are priced far beyond the reach of non-professionals as well.

Thus neither form of commentary is written for laypeople, and neither is readily available to them. In order for the law to be accessible to all, it needs to be explained differently, and the explanation needs to be freely available. Writing this sort of commentary doesn’t require a tenured professor. In fact, the basic aspects of any central act can be adequately explained by any law student having the following two qualities:

a) A thorough understanding of the subject (for example, having taken the relevant course and having received a good grade); and

b) A talent for explaining complex things succinctly. At this, the student may even outdo the professor, as the student — having recently been a novice respecting the topic — will have a better understanding of how the subject is approached by someone new to it.

In any given act, there are a handful of key sections. A brief introduction to the act, combined with short explanations of these key sections, would be enough to create something useful to a nonlawyer. A person with a good knowledge of the act could write this in an evening. Something more extensive, like a basic commentary on the interesting parts of any central act, could be written in one or two weeks of work.

Still, there are a lot of central acts in Swedish law, and of only a few of these do I have extensive knowledge. Since I wouldn’t be able to write all the commentary myself, I’d have to create something that made it possible for people more knowledgeable than I to contribute their knowledge. Thus, the solution would have two aspects — one technological, one social. During the spring, summer, and autumn of 2009, we tried to create this solution.

Technology: Collaborative Writing Tools

In our proposed free online system, a commentary for a single act would normally contain a brief introduction to the act itself, followed by comments on the most important sections. A simple way to do this is to write the entire commentary as a single text document, divided into different parts, each referencing a particular section of the act. Acts in the Swedish law system can vary greatly in length, with some being only a sentence long, and some being over 100,000 words long and containing over 1000 different sections. However, the decision to use one single commentary text per act was made in order to keep things simple.

Swedish acts are typically divided into chapters, which are divided into sections (though for shorter acts, only sections are used). For example, the second section of the fourth chapter of an act is referenced as “4 kap. 2 §”. We adopted a convention of using such a reference as a header preceding the commentary for the referenced section.

Apart from headings, we tried to use as little formatting as possible. Basic bold and italics were desirable, as were different sorts of lists (ordered, unordered, and definition lists), but not things like multiple fonts or footnotes. Hyperlinks, both internal within the system and external to other web pages, were encouraged.

Apart from commentaries on the acts themselves, we wanted to be able to describe important concepts referred to in the acts. To write succinct explanations of statutes, one often has to refer to central legal concepts (such as “The rule of law“). If that reference can be linked to a page that describes the concept in greater detail, the commentary can be made shorter and the user can decide whether to follow the link if more explanation is needed. Furthermore, the concept can be referred to from multiple commentaries.

In order for the process to be as simple as possible, we needed a web-based editing system that didn’t require any particular piece of software on the user’s computer. We also didn’t want to spend significant time developing or customizing software.

The decision was made to use MediaWiki, the software behind Wikipedia, for the task. MediaWiki is a very robust and well developed piece of software with an active developer community. It’s also easy to extend with “hooks,” that is, small pieces of code that are configured to run in certain situations (such when a page is saved). Pages can be hyperlinked, and basic formatting such as headings and lists can be done using a simple markup language, normally referred to as “wikitext.”

As MediaWiki is based around the editing of pages, the commentary for each act in our system would take the form of a page, and the description for each legal concept would also constitute a page. In Swedish law, acts and ordinances are published in a collection called SFS (Svensk författningssamling). Each act or ordinance is given an identifying number: e.g., the public access to information and secrecy act is known as “SFS 2009:400”. In our free online system, this would correspond to a commentary page called SFS/2009:400.

Unlike Wikipedia, we did not want to use MediaWiki as the interface for the reader. The primary reason for this was that we wanted to present the statutory text and the commentary side-by-side. In order to do that, we needed to extract the text that we had edited using MediaWiki, split it up into commentaries for each individual section, and then weave the statutory text and the commentary text together. (Click here to see an example.)

The architecture of lagen.nu is such that no pages are ever created dynamically in response to a user’s request. Instead, everything is pre-rendered in the form of static HTML files on disc. This makes the site responsive even when running on modest hardware and under high loads. A hook was developed so that every time a wiki page in the main namespace was saved, a program to weave together that commentary text with the statutory text was run. The result was stored as a normal HTML file on disc.

Writing the program to do this weaving was not a trivial task. In particular, translating the minimalist wikitext markup into XHTML fragments proved to be difficult, and not all MediaWiki markup was supported. We also needed to do custom processing of the commentary text. Luckily, a library was found that parsed the MediaWiki markup and returned the equivalent XHTML markup. This library was extended in order to identify references to acts, sections, and legal cases present in the system. This enabled the commentary authors, who were not expected to learn advanced hyperlinking syntax, to just write normal legal prose, which was then linked automatically.

Using this setup, the editing process boiled down to these steps:

Find the act to be commented on on the main site, and click the “edit commentary” link.
This leads to a MediaWiki editing page — or, if the user is not logged in, to a login form.
The user creates or edits the commentary, saving occasionally.
When saving, the edit can be flagged as a “minor edit”. When this is the case, the weaving process is not run, enabling the user to save work that is still in progress without changing the contents of the main site.
When the user is satisfied (which does not necessarily mean that he/she is altogether done with writing the commentary, just that it is in such a state that it can be published on the main site), the page is saved and the weaving process is run.
The user can then check the main site and verify that the statutory text and the commentary text look OK side by side.

Social Aspects: Coordinating the Writing Effort

The plan was to get motivated law students to write the commentary for our free online system. We formed a group of 14 people, consisting mostly of law students, but including some practicing or retired lawyers. They were each given an assignment, normally consisting of a single act, and the instruction to write the best commentary on that act that they could fashion, within a 40 hour period. For some longer acts, this limit was extended to 60 or 80 hours. The authors retained copyright in their work, and were to receive attribution for it, but they agreed to license it under a Creative Commons license (specifically the CC-BY-SA license). They were also given a small monetary compensation for their work. The deadline for completing the writing assignments was set for October 1st.

At the start of summer 2009, we held an initial workshop where the motivation and impetus of the project were explained, and where people were given some hands-on instruction in how to write and edit commentary. For the rest of the summer, project members communicated using an email mailing list, together with occasional get-togethers on Saturdays at the central library in Stockholm.

During this time, the framework for writing, as well as the results, were constantly examined and improved. My roles were those of software developer (fine-tuning the weaving process), editor (suggesting improvements to individual commentaries as they were written), and project manager (gently reminding everyone of the impending October deadline).

It soon became apparent that everyone had his or her own idea of how commentaries should be written. In general, this diversity is a good thing — if everyone tries out their ideas, we can see what works and what doesn’t. However, if everyone is forced to invent their own wheels, progress is slow and resources that could be used for writing are wasted on other things.

Therefore, we wrote a style guide, containing basic guidelines with examples on how to write concisely and simply, how to refer and link to legal concepts and external resources, how to prioritize, whom to write for, and so on. A getting-started guide, and a shorter guide to MediaWiki markup, were also written.

During the writing period we were featured in the daily press, trade magazines and national radio. We were also nominated for a prize awarded by the Centre for Easy-to-Read.

Evaluation: Does It Work?

From July to September 2009, we commented on 17 acts, 1,200 individual sections, and 500 legal concepts, resulting in a total output of over 400 pages of text. A great amount of knowledge was created in an amazingly short time span. The quality assurance, which was done by me, proved to be an interesting challenge, as I’m not (as stated above) an expert on all of the acts for which commentaries were written. However, having taken basic classes on all the areas of law addressed by the commentaries, I was able to recognize most of the content of the commentaries (and to dig deeper if I found statements that seemed at odds with what I had learned). Overall, the quality of the commentaries was surprisingly high.

In conclusion, the commentary project has been a learning experience. Following the intensive activity of the summer and autumn, fewer commentaries have been added to the system. One reason for this might be that we no longer have the funds to pay authors. The monetary compensation was small but not insignificant. However, it seemed that the major incentives for authors were the opportunity to make law more accessible to others, as well as the chance to educate themselves by explaining the law to others. We have now enabled non-registered users to comment on the commentaries, using the Discussion namespace of MediaWiki. This way, we have identified and fixed many omissions and mistakes in the original commentaries.

One thing that seemed to work really well was the procedure of assigning work and following up on it. Writing a commentary on a long statute can be accomplished using free time over the course of a month or two. Setting a deadline, monitoring progress, and providing relevant feedback can serve as incentives for authors, too.

The commentaries produced by our project are written using best-efforts principles. There are no guarantees that the information in the commentaries is complete, updated, or even correct. However, having a principal author for each commentary gives that author an incentive to ensure the information in the commentary is as good as possible. Authors may use, and in fact have used, their commentaries as work samples. Even so, having someone other than the author read the text and provide feedback can improve the quality of a commentary as well.

With the commentary project, I feel we have proved that legal commentaries can be written in a crowdsourced way (even though we used monetary compensation to motivate our authors), and that wiki technology has the required capabilities, and is sufficiently user-friendly, for such an undertaking.

The next step for our commentaries project is to formalize and sustain an assignment and feedback workflow. This will require multiple project managers, and ideally some form of delegation of responsibility for different parts of the law. Further, I am confident that the project can be sustained using a voluntary framework. This remains to be proven, though.

Note: This project was supported by a grant from Internetfonden, for which we are eternally grateful.

Staffan Malmgren — the creator of the Swedish free access to law service lagen.nu — is a project manager at the Swedish Courts administration by day, working on coordinating the publishing of official legal information on the web, using structured and standardized formats (by night, he’s working on his final law school thesis, on the subject of jurisprudential relevance ranking in legal information systems).

Collaboration and Open Access to Law

Legal descriptive metadata, Legal information behavior, Legal semantic web, Peer production 9 Responses »

Mar 142010

One could wonder what Jeremy Bentham would have thought of the 4 billion people who are currently excluded from the rule of law. That number comes from a 2008 report of the United-Nations Development Program’s Commission on Legal Empowerment of the Poor (UNDP-CLEP). The British philosopher’s vision of “every man his own lawyer” certainly finds an echo in the UNDP-CLEP assertion that:

Empowering the poor through improved dissemination of legal information and formation of peer groups (self-help) are first-step strategies towards justice. Poor people may not receive the protection or opportunities to which they are legally entitled because they do not know the law or do not know how to go about securing the assistance of someone who can provide the necessary help. Modern information and communication technologies are particularly well suited to support interventions geared towards strengthening information-sharing groups, teaching the poor about their rights, and encouraging non-formal legal education. (p. 64)

Without actually saying it, the UNDP-CLEP seems to be describing what some call Web 2.0 or the participative web. In fact, these represent a series of tools and technologies, such as blogs, wikis, social networks, and content hosting and sharing sites, which allow many individuals to collaborate. Tools and communities are precisely what are required for a good old-fashioned barn raising–except, of course, for a green pasture on which to erect it !

This is where the global open access to law movement comes in. My goal, humbly submitted to the VoxPopuLII community for review, is to attempt to model how Web 2.0 and collaboration could be used to bring forth a greater understanding of the law in society, using the Canadian Legal Information Institute (CanLII) as a model and working with Daniel Poulin of the Université de Montréal’s LexUM. The full version of my findings can be found in my master’s thesis (en français) as well as in a paper submitted at the Law via the Internet Conference in November 2009 (my slides are hosted here).

I first set out to describe and define both Web 2.0 and collaboration. After a review of various theories and tools, I offer the following visual summary, which I call the Collaborative Document Management Framework (CDMF):

In this model, I posit that Agents — people like you and me as well as institutions or information services — and documents — such as court cases, newspaper articles and books — interact with each other in four generic relationships. Agents engage in conversations or exchanges, documents link or refer to each other, while agents consume or read documents, and documents are written by agents.

Manifestations of these relationships abound. For example, people exchange views via the comment or trackback function of blogs or through open letters in newspapers. A journalist covering a recent Supreme Court ruling is writing a document (her article) while linking to the ruling (although implicitly). Individuals accessing a statute in a public library are consuming the document.

Collaborative digital technologies offer the potential to explicitly keep track of all these interactions. In turn, these data could yield a great deal of insight on which bits of legal information are useful to whom. My goal was to explore how this could come about in a conceptual sense, hopefully offering some guidance by pointing out key success factors from a variety of examples. [Please note that this is in fact a theoretical exercise from my own personal point of view and does not reflect the position of the Canadian Legal Information Institute.]

Here are some ideas that could foster a vision of collaboration within the context of open access to law, using my own blogging experience as an example. I have always entertained an unhealthy fascination with the interaction of law and information. I revel in the paradoxes that arise from the tension between libel and freedom of expression, privacy and freedom of the press, government secrecy and the right of access to information, economic imperatives and fair or private copying …. This means that my personal consumption habits of legal information usually revolve around these topics, particularly discussing court cases and pointing out issues in the law. I currently use a simple blog for this function (CultureLibre.ca), and I know there are many of us out there, be they practicing lawyers, academics, laypeople or others.

Counting beans: developing a citation tool for legal documents

When I want to cite a paragraph of a court ruling, I simply copy-paste the snippet of text into my blog. This is a bit unfortunate as there is a lost opportunity. Imagine a system by which a user highlights a bit of text on an open access legal website, clicks on a button and is offered a bit of code to place in her blog (instead of the direct copy-pasting).

This bit of code, akin to what allows users repost YouTube videos, could create a box on the blog where the legal text is shown. If the statute were to be amended or the court decision reversed on appeal, the box could display a code (say, a red flag) alerting readers of the change. On the other side, the open access legal website would be able to keep track of every time this blog is read, providing a precise metric of use, useful for relevancy statistics.

Another powerful use of this mechanism would be to offer users of the open access legal website the links to the various blog postings when they hover over the quote in the original legal text. A simple voting system could offer a way for the community to regulate itself. Similarly, users could turn this feature on or off in their personal settings.

Let the data in: building better bibliographic tools

On my blog, I keep track of relevant newspaper or research articles as well as books or government reports. Compiling all this bibliographic data is a painstaking but rewarding process, as it saves me much time in the long run. If I could add this information via a web form into an open access legal information website, this process could be streamlined. After a while, one could find that some documents are already within the system, which would save time. As well, a diligent operator of an open access legal website could obtain bibliographic data in bulk either from vendors or from library catalogues.

In turn, users could be offered tools to link bibliographic data with legal documents, offering a richer browsing experience. Imagine having a list of recommended reading available directly from a specific statute or court ruling as described in the previous point, as certain commercial vendors are already doing. In fact, the real power would lie in the ability to link to a specific article in a law or a specific (short) passage in a court ruling. Creating anchors on the fly and linking back and forth would yield immense value.

People are suckers for popularity

The previous examples imply that one can register on such a website. This allows for a much richer browsing experience through personalized settings and interface customization. Most users are now accustomed to this and even expect it from contemporary websites, such as existing social networking websites. This provides the added benefit of determining the relative value of a single user’s contribution, based on how others vote.

Slashdot offers a similar mechanism, where some users are also reviewers and may agree or disagree with a comment. In fact, some legal social networking platforms already allow for this feature. What would be needed is a mechanism to link to snippets of legal text, perhaps as an add-on to legal social networking sites…

Its all about granular information

Computers are better at dealing with granular information, such as a date, or a piece of code. Textual information is better left for humans. That is why operators of open access legal websites should focus on identifying ways to trick users into giving them bits of data, not text, unless it is from a highly regarded member of the community or a trusted source (such as a judge). A simple agree/disagree vote is much better than a text box where one could write their objections. If people feel the urge to write, they should start their own blogs.

Insofar as the open access legal information website is concerned, there are only a few elements of value in this scenario — for example, when the blogger quotes a court case or statute. It would be hard to evaluate what is said without the intervention of a human operator. That is why the feedback mechanism between users (the agree/disagree button) is a useful way to allow for a very minimal amount of discussion while still being able to derive value from this process.

Parting words

Many of you are probably thinking of a flipside to what I propose. What happens if the information provided on blog-posts is wrong? What is the legal risk of blogging? What of legal advice from individuals who are not members of the Bar?

I have to admit that I am very concerned with these issues. I am tempted to say that an information-literate individual will apply due diligence to any information that is available — online or otherwise. As we all know, legal battles usually oppose two very crystallized visions of a rule or norm. One can find two sides to a story — or more. These debates add value to civil society, but one has to be aware of them. And that is not even getting into distinguishing strategies in common law!

The position I think most defensible in my proposed way forward is the following. In the practice of law, lawyers hold a monopoly in legal procedure as well as in giving legal advice within an individual case. In online communities, however, I would certainly hope that we, laypeople and jurists alike, may still discuss the law in general terms. There is a fine line to walk, but it is important for the legal community to recognize that the more people talk about the law, the more our society benefits from the Rule of Law, as Jeremy Bentham asserted.

As a side note, I would even wager that the need for legal services in a society is directly proportional to the number of legal topics discussed by laypeople in that society. But perhaps this is the subject of yet another post …

Olivier Charbonneau on www.culturelibre.ca In any case, I hope this paper will provide the opportunity to discuss some of the ideas that could bring digital collaboration to the open access to law movement. I invite you to also download my full paper for additional insight on this question. Also, I will be presenting some of these issues at the Legal IT conference in Montréal on April 26-27 2010. Perhaps this could be an opportunity to engage in a conversation about this issue?

Thanking you for your interest in my work,
Olivier Charbonneau
Associate Librarian, Concordia University (Montreal, Canada)
Doctoral candidate, Faculty of Law, Université de Montréal
www.culturelibre.ca
The author would like to thank his employer for supporting him in his research.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

Environmentally-Friendly Citations

commercial systems, Legal citation, Legal citations, Legal descriptive metadata, Legal informatics, Legal knowledge representation, Legal metadata, legal research, Standards 9 Responses »

Mar 012010

Today in Canada, nearly three quarters of citations to recent case law use the neutral citation – an industry-independent, open identifier assigned by courts to their decisions. When we call something a “game-changer” most people assume that it was invented by Apple. Yet even though the neutral citation was not, it definitely is a game-changer in the legal publishing business. Here are some thoughts about why.

Cited and Citing Cases

Legal publishing would be much simpler if cases did not cite other cases or all sorts of other legal documents. However, in that event the law would be far less intelligible. The citations between legal documents help establish a coherent body of law. The interpretation of cases and statutes in their surrounding context of citing and cited legal documents is crucial in legal practice. It is often considered prudent to wait until the courts have their say on a freshly enacted statute before relying blindly on it. And no lawyer would bring up a “dynamite” case in court before carefully checking to see how this case has been treated by other case law.

Indeed, all legal publishers try in various ways to exploit the relations between legal documents in order to stand out in the eyes of their customers. Many features of the electronic publishing systems are based in some way on the relations between documents – hyperlinking, note-up, lists of cases and statutes considered, related cases, judicial history and treatment, search results ranking based on popularity, etc. The use of citation data has defined legal publishing for many years. Any major change in how things are cited in law will continue to have a great impact on legal publishing in the future.

Originally, citation data was neither intended nor designed to yield itself easily to computer processing, let alone free online publishing.

The Chinese Walls Around Print Report Series

The problem is well known. Print report citations were not designed to function outside the context of the report series they belong to. For example, “301 D.L.R. (4th) 513” does not mean anything to you if you don’t have the Dominion Law Reports nearby. Even if you were lucky enough to have the series in your firm’s library, you could not safely cite a case by, for example, 301 D.L.R. (4th) 513 and expect all your readers to understand what you are saying unless you assume that all your readers have the Dominion Law Reports in their libraries. This issue is amplified by the number of print law reports. In Canada, there are 70 major law report series according to the Queen’s University Law Library. (Although some from the list may have disappeared since the list was last updated, the number of report series remains large.)

To cope with this reality, the legal publishing world came up with citators, and in particular, their ability to offer and make use of parallel citations. Citators can tell you, among other things, what possible identifiers (citations) have been assigned to a particular case. For example, the following list – [2009] 1 S.C.R. 181; 301 D.L.R. (4th) 513; [2009] 2 W.W.R. 385; 183 C.R.R. (2d) 1; 320 Sask. R. 305 – means that the case Ravndahl v. Saskatchewan can possibly be identified by any the citations included in the list.

Commercial Electronic Databases Are Not Better

Nobody will dispute the fact that, for all practical purposes, electronic sources are the research tool. Printed reports will wither and gradually disappear. Those that remain because of their official status will be used, not for research, but only as the recognized source of citable law. Many if not most legal researchers will even affirm that this is already the de facto situation.

It is worthwhile to analyze what will happen in that new electronic context. Citations will be based on database identifiers. In Canada, such citations will take the following form: “[1998] O.J. No. 2515 (QL)”. In some ways, such a citation is leading to the same old problems discussed earlier in this post: those of proprietary citations. However, if the inconvenience of having to check in a specific book to know what was cited was annoying, citations consisting of commercial database identifiers create a much more serious problem. To get the cited material, whereas in the print environment the researchers had to take the time to visit the library, in the digital environment they must subscribe to the commercial database. In the era of the Internet, any type of proprietary citations could seriously threaten the legal information system.

The Free Law Publisher’s Initial Approach

One way of dealing with Chinese Walls is to live with them, and use one’s wits to figure out what is on the other side.

In 2004, CanLII was striving to be recognized as a legal information product that could successfully serve the everyday research needs of legal professionals. The idea emerged to develop a citator in order to improve hyperlinking — and a series of other cool features — using the relations between legal documents.

Building a citator is an expensive operation. Here is a brief outline of the manual and automated methodology mix that was employed to build Reflex, CanLII’s citator.

a law library

1. An editor keys in information about all cases published in a particular report series, for example, the Dominion Law Reports. Such information includes the case name, the docket number, the issuing court, the date of the decision, a very short excerpt from the case, and the report citation.

2. This operation results in many records like the following one:

Record 1
Case name: Ravndahl v. Saskatchewan
Docket: 32225
Date: 2009-01-29
Court: SCC
Excerpt: The appellant lost…
Citation: 301 D.L.R. (4th) 513

3. Another editor keys in the same information about all cases published in another report series, for example, the Western Weekly Reports, producing records like this one:

Record 2
Case name: Ravndahl v. Saskatchewan
Docket: 32225
Date: 2009-01-29
Court: SCC
Excerpt: The appellant lost…
Citation: [2009] 2 W.W.R. 385

301 D.L.R. (4th) 513

[2009] 2 W.W.R. 385

5. The operation is repeated for 35 report series on an ongoing basis.

Of course, in practice the exercise is much more complicated, as the software has to deal with various degrees of similarities of metadata; for example, almost identical case names (R. v. Smith and The Queen v. Smith). The program can also encounter other misfortunes, such as the absence of docket numbers or dates.

With this approach, CanLII was able to expand significantly the breadth of hyperlinking within legal documents, and all features based on the hypertext, such as sorting search results by number of citations, providing lists of related cases, and a few more. Because Reflex is able to resolve citations to cases that are unavailable on CanLII, the number of citations was used also as an indicator as to what are the most important cases missing from CanLII, or, in other words, where to start from if we want to scan cases from paper and publish them on CanLII.

As one of the founding members of the free-access-to-law movement, CanLII may have revolutionized the way Canadian law was made accessible, but the print was still a ubiquitous part of our publishing routine.

Another Way of Dealing With the Chinese Walls…

… is to simply destroy them. Before CanLII’s arrival on the legal information playground, LexUM, in collaboration with representatives from the judiciary, law librarians, court staff, IT consultants, and several forward-looking individuals from the commercial publishing circles, had set up the Canadian Citation Committee. The CCC is an ad hoc group formed to support the standardization efforts of the Judges Technology Advisory Committee (JTAC) of the Canadian Judicial Council (CJC).

The CCC designed and promotes several documentary standards, among them the neutral citation. The neutral citation was proposed as a unique, industry-independent identifier, assigned to a case by the court. It is formed in a simple way: by the year of the case, an acronym for the issuing court, and a serial number. For example, 2009 SCC 7 designates the case of the Supreme Court of Canada Ravndahl v. Saskatchewan, released in 2009.

Simple, open, Internet-friendly, environmentally-caring, promising: such is the neutral citation.

Who’s on Board?

Courts have gradually been adopting the neutral citation, beginning in 1999 and continuing to the present. The first adopter of the neutral citation was the Superior Courts of British Columbia. Today, all 50 Canadian courts follow the neutral citation standard. The last one to join — just this year, in fact — was the Ontario Superior Court of Justice – the toughest jurisdiction in the country (from a judicial administration and legal publishing point of view) because of the complexity of its judicial structure. As a result, all 50,000 cases issued annually by Canadian appellate, superior, and trial courts now bear neutral citations that have been assigned by the courts. To that number, we must add the decisions rendered by at least two dozen administrative tribunals which have also adopted the standard.

Probably a more important question than “Who’s on board?” is: Why are those institutions on board? Before embracing a change, people often need at least one ideological reason and at least one practical reason. On philosophical (and economic) grounds, it certainly made sense for court decisions to be freed from proprietary citation schemes. From a practical point of view, the most convincing argument was the convenience for the court to have a unique designation of its own decision at the very moment the reasons of the decision are issued.

Are Lawyers and Judges Following?

If you read carefully the first paragraph of this post, you know that the answer is yes. Lawyers and judges do cite cases using the neutral citation. They use neutral citations much more frequently than one may think.

Let’s bring in some data. On CanLII, case citations are hyperlinked if the citation comes from one of the 35 reports covered by CanLII’s citator or if the citation is a neutral citation. This allows for a citation resolution success rate of about 80%. This means that 80% of case citations on CanLII are hyperlinked. The rest, many of which are citations to proprietary commercial databases, are not.

In this context, it was tempting to verify the portion of the links attributable to the neutral citation. Or in other words, what is the percentage of case citations that contain the neutral citation – alone or among other parallel citations?

So we examined two sets of citations. The first one contained 40,000 citations of cases released in 2006, 2007 and 2008. The second one included 41,000 citations of cases released in 2007, 2008 and 2009.

The count showed the following. In data set 1 (citations pointing to cases released in 2006, 2007 and 2008), 85% of hyperlinked citations are, or contain, a neutral citation. In data set 2 (citations to cases released in 2007, 2008 and 2009), the neutral citation accounts for 91% of the links.

Data Set #1
40,000 citations
Citing cases released in 2008
Cited cases released in 2006, 2007, 2008
Links based on neutral citations: 85%
Share of all citations that are or contain neutral citations: 68%

Data Set #2
41,000 citations
Citing cases released in 2009
Cited cases released in 2007, 2008, 2009
Links based on neutral citations: 91%
Share of all citations that are or contain neutral citations: 73%

Needless to say, both the numbers and the progression look exciting. This, of course, is not the last reason we need before sending the print reports sailing into history.

It is just one more.

Ivan Mokanov is Deputy Director of LexUM. He oversees LexUM’s publishing and development activities and supervises various consulting and research projects in Canada and abroad. As a member of LexUM’s Executive Committee, he participates in LexUM’s administration and business development. Ivan is a graduate from Sofia University (B.C.L.) and the University of Montreal (LL.M.), and he is currently enrolled at HEC Montreal (M.B.A).

Semantic Enhancement of Legal Information… Are We Up for the Challenge?

Cross-language legal information retrieval, information retrieval, knowledge management, Legal knowledge representation, Legal ontologies, Legal semantic web, Multilingual legal information retrieval, Semantic Web and law 8 Responses »

Feb 152010

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 80s, have not yet been left behind.

Similar claims may be found nowadays as, on the one hand, the amount of available unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. And, on the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.) have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

Nowadays, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EUROVOC), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Thus, the Semantic Web (including Linked Data efforts or the Web of Data) is envisaged as an extension of the current Web, which now also comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

Towards that shift, new languages and tools (ontologies) were needed to allow semantics to be added to the current Web, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts formalized as classes (e.g., “Actor”) are defined with axioms, enriched with the description of attributes or constraints (for example, “cardinality”), and linked to other classes through properties (e.g., “possesses” or “is_possessed_by”).

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake), in the sense that higher layers depend on lower layers (and the latter are inherited from the original Web). The languages include XML (eXtensible Markup Language), a superset of HTML usually used to add structure to documents, and the so-called ontology languages: RDF (Resource Description Framework), OWL, and OWL2 (Ontology Web Language). Recently, a specification to support the conversion of existing thesauri, taxonomies or subject headings into RDF has been released (the the SKOS, Simple Knowledge Organization System standard).

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

OpenCyc,
SUMO,
PROTON,
DOLCE,
the FRBRoo model (used in the above code and graph examples),
the RDF representation of Dublin Core,
the Gene Ontology,
the Wine Ontology, and
the SemanticBible.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts Blue Scene (the basis for the LKIF-Core Ontology). Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, cases, judicial proceedings, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of legal text mining and statistical analysis, in which ontologies are built by means of machine learning from legal texts; while others concentrate on the analysis of legal theories and related materials. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology validation.

Orange Scene In this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judicary to enhance search and retrieval capabilities of a Web-based frequentl- asked-question system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Current research is focused on overcoming these problems through the establishment of gold standards in concept extraction and ontology learning from texts, and the idea of collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently an important topic of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, in the legal domain there are taking place more particularized discussions: for example, the discussion of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, and the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor’s Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use everyday, so it’s in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Tom Gruber, Where the Social Web Meets the Semantic Web (video);
Sandro Hawke, How the Semantic Web Works;
Kevin Hemenway, The Semantic Web: 1-2-3;
Jim Hendler et al., Introduction to the Semantic Web (video);
Ivan Herman, Introduction to the Semantic Web;
Brian Lowe, Introduction to Ontologies: Adding Meaning to Metadata;
Marek Obitko, Introduction to Ontologies and Semantic Web;
Sean B. Palmer, The Semantic Web: An Introduction;
Ioana Robu et al., An Introduction to the Semantic Web for Health Sciences Librarians;
Barry Smith, Ontology: An Introduction: Video: How to Build an Ontology;
University of Manchester, CO-ODE, Tutorial: A Practical Introduction to Ontologies and OWL;
Dr. Adam Z. Wyner, Legal Ontologies Spin a Semantic Web.]

Dr. Núria Casellas is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School. She has participated in several national and European-funded research projects regarding the acquisition of knowledge in judicial settings (IURISERVICE), improving access to multimedia judicial content (E-Sentencias), on Drafting Legislation with Ontology-Based Support (DALOS), or in the Legal Case Study of the Semantically Enabled Knowledge Technologies (SEKT VI Framework project), among others. Her lines of investigation include: legal knowledge representation, legal ontologies, artificial intelligence and law, legal semantic web, law and technology, and bioethics.
She holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD in Public Law and Legal Philosophy (UAB). Her PhD thesis is entitled “Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

Confessions of a Legal Info-holic

digital law, Digital law libraries, digital libraries, india, information retrieval, liis, open source software 2 Responses »

Feb 012010

In an extraordinary story, Jorge Luis Borges writes of a “Total Library”, organized into ‘hexagons’ that supposedly contained all books:

When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. . . . At that time a great deal was said about the Vindications: books of apology and prophecy which . . . [contained] prodigious arcana for [the] future. Thousands of the greedy abandoned their sweet native hexagons and rushed up the stairways, urged on by the vain intention of finding their Vindication. These pilgrims disputed in the narrow corridors . . . strangled each other on the divine stairways . . . . Others went mad. . . . The Vindications exist . . . but the searchers did not remember that the possibility of a man’s finding his Vindication, or some treacherous variation thereof, can be computed as zero. As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible, seemed almost intolerable.

About three years ago I spent almost an entire sleepless month coding OpenJudis – my rather cool, “first-of-its-kind” free online database of Indian Supreme Court cases. The database hosts the full texts of about 25,000 cases decided since 1950. In this post I embark on a somewhat personal reflection on the process of creating OpenJudis – what I learnt about access to law (in India), and about “legal informatics,” along with some meditations on future pathways.

Having, by now, attended my share of FLOSS events, I know it is the invariable tendency of anyone who’s written two lines of free code to consider themselves qualified to pronounce on lofty themes – the nature of freedom and liberty, the commodity, scarcity, etc. With OpenJudis, likewise, I feel like I’ve acquired the necessary license to inflict my theory of the world on hapless readers – such as those at VoxPopuLII!

I begin this post by describing the circumstances under which I began coding OpenJudis. This is followed by some of my reflections on how “legal informatics” relates to and could relate to law.

Online Access to Law in India
India is privileged to have quite a robust ICT architecture. Internet access is relatively inexpensive, and the ubiquity of “cyber cafes” has resulted in extensive Internet penetration, even in the absence of individual subscriptions.

Government bodies at all levels are statutorily obliged to publish, on the Internet, vital information regarding their structure and functioning. The National Informatics Centre (NIC), a public sector corporation, is responsible for hosting, maintaining and updating the websites of government bodies across the country. These include, inter alia, the websites of the Union (federal) Government, the various state governments, union and state ministries, constitutional bodies such as the Election Commission and the Planning Commission, and regulatory bodies such as the Securities Exchange Board of India (SEBI). These websites typically host a wealth of useful information including, illustratively, the full texts of applicable legislations, subordinate legislations, administrative rulings, reports, census data, application forms etc.

The NIC has also been commissioned by the judiciary to develop websites for courts at various levels and publish decisions online. As a result, beginning in around the year 2000, the Supreme Court and various high courts have been publishing their decisions on their websites. The full texts of all Supreme Court decisions rendered since 1950 have been made available, which is an invaluable free resource for the public. Most High Court websites however, have not yet made archival material available online, so at present, access remains limited to decisions from the year 2000 onwards. More recently the NIC has begun setting up websites for subordinate courts, although this process is still at a very embryonic stage.

Apart from free government websites, a handful of commercial enterprises have been providing online access to legal materials. Among them, two deserve special mention. SCCOnline – a product of one of the leading law report publishers in India – provides access to the full texts of decisions of the Indian Supreme Court. The CD version of SCCOnline sells for about INR 70,000 (about US$1,500), which is around the same price the company charges for a full set of print volumes of its reporter. For an additional charge, the company offers updates to the database. The other major commercial venture in the field is Manupatra, which offers access to the full text of decisions of various courts and tribunals as well as the texts of legislation. Access is provided for a basic charge of about US$100, plus a charge of about US$1 per document downloaded. While seemingly modest by international standards, these charges are unaffordable by large sections of the legal profession and the lay public.

OpenJudis
In December 2006, I began coding OpenJudis. My reasons were purely selfish. While the full texts of the decisions of the Supreme Court were already available online for free, the search engine on the government website was unreliable and inadequate for (my) advanced research needs. The formatting of the text of cases themselves was untidy, and it was cumbersome to extract passages from them. Frequently, the website appeared overloaded with users, and alternate free sources were unavailable. I couldn’t afford any of the commercial databases. My own private dissatisfaction with the quality of service, coupled with (in retrospect) my completely naive optimism, led me to attempt OpenJudis. A third crucial factor on the input side was time, and a “room of my own,” which I could afford only because of a generous fellowship I had from the Open Society Institute.

I began rashly, by serially downloading the full texts of the 25,000 decisions on the Supreme Court website. Once that was done (it took about a week), I really had no notion of how to proceed. I remember being quite exhilarated by the sheer fact of being in possession of twenty five thousand Supreme Court decisions. I don’t think I can articulate the feeling very well. (I have some hope, however, that readers of this blog and my fellow LII-ers will intuitively understand this feeling.) Here I was, an average Joe poking around on the Internet, and just-like-that I now had an archive of 25,000 key documents of our republic, cumulatively representing the articulations of some of the finest (and some not-so-fine) legal minds of the previous half-century, sitting on my laptop. And I could do anything with them.

The word “archive,” incidentally, as Derrida informs us, derives from the Greek arkheion, the residence of the superior magistrates, the archons – those who commanded. The archons both “held and signified political power,” and were considered to possess the right to both “make and represent the law.” “Entrusted to such archons, these documents in effect speak the law: they recall the law and call on or impose the law”. Surely, or I am much mistaken, a very significant transformation has occurred when ordinary citizens become capable of housing archives – when citizens can assume the role of archons at will.

Giddy with power, I had an immediate impulse to find a way to transmit this feeling, to make it portable, to dissipate it – an impulse that will forever mystify economists wedded to “rational” incentive-based models of human behavior. I wasn’t a computer engineer, I didn’t have the foggiest idea how I’d go about it, but I was somehow going to host my own online free database of Indian Supreme Court cases. The audacity of this optimism bears out one of Yochai Benkler‘s insights about the changes wrought by the new “networked information economy” we inhabit. According to Benkler,

The belief that it is possible to make something valuable happen in the world, and the practice of actually acting on that belief, represent a qualitative improvement in the condition of individual freedom [because of NIE]. They mark the emergence of new practices of self-directed agency as a lived experience, going beyond mere formal permissibility and theoretical possibility.

Without my intending it, the archive itself suggested my next task. I had to clean up the text and extract metadata. This process occupied me for the longest time during the development of OpenJudis. I was very new to programming and had only just discovered the joys of Regular Expressions. More than my inexperience with programming techniques, however, it was the utter heterogeneity of reporting styles that took me a while to accustom myself to. Both opinion-writing and reporting styles had changed dramatically in the course of the fifty years my database covered, and this made it difficult to find patterns when extracting, say, the names of judges involved. Eventually, I had cleaned up the texts of the decisions and extracted an impressive (I thought) set of metadata, including the names of parties, the names of the judges, and the date the case was decided. To compensate for the absence of headnotes, I extracted names of statutes cited in the cases as a rough indicator of what their case might relate to. I did all this programming in PHP with the data housed in a MySQL database.

And then I encountered my first major roadblock that threatened to jeopardize the whole operation: I ran my first full-text Boolean search on the MySQL database and the results took a staggering 20 minutes to display. I was devastated! More elaborate searches took longer. Clearly, this was not a model I could host online. Or do anything useful with. Nobody in their right mind would want to wait 20 minutes for the results of their search. I had to look for a quicker database, or, as I eventually discovered, a super fast, lightweight indexing search engine. After a number of failed attempts with numerous free search engine software programs, none of which offered either the desired speed or the search capability I wanted, I was getting quite desperate. Fortunately, I discovered Swish-e, a lightweight, Perl-based Boolean search engine which was extremely fast and, most importantly, free – exactly what I needed. The final stage of creating the interface, uploading the database, and activating the search engine happened very quickly, and sometime in the early hours of December 22nd, 2006, OpenJudis went live. I sent announcement emails out to several e-groups and waited for the millions to show up at my doorstep.

They never did. After a week, I had maybe a hundred users. In a month, a few hundred. I received some very complimentary emails, which was nice, but it didn’t compensate for the failure of “millions” to show up. Over the next year, I added some improvements:
1) First, I built an automatic update feature that would periodically check the Supreme Court website for new cases and update the database on its own.
2) In October 2007, I coded a standalone MS Windows application of the database that could be installed on any system running Windows XP. This made sense in a country where PC penetration is higher than Internet penetration. The Windows application became quite popular and I received numerous requests for CDs from different corners of the country.
3) Around the same time, I also coded a similar application for decisions of the Central Information Commission – the apex statutory tribunal for adjudicating disputes under the Right to Information Act.
4) In February 2008, both applications were included in the DVD of Digit Magazine – a popular IT magazine in India.

Unfortunately, in August 2008, the Supreme Court website changed its design so that decisions could no longer be downloaded serially in the manner I had been accustomed to. One can only speculate about what prompted this change – since no improvements were made to the actual presentation of the cases. The only thing that changed was that one could no longer download cases serially as I’d been doing. The new format was far more difficult for me to “hack” and I abandoned the attempt. My work left me with no time to attempt to circumvent the new format.

Fortunately at the same time, an exciting new project called IndianKanoon was started by Sushant Sinha, an Indian computer science graduate at Michigan. In addition to decisions of the Supreme Court, his site covers several high courts and links up to the text of legislation of various kinds. Although I have not abandoned plans to develop OpenJudis, the presence of IndianKanoon has allowed me to step back entirely from this domain – secure in the knowledge that it is being taken forward by abler hands than mine.

Predictions, Observations, Conclusions
I’d like to end this already-too-long post with some reflections, randomly ordered, about legal information online.
1) I think one crucial area commonly neglected by most LIIs is client-side software that enables users to store local copies of entire databases. The urgency of this need is highlighted in the following hypothetical about digital libraries by Siva Vaidhyanathan (from The Anarchist in the Library):

So imagine this: An electronic journal is streamed into a library. A library never has it on its shelf, never owns a paper copy, can’t archive it for posterity. Its patrons can access the material and maybe print it, maybe not. But if the subscription runs out, if the library loses funding and has to cancel that subscription, or if the company itself goes out of business, all the material is gone. The library has no trace of what it bought: no record, no archive. It’s lost entirely.

It may be true that the Internet will be around for some time, but it might be worthwhile for LIIs to stop emulating the commercial database models of restricting control while enabling access. Only then can we begin to take seriously the task of empowering users into archons.

2) My second observation pertains to interface and usability. I have for long been planning to incorporate a set of features including tagging, highlighting, annotating, and bookmarking that I myself would most like to use. Additionally, I have been musing about using Web 2.0 to enable user-participation in maintenance and value-add operations – allowing users to proofread the text of judgments and to compose headnotes. At its most ambitious, in these “visions” of mine, OpenJudis looks like a combination of LII + social networking + Wikipedia.

A common objection to this model is that it would upset the authority of legal texts. In his brilliant essay A Brief History of the Internet from the 15th to the 18th century, the philosopher Lawrence Liang reminds us that the authority of knowledge that we today ascribe to printed text was contested for the longest period in modern history.

Far from ensuring fixity or authority, this early history of Printing was marked by uncertainty, and the constant refrain for a long time was that you could not rely on the book; a French scholar Adrien Baillet warned in 1685 that “the multitude of books which grows every day” would cast Europe into “a state as barbarous as that of the centuries that followed the fall of the Roman Empire.”

Europe’s non-descent into barbarism offers us a degree of comfort in dealing with Adrien Baillet-type arguments made in the context of legal information. The stability that we ascribe to law reports today is a relatively recent historical innovation that began in the mid-19th century. “Modern” law has longer roots than that.

3) While OpenJudis may look like quite a mammoth endeavor for one person, I was at all times intensely aware that this was by no means a solitary undertaking, and that I was “standing on the shoulders of giants.” They included the nameless thousands at the NIC who continue to design websites, scan and upload cases on the court websites – a Sisyphian task – and the thousands whose labor collectively produced the free software I used : Fedora Core 4, PHP, MySQL, Swish-E. And lastly, the nameless millions who toil to make the physical infrastructure of the Internet itself possible. Like the ground beneath our feet, we take it for granted, even as the tragic recent events in Haiti in recent weeks remind us to be more attentive. (For a truly Herculean endeavor, however, see Sushant Sinha’s IndianKanoon website, about which many ballads may be composed in the decades to come.)

It might be worthwhile for the custodians of LIIs to enable users to become derivative producers themselves, to engage in “practices of self-directed agency” as Benkler suggests. Without sounding immodest, I think the real story of OpenJudis is how the Internet makes it plausible and thinkable for average Joes like me (and better-than-average people like Sushant Sinha) to think of waging unilateral wars against publishing empires.

4) So, what is the impact that all this ubiquitous, instant, free electronic access to legal information is likely to have on the world of law? In a series of lectures titled “Archive Fever,” the philosopher Derrida posed a similar question in a somewhat different context: What would the discipline of psychoanalysis have looked like, he asked, if Sigmund Freud and his contemporaries had had access to computers, televisions, and email? In brief, his answer was that the discipline of psychoanalysis itself would not have been the same – it would have been transformed “from the bottom up” and its very events would have been altered. This is because, in Derrida’s view:

The archive . . . in general is not only the place for stocking and for conserving an archivable content of the past. . . . No, the technical structure of the archiving archive also determines the structure of the archivable content even in its coming into existence and in its relationship to the future. The archivization produces as much as it records the event.

The implication, following Derrida, is that in the past, law would not have been what it currently is if electronic archives had been possible. And the obverse is true as well: in the future, because of the Internet, “rule of law” will no longer observe the logic of the stable trajectories suggested by its classical “analog” commentators. New trajectories will have to be charted.

5) In the same book, Derrida describes a condition he calls “Archive fever”:

It is to burn with a passion. It is never to rest, interminably, from searching for the archive right where it slips away. It is to run after the archive even if there’s too much of it. It is to have a compulsive, repetitive and nostalgic desire for the archive, an irrepressible desire to return to the origin, a homesickness, a nostalgia for the return to the most archaic place of absolute commencement.

I don’t know about other readers of VoxPopulII (if indeed you’ve managed to continue reading this far!), but for the longest time during and after OpenJudis, I suffered distinctively from this malady. I downloaded indiscriminately whole sets of data that still sit unused on my computer, not having made it into OpenJudis. For those in a similar predicament, I offer Borges’s quote with which I began this text, as a reminder of the foolishness of the notion of “Total Libraries.”

Prashant Iyengar is a lawyer affiliated with the Alternative Law Forum, Bangalore, India. He is currently pursuing his graduate studies at Columbia University in New York. He runs OpenJudis, a free database of Indian Supreme Court cases.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

Preserving Born-Digital Legal Materials…Where to Start?

Digital law libraries, Law librarians and legal informatics, open source software, Standards 9 Responses »

Jan 102010

It’s tempting to begin any discussion of digital preservation and law libraries with a mind-blowing statistic. Something to drive home the fact that the clearly-defined world of information we’ve known since the invention of movable type has evolved into an ephemeral world of bits and bytes, that it’s expanding at a rate that makes it nearly impossible to contain, and that now is the time to invest in digital preservation efforts.

But, at this point, that’s an argument that you and I have already heard. As we begin the second decade of the 21st century, we know with certainty that the digital world is ubiquitous because we ourselves are part of it. Ours is a world where items posted on blogs are cited in landmark court decisions, a former governor and vice-presidential candidate posts her resignation speech and policy positions to Facebook, and a busy 21st-century president is attached at the thumb to his Blackberry.

We have experienced an exhilarating renaissance in information, which, as many have asserted for more than a decade, is threatening to become a digital dark age due to technology obsolescence and other factors. There is no denying the urgent need for libraries to take on the task of preserving our digital heritage. Law libraries specifically have a critically important role to play in this undertaking. Access to legal and law-related information is a core underpinning of our democratic society. Every law librarian knows this to be true. (I believe it’s what drew us to the profession in the first place.)

Frankly speaking, our current digital preservation strategies and systems are imperfect – and they most likely will never be perfected. That’s because digital preservation is a field that will be in a constant state of change and flux for as long as technology continues to progress. Yet, tremendous strides have been made over the past decade to stave off the dreaded digital dark age, and libraries today have a number of viable tools, services, and best practices at our disposal for the preservation of digital content.

Law libraries and the preservation of born-digital content

In 2008, Dana Neacsu, a law librarian at Columbia University Law School, and I decided to explore the extent to which law libraries were actively involved in the preservation of born-digital legal materials. So, we conducted a survey of digital preservation activity and attitudes among state and academic law libraries.

We found an interesting incongruity among our respondent population of library directors who represented 21 law libraries: less than 7 percent of the digital preservation projects being planned or underway at our respondents’ libraries involved the preservation of born-digital materials. The remaining 93 percent involved the preservation of digital files created through the digitization of print or tangible originals. Yet, by a margin of 2 to 1, our respondents expressed that they believed born-digital materials to be in more urgent need of preservation than print materials.

This finding raises an interesting question: If law librarians (at least those represented among our respondents) believe born-digital materials to be in more urgent need of preservation, why were the majority of digital preservation resources being invested in the preservation of files resulting from digitization projects?

I speculate that part of the problem is that we often don’t know where to start when it comes to preserving born-digital content. What needs to be preserved? What systems and formats should we use? How will we pay for it?

What needs to be preserved? A few thoughts…

Determining what needs to be preserved is not as complicated as it may seem. The mechanisms for content selection and collection development that are already in place at most law libraries lend themselves nicely to prioritizing materials for digital preservation, as I have learned through the Georgetown Law Library’s involvement in The Chesapeake Project Legal Information Archive. A collaborative effort between Georgetown and partners at the State Law Libraries of Maryland and Virginia, The Chesapeake Project was established to preserve born-digital legal information published online and available via open-access URLs (as opposed to within subscription databases).

So, how did we approach selection for the digital archive? Within a broad, shared project collection scope (limited to materials that were law- or policy-related, digitally born, and published to the “free Web” per our Collection Plan) each library simply established its own digital archive selection priorities, based on its unique institutional mandates and the research needs of its users. Libraries have historically developed their various print collections in a similar manner.

The Maryland State Library focused on collecting documents relating to public-policy and legal issues affecting Maryland citizens. The Virginia State Library collected the online publications of the Supreme Court of Virginia and other entities within Virginia’s judicial branch of government. As an academic library, the Georgetown Law Library developed topical and thematic collection priorities based on research and educational areas of interest at the Georgetown University Law Center. (Previously, online materials selected for the Georgetown Law Library’s collection had been printed from the Web on acid-free paper, bound, cataloged, and shelved. Digital preservation offered an attractive alternative to this system.)

To build our topical digital archive collections, the Georgetown Law Library assembled a team of staff subject specialists to select content (akin to our collection development selection committee), and, to make things as simple as possible, submissions were made and managed using a Delicious bookmark account, which allowed our busy subject specialists to submit online content for preservation with only a few clicks.

As a research library, we preserved information published to the free Web under a claim of fair use. Permission from copyright holders was sought only for items published either outside of the U.S. or by for-profit entities. Taking our cues from the Internet Archive, we determined to respect the robots.txt protocol in our Web harvesting activities and provide rights holders with instructions for requesting the removal of their content from the archive.

Fear of duplicating efforts

We have, on occasion, knowingly added digital materials to our archive collection that were already within the purview of other digital preservation programs. There is a fear of duplicating efforts when it comes to digital preservation, but there is also a strong argument to be made for multiple, geographically dispersed entities maintaining duplicate preserved copies of important digital resources.

This philosophy, especially as relates to duplicating the digital-preservation efforts of the Government Printing Office, is currently being echoed among several Federal Depository Libraries (and prominently by librarians who contribute to the Free Government Information blog) who are supporting the concept of digital deposit to maintain a truly distributed Federal Depository Library Program. Should there ever be a catastrophic failure at GPO, or even a temporary loss of access (such as that caused by the PURL server crash last August), user access to government documents would remain uninterrupted, thanks to this distributed preservation network. Currently there are 156 academic law libraries listed as selective depositories on the Federal Depository Library Directory; each of these would be candidates for digital deposit should the program come to fruition.

Libraries with perpetual access or post-cancellation access agreements with publishers may also find it worthwhile to invest in digital preservation activities that may be redundant. Some publishers offer easy post-cancellation access to purchased digital content via nonprofit initiatives such as Portico and LOCKSS, both of which function as digital preservation systems. Other publishers, however, may simply provide subscribers with a set of CDs or DVDs containing their purchased subscription content. In these cases, it is worthwhile to actively preserve these files within a locally managed digital archive to ensure long-term accessibility for library patrons, rather than relegating these valuable digital files, stored on an unstable optical medium, to languishing on a shelf.

Law reviews and legal scholarship

It has been suggested that academic law libraries take responsibility for the preservation of digital content cited within their institutions’ law reviews to ensure that future researchers will able to reference source materials even if they are no longer available at the cited URLs. While there aren’t specific figures relating to the problem of citation link rot in law reviews, research on Web citations appearing in scientific journals has shown that roughly 10 percent of these citations become inactive within 15 months of the citing article’s publication. When it comes to Web-published law and policy information, our own Chesapeake Project evaluation efforts have found that about 14 percent, or 1 out of every 7, Web-based items had disappeared from their original URLs within two years of being archived.

In the near future, we may find ourselves in the position of taking responsibility for the digital preservation of our law reviews themselves, given the call to action in the Durham Statement on Open Access to Legal Scholarship. After all, if law schools end print publication of journals and commit “to keep the electronic versions available in stable, open, digital formats” within open-access online repositories, there is an implicit mandate to ensure that those repositories offer digital preservation functionality, or that a separate dark digital preservation system be used in conjunction with the repository, to ensure long-term access to the digital journal content. (It is important to note that digital repository software and services do not necessarily feature standard digital preservation functionality.)

Speaking of digital repositories, the responsibility for establishing and maintaining institutional repositories most certainly falls to the law library, as does the responsibility for preserving the digital intellectual output of their law schools’ faculty, institutes, centers, and students (many of whom go on to impressive heights).

At the Georgetown Law Library, we’ve also taken on the task of preserving the intellectual output published to the Law Center’s Web sites.

The Preserv project has compiled an impressive bibliography on digital preservation aimed specifically at preservation services for institutional repositories (but also covering many of the larger issues in digital preservation), which is worth reviewing.

What systems and formats should we use?

Did I mention that our current digital preservation strategies and systems are imperfect? Well, it’s true. That’s the bad news. No matter which system or service you chose, you will surely encounter occasional glitches, endure system updates and migrations, and be forced to revise your processes and workflows from time to time. This is a fledgling, evolving field, and it’s up to us to grow and evolve along with it.

But, take heart! The good news is that there are standards and best practices established to guide us in developing strategies and selecting digital preservation systems, and we have multiple options to choose from. The key to embarking on a digital preservation project is to be versed in the language and standards of digital preservation, and to know what your options are.

The language and standards of digital preservation

I have heard a very convincing argument against standards in digital preservation: Because digital preservation is a new, evolving field, complying with rigid standards can be detrimental to systems that require a certain amount of adaptability in the face of emerging technological challenges. While I agree with this argument, I also believe that it is tremendously useful for those of us who are librarians, as opposed to programmers or IT specialists, to have standards as a starting point from which to identify and evaluate our options in digital preservation software and services.

There are a number of standards to be aware of in digital preservation. Chief among these is the Open Archival Information System (OAIS) Reference Model, which provides the central framework for most work in digital preservation. A basic question to ask when evaluating a digital preservation system or service is, “Does this system conform to the OAIS model?” If not, consider that a red flag.

The Trustworthy Repositories Audit & Certification Criteria and Checklist, or TRAC, is a digital repository evaluation tool currently being incorporated into an international standard for auditing and certifying digital archives. A small number of large repositories have undergone (or are undergoing) TRAC audits, including E-Depot at the Koninklijke Bibliotheek (National Library of the Netherlands), LOCKSS, Portico, and HathiTrust. This number can be expected to increase in the coming years.

The TRAC checklist is also a helpful resource to consult in conducting your own independent evaluations. Last year, for example, the libraries participating in The Chesapeake Project commissioned the Center for Research Libraries to conduct an assessment (as opposed to a formal audit) of our OCLC digital archive system based on TRAC criteria, which provided useful information to strengthen the project.

The PREMIS Data Dictionary provides a core set of preservation metadata elements to support the long-term preservation and future renderability of digital objects stored within a preservation system. The PREMIS working group has created resources and tools to support PREMIS implementation, available via the Library of Congress’s Web site. It is useful to consult the data dictionary when establishing local policy, and to ask about PREMIS compatibility when evaluating digital preservation options.

While we’re on the exciting topic of metadata, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, not to be confused with OAIS), is another protocol to watch for, especially if discovery and access are key components of your preservation initiative. OAI-PMH is a framework for sharing metadata between various “silos” of content. Essentially, the metadata of an OAI-PMH compliant system could be shared with and made discoverable via a single, federated search interface, allowing users to search the contents of multiple, distributed digital archives at the same time.

For an easy-to-read overview of digital preservation practices and standards, I recommend Priscilla Caplan’s The Preservation of Digital Materials, which appeared in the Feb./March 2008 issue of Library Technology Reports. There are also a few good online glossaries available to help decipher digital preservation jargon: the California Digital Library Glossary, the Internet Archives’ Glossary of Web Archiving Terms, and the Digital Preservation Coalition’s Definitions and Concepts.

Open source formats and software

Open source and open standard formats and software play a vital role in the lifecycle management of digital content. In the context of digital preservation, open-source formats, which make their source code and specifications freely available, facilitate the future development of tools that can assist in the migration of files to new formats as technology progresses and older formats become obsolete. PDF, for example, although developed originally as a proprietary format by Adobe Systems, became a published open standard in 2008, meaning that developers will have a foundation for making these files accessible in the future.

Other open source formats commonly used in digital preservation include the TIFF format for digital images, the ARC or WARC file for Web archiving, and the Extensible Markup Language (XML) text format for encoding data or document structure information. Microsoft formats, such as Word Documents, do not comply with open standards; the proprietary nature of these formats will inhibit future access to these documents when these formats become obsolete. The Library of Congress has a useful Web site devoted to digital formats and sustainability (including moving image and sound formats), which is worth reviewing.

Open source software is also looked upon favorably in digital preservation because, similar to open source formats, the software development and design process is made transparent, allowing current and future developers to develop new interfaces to or updates to the software over time.

Open source does not necessarily mean free-of-charge, and in fact, many service providers utilize open source software and open standards in developing fee-based or subscription digital preservation solutions.

Digital preservation solutions

There are many factors to consider in selecting a digital preservation solution. What is the nature of the content being preserved, and can the system accommodate it? Is preservation the sole purpose of the system — so that the system need include only a dark archive — or is a user access interface also necessary? How much does the system cost, and what are the expected ongoing maintenance costs, both in terms of budget and staff time? Is the system scalable, and can it accommodate a growing amount of content over time? This list could go on…

Keep in mind that no system will perfectly accommodate your needs. (Have I mentioned that digital preservation systems will always be imperfect?) And there is no use in waiting for the “perfect system” to be developed. We must use what’s available today. In selecting a system, consider its adherence to digital preservation standards, the stability of the institution or organization providing the solution, and the extent to which the digital preservation system has been accepted and adopted by institutions and user communities.

In a perfect world, perhaps every law library would implement a free, build-it-yourself, OAIS-compliant, open-source digital preservation solution with a large and supportive user community, such as DSpace or Fedora. These systems put full control in the hands of the libraries, which are the true custodians of the preserved digital content. But, in practice, our law libraries often do not have the staff and technological expertise to build and maintain an in-house digital preservation system.

As a result, several reputable library vendors and nonprofit organizations have developed fee-based digital preservation solutions, often built using open-source software. The Internet Archive offers the Archive-It service for the preservation of Web sites. The Stanford University-based LOCKSS program provides a decentralized preservation infrastructure for Web-based and other types of digital content, and the MetaArchive Cooperative provides a preservation repository service using the open-source LOCKSS software. The Ex Libris Digital Preservation System and the collaborative HathiTrust repository both support the preservation of digital objects.

For The Chesapeake Project, the Georgetown, Maryland State, and Virginia State Law Libraries use OCLC systems: the Digital Archive for preservation, coupled with a hosted instance of CONTENTdm as an access interface.

In our experience, working with a vendor that hosted our content at a secure offsite location and managed system updates and migrations allowed us to focus our energies on the administrative and organizational aspects of the project, rather than the ongoing management of the system itself. We were able to develop shared project documentation, including preferred file format and metadata policies, and conduct regular project evaluations. Moreover, because our project was collaborative, it worked to our advantage to enlist a third party to store all three libraries’ content, rather than place the burden of hosting the project’s content upon one single institution. In short, working with a vendor can actually benefit your project.

The ultimate question: How will we pay for it?

We still seem to be in the midst of a global economic recession that has impacted university and library budgets. Yet, despite budget stagnation, there has been a steady increase in the production of digital content.

Digital preservation can be expensive, and law library staff members with digital preservation expertise are few. The logical solution to these issues of budget and staff limitations is to seek out opportunities for collaboration, which would allow for the sharing of costs, resources, and expertise among participating institutions.

Collaborative opportunities exist with the Library of Congress, which has created a network of more than 130 preservation partners throughout the U.S., and the law library community is also in the process of establishing its own collaborative digital archive, the Legal Information Archive, to be offered through the Legal Information Preservation Alliance, or LIPA.

During the 2009 AALL annual meeting, LIPA’s executive director announced that The Chesapeake Project had become a LIPA-sanctioned project under the umbrella of the new Legal Information Archive. As a collaborative project with expenses shared by three law libraries, The Chesapeake Project’s costs are currently quite low compared to other annual library expenditures, such as those for subscription databases. These annual costs will decrease as more law libraries join this initiative.

I firmly believe that law libraries must invest in digital preservation if we are to remain relevant and true to our purpose in the 21st century. The core reason libraries exist is to build collections, to make those collections accessible, to assist patrons in using our collections, and to preserve our collections forever. No other institution has been created to take on this responsibility. Digital preservation represents an opportunity in the digital age for law libraries to reclaim their traditional roles as stewards of information, and to ensure that our digital legal heritage will be available to legal scholars and the public well into the future.

Sarah Rhodes is the digital collections librarian at the Georgetown Law Library in Washington, D.C., and a project coordinator for The Chesapeake Project Legal Information Archive, a digital preservation initiative of the Georgetown Law Library in collaboration with the State Law Libraries of Maryland and Virginia.

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Rob Richards.

Older Entries Newer Entries

Suffusion theme by Sayontan Sinha

Weaving the Legal Semantic Web with Natural Language Processing

jurMeta - New Metadata Initiative for Legal Documents

Argument Mapping and Storytelling in Criminal Cases

Collaboration and Open Access to Law

Environmentally-Friendly Citations

Semantic Enhancement of Legal Information… Are We Up for the Challenge?

Confessions of a Legal Info-holic

Preserving Born-Digital Legal Materials…Where to Start?

Law libraries and the preservation of born-digital content

What needs to be preserved? A few thoughts…

Fear of duplicating efforts

Law reviews and legal scholarship

What systems and formats should we use?

The language and standards of digital preservation

Open source formats and software

Digital preservation solutions

The ultimate question: How will we pay for it?

Recent Posts

VoxPop people and posts

Subscribe to VoxPopuLII

Blogroll

Law libraries and the preservation of born-digital content

What needs to be preserved? A few thoughts…

Fear of duplicating efforts

Law reviews and legal scholarship

What systems and formats should we use?

The language and standards of digital preservation

Open source formats and software

Digital preservation solutions

The ultimate question: How will we pay for it?

Recent Posts

VoxPop people and posts

Subscribe to VoxPopuLII

Blogroll

Tags