Cross-language legal information retrieval

Standardizing the World’s Legislative Information—One hackathon at a time

Annotation of legal texts, Cross-language legal information retrieval, digital law, Electronic government, elegislation, Legal metadata, Legal XML, Open Government Data, Semantic annotation of legal texts, Standards

Sep 17, 2012

As guest bloggers to this site, we have been asked to write about big ideas. We’ll get to those. But first, a note about hackathons.

Could legal hackathons be like this one day?

Hackathons used to be the exclusive domain of soda-and-coffee-guzzling, pizza-eating, all-night hacking, highly competitive computer programmers. The result of such a hackathon is often supposed to be a cool app (like the forerunner of Twitter) that is even cooler because it was built in the compressed schedule of the event. More recently, hackathons have been popping up in a variety of places, with some unexpected contexts and sponsors including the U.S. House of Representatives, NASA, Brooklyn Law School, New York City government, and others. These events serve as a way to prove (or build) the sponsor’s tech credentials and to cross-fertilize policy and technology expertise. There has been some handwringing and thoughtful commentary about the expansion of “civic” hackathons and what sustainable outcomes they produce.

As co-organizers, with Karen Suhaka, Greg Wilson, Charles Belle and others, of two legislation-focused, hackathon-inspired events (the California Law Hackathon and the International Legislation Un-hackathon), we can attest to their value in bringing engineers together with lawyers and policy folks. We can give some insights into the kinds of benefits these events have had in propelling efforts on legislative data standards, and into some of the advances that have taken place in the development of these standards over the last year.

Big Idea: Legislative Data Standards

And now to the big idea: to represent all the world’s legislation in a standard structured data format. That’s actually two big ideas: (1) putting legislation into a structured data format, and (2) designing that format so that it is compatible with the wide variety of laws and legislative document types worldwide.

There are reasons for doing these things: First, introducing structured data to legislation can make it possible to search and analyze the law with greater precision and efficiency. And second, having a common standard can permit more comprehensive bill-tracking and comparison between jurisdictions.

California Bill with Metadata

It also can make it possible for legislatures with small (and shrinking) budgets to benefit from some of the same bill drafting software that is being developed for much larger jurisdictions. (Full disclosure: Xcential has developed such software for more than ten years, including the drafting platform used by the State of California.)

In the age of Google, these ideas may not seem so big; in fact, they are a subset of Google’s far-reaching mission. However, legislation is a corner of the world’s information that Google has not yet addressed in a systematic way. And as regular readers of this blog know, legislation presents its own hurdles, technical and bureaucratic (not necessarily in that order), that make this both an interesting and a challenging problem. One of the challenges is that the kind of people who generally work with data (we’ll call them engineers) and the kind of people who generally work with legislation (we’ll call them lawyers and policy folks) don’t often work on data and legislation together. One of us, a lawyer and policy type, has made this point graphically (and somewhat hyperbolically) in a Quora response to a question about whether version control software could be used for legislation. That question, and a subsequent discussion generated in response to a blogpost by software engineer Abe Voelker about version control for legislation, drew in many engineers and some lawyers and policy folks.

For software engineers who consider such things, it is very attractive to think about treating legal text as if it were software code; we could automatically highlight and cross-validate key terms, run test cases, automate redlining and version control, etc. It would be easy to see what the state of the law was at any particular point in time, and to trace the series of amendments that got us into the mess we’re in today. This desire is often expressed as “What if we had a Github for legislation?” On the other hand, people who work closely with legislation (researching it, drafting it or developing information systems to deal with it) tend to see the many places where the analogy between computer code and legal code breaks down. Legal texts have been shaped over hundreds of years by technologically conservative institutions, using print-based systems.
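To make the redlining idea concrete, here is a minimal sketch using Python's standard difflib module. The function name redline and the sample sentences are invented for illustration. Real amendments are usually expressed as amendatory instructions rather than as two clean versions of a text, which is one of the places where the analogy with source code gets harder.

```python
import difflib

def redline(old_text: str, new_text: str) -> list[str]:
    """Return a word-level redline of two versions of a provision.

    Each item is prefixed with '- ' (removed), '+ ' (added) or '  ' (unchanged).
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # ndiff may also emit '? ' hint lines for near-matches; they are dropped here.
    return [d for d in difflib.ndiff(old_words, new_words) if not d.startswith("? ")]

# Hypothetical before and after versions of a single sentence.
old = "A person shall file the report within 30 days."
new = "A person shall file the report within 60 days."

for item in redline(old, new):
    print(item)
```

Splitting on whitespace keeps the sketch short; a production tool would more likely diff at the level of structural units such as sections and clauses, which is exactly what a structured data format makes possible.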

The full transformation of law to digital information is not going to happen overnight. While most law is already accessible in electronic format (often pdf), it is not encoded in a way that software engineers could start using their favorite text-munching tools. One of us, an engineer, has described this as the difference between computerization and automation. The move toward better digital tools for automating legislative drafting and research tasks will require more dialogue and working exchanges between engineers and the lawyers and policy folks.

That brings us back to hackathons.

What is a Legislative Hackathon?

Recognizing the need to bring lawyers and engineers together in order to implement our big idea(s), and appreciating the valuable bandwagon that hackathons have become, we decided to jump onboard. The first event we organized, the California Law Hackathon, was hosted just over a year ago, in September 2011, in Berkeley at the offices of Maplight, and in Denver by Karen Suhaka’s team at BillTrack50. The event focused on building web-based visualization tools to track the timeline of amendments to California legislation, and to link particular amendments, through their legislative sponsors, to particular donors or interest groups. We were joined remotely by a number of international participants, including John Sheridan, head of e-services for legislation.gov.uk, and a fellow guest contributor to this blog. As one participant noted, we learned a great deal at the event, including the limits placed on us by the existing data. Neither the legislative record, nor the donations databases are detailed enough to trace influence in politics in the way we hoped. This helped spark an interest in a more in-depth exploration of legislative data formats, and in particular how more and better metadata could be added to legislation.

That led to the International Legislation Un-hackathon, held simultaneously at UC Hastings, Stanford and Denver, with participants from the University of Bologna (Ravenna campus) and around the world. So assuming you can get engineers together with lawyers and policy folks, what do you do with them? We decided that we’d need a user-friendly tool that could be used to explore and add metadata to legislation from around the world. This could highlight a developing legislative XML standard, Akoma Ntoso (more about this standard soon), and give lawyer and policy types hands-on experience with the kinds of text and analysis tools that engineers take for granted.

Hacking With A Legislative Editor

So one of us (the engineer, naturally) started building a web-based editor for legislation, while the other (the lawyer, naturally) started organizing the next hackathon. Of course, thought the lawyer, it would just be a matter of time before all governments worldwide use such editors to draft their laws and regulations in a standard data format.

Legislative Editor at legalhacks.org

Advances in Legislative Data Standards Efforts

Akoma Ntoso

Akoma Ntoso (AkN) is a strong contender to be that format. Developed under the auspices of the UN Department of Economic and Social Affairs, AkN is an XML data structure that is meant to capture high-level forms and semantic ideas that are common to a broad variety of legal texts. OASIS, the folks who brought us the DocBook standards, among others, have convened a standards committee to create an official legislative data standard based on AkN. (More disclosure: the engineer is a member of this committee.) There’s just one problem. Few governments are using AkN to draft or store their legislation.
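For readers who have not seen structured legislation, the following is a rough sketch of the kind of nesting such a format uses, built with Python's standard xml.etree.ElementTree. The element names are modeled on AkN conventions, but the fragment is deliberately simplified: it omits the namespace, the metadata block and identifiers, and it is not claimed to validate against the AkN schema. The helper build_section and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_section(num: str, heading: str, text: str) -> ET.Element:
    """Build one <section> holding a number, a heading and a single paragraph."""
    section = ET.Element("section")
    ET.SubElement(section, "num").text = num
    ET.SubElement(section, "heading").text = heading
    content = ET.SubElement(section, "content")
    ET.SubElement(content, "p").text = text
    return section

# Simplified document skeleton: root element, document type, body.
root = ET.Element("akomaNtoso")
act = ET.SubElement(root, "act")
body = ET.SubElement(act, "body")
body.append(build_section("1", "Short title", "This Act may be cited as the Example Act."))

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Once the text is nested this way, a program can address "section 1, heading" directly instead of guessing at structure from typography.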

AkN itself is fast evolving, and with more exposure to legal data structures from different jurisdictions, the OASIS committee will be able to adapt AkN to better model those structures.

We saw the International Legislation Un-hackathon as a venue to kick off this process. It was conceived with Charles Belle of UC Hastings, as part of the Legal Hacks initiative. The event was held simultaneously at UC Hastings, Stanford, and in Denver. Jim Harper and Francis Avila of the Cato Institute came to the Hastings event. We also had many international participants. Key among them were Professors Monica Palmirani and Fabio Vitali of the University of Bologna, the architects and primary evangelists of AkN. Over the course of the day, participants learned about AkN and, importantly, got a chance to try it out, marking up documents of their choosing with the web editor. In the process, as expected, we found bugs in the software and bugs in the standard. We found structures in U.S. legislation that didn’t fit well with the existing AkN element set. We saw places where there was confusion in applying AkN’s data structures to documents. All of this information was collected to incorporate in the development of both the editor and AkN, underscoring again the importance of getting more practical exposure for both.

University of Bologna Summer School–Ravenna

And we are working to expand the venues for this kind of practical exposure to develop the AkN standard. Every September, the University of Bologna hosts the LEX Summer School in Ravenna, Italy. For them, it’s an opportunity to introduce Akoma Ntoso to new groups of students from around Europe and around the world. For the students, it’s an opportunity to learn about the application of XML to legislation, to see the success various groups are having around the world, and to meet interesting new people who share a passion for legal informatics. One of us, the engineer, who was a student two years ago, was invited to return last year to present a success story, and this year is returning once more to deliver a class in how to build and use the HTML5-based editor for drafting legislation in XML. For us, this is an opportunity to expose the editor to the European legal traditions so that we can better understand how our editor must evolve to fulfill our vision of a unified standard around the world with common, highly adaptable tools.

Chile National Library of Congress Browser-based editor

Another step toward adoption of legislative data standards is a project by Chile’s National Library of Congress (BCN in Spanish) called the “History of the Law” (Historia de la Ley). This ambitious project aims to bring together machine learning, a legislative editor and other features to mark up Chile’s legislative record and other legislative documents. The BCN has chosen Xcential’s browser-based editor, working with the AkN standard, to conduct the mark-up and correction after documents are passed through an automated parser. As with the hackathon, but on a larger scale, we are learning from experience the modifications that are needed to AkN, to make it work with Chile’s live documents. Excitingly, each mismatch we find between AkN and actual legislation can be fed back into the OASIS committee process, to make AkN able to handle a wider variety of real-world use cases.

Other Efforts and the Future of Legislative Data Standards

We see these steps as just the beginning. European governments are also flirting with legislative standards, and Karen Suhaka’s group at BillTrack50 has converted all U.S. bills from all states into a single standard XML format, showing that the technical hurdles can be overcome and demonstrating many of the practical benefits of doing so. In focusing on the projects (and hackathons) we are most closely involved with, we have certainly left out a lot of the initiatives that are advancing legislative data standards around the world. That’s what the comments are for. Let us know your experience with Akoma Ntoso as a legislative standard, and what you’re doing or interested in doing with AkN or other legislative data standards worldwide.

Grant Vergottini (the engineer) is a founder of Xcential. He is a leading authority on applications of XML data to legislation. Prior to founding Xcential, Grant was the Director of Applications at Chrystal Software, a company dedicated to XML design and reporting software. Before Chrystal, Grant led the redesign of Homestore.com, and founded Genedax Design Automation, which developed innovative team and data management applications for electronics design. Bringing data structures and automation tools to the legislative drafting process parallels the work that Grant did earlier in his career at Mentor Graphics and the Boeing Company, where he participated in the transformation from manual drafting to CAD software. Mr. Vergottini holds a Bachelor of Science in Electrical Engineering from Cleveland State University, where he graduated Summa Cum Laude.

Ari Hershowitz (the lawyer) is a consultant at Xcential, and founder of Tabulaw. Tabulaw develops software for lawyers, including a web-based legal research and writing platform. Prior to Tabulaw, Ari worked to protect wildlife and habitats from Chile to Mexico as Director of the BioGems project for Latin America at the Natural Resources Defense Council. Ari has a law degree from Georgetown University Law Center, a Masters in Computation and Neural Systems from Caltech, and a Bachelors in Molecular Biophysics & Biochemistry from Yale College.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

[Editor’s Note: For topic-related VoxPopuLII posts please see: Núria Casellas, Semantic Enhancement of legal information … Are we up for the challenge?; João Lima, et al., LexML Brazil Project; and Rinke Hoekstra, The MetaLex Document Server.]

Making a legal dictionary for an indigenous language: the Legal Maori Dictionary

Cross-language legal information retrieval, legal dictionaries

Jul 10, 2012

In 2010, an interesting observation was made about the linguistic identity of the New Zealand state. The observer was the Waitangi Tribunal of New Zealand, a permanently appointed commission of inquiry tasked with investigating claims of Crown breaches of the Treaty of Waitangi that may have caused prejudice to Māori. Of course the Treaty itself was signed by two distinct parties in 1840: the British Crown, and the representatives of Māori tribal groupings. In 1840 the linguistic, ethnic, and cultural identity of each grouping was simply not in doubt. But over the years the British Crown has devolved or morphed into the Crown in right of New Zealand, British settlers became Pākehā New Zealanders, and the Māori themselves have also changed irrevocably. So the Tribunal’s observation was interesting:

Fundamentally, there is a need for a mindset shift away from the pervasive assumption that the Crown is Pākehā, English-speaking, and distinct from Māori rather than representative of them. Increasingly, in the twenty-first century, the Crown is also Māori. If the nation is to move forward, this reality must be grasped.

In short, the Crown, in right of New Zealand, is not only Māori, but must also be Māori speaking. In view of New Zealand’s bicultural (and bilingual) legal history, this is not as merely ‘aspirational’ as might be presumed.

In early 2013, a new dictionary will be published in New Zealand. This dictionary will be a bilingual Māori-English language dictionary. Nothing unusual about that; there are quite a few Māori dictionaries about. Nor is the fact that this particular dictionary is a legal dictionary particularly strange; the world is well served with those, even with regard to New Zealand legal English. The Legal Māori Dictionary is relatively unusual, however, for combining these two characteristics. There are, as yet, not many indigenous language legal dictionaries, or indigenous legal language projects around the world. Of course, there are some fascinating indigenous legal language projects, such as the rich searchable collection of native Hawaiian legal documents available through the Ka Huli Ao Digital Archives under the auspices of the Ka Huli Ao Center for Excellence in Native Hawai`ian Law. An extensive Irish Language Legal Terminology derived from the bilingual Acts of the Irish parliament has also been made publicly available. In Australia, some exciting work has been done with identifying legal glossaries in a number of Aboriginal languages including Yolngu Matha and Murrinh-Patha from the Northern Territory. Not infrequently, such glossaries and terminologies are the result of dedicated workshops, often government funded, set up in order to create a functional lexicon for use in the state legal system by speakers of the target indigenous language, as in the case of the English-Inuktitut-French Legal Glossary released in 1997 by the Nunavut Translator/Interpreter program at Nunavut Arctic College. An earlier but similar project for the Navajo language was published in 1989 by the US District Court for the District of New Mexico, and is still made publicly available by the Judicial Branch of the Navajo Nation. A more recent example is the extensive Sámi legal terminology that has been worked up over recent years and made available online by translators and interpreters working on the translation of state legal documents into Sámi for Sámi-speaking populations of Norway, Finland, Denmark and Sweden.

So, we at the Legal Māori Project, and our Legal Māori Dictionary, are in good, if select, company. But every legal lexicography project has a unique whakapapa (genealogy) and characteristics that somehow reflect the lived histories of the people who belong to each language.

To briefly outline our whakapapa, then: the Legal Māori Project, established in 2008 in the Law Faculty of Victoria University of Wellington, seeks to achieve two primary aims:
• A long-term goal of normalizing the use of the Māori language within the New Zealand legal system and, ultimately, the public, civic sphere of New Zealand society. Māori must claim its place as an ordinary language of the enactment of state law, of government, administration, politics and the economy;
• A shorter-term aim of providing bilingual Māori speakers with a resource that can help them effectively and feasibly choose to use Māori rather than English in that legal system. Such ease of choice is critically important for effective language revitalisation.

The Legal Māori Project has received four years of public funding for our research from New Zealand’s Ministry of Science and Innovation. Rather than create a legal terminology from scratch, however, we thought it absolutely necessary to carry out a kind of textual excavation of the rich, but mainly hidden Māori-language documents of New Zealand’s bilingual and bicultural legal history. We were aware that there are several thousand pages of publicly available, printed, Māori language documents discussing, applying, translating, critiquing and interpreting Western legal concepts. These documents are available, but sequestered in public repositories such as the Alexander Turnbull Library. In the face of such a rich treasure trove of texts, we considered our best approach to be a corpus-based one. We would build a body of digitized Māori language texts that we could analyse to identify the kinds of words and phrases that Māori speakers and writers of the past 180 years had been using in those texts. By June 2011, the texts we found and, in crucial partnership with the New Zealand Electronic Text Centre, digitized, totaled 8 million word tokens, making this the largest purpose-built and structured corpus of Māori language texts known. The pre-1910 texts of the Legal Māori Corpus are publicly available for download, with the remainder of the texts to be made available by the end of this year. The Legal Māori Corpus contains printed texts of the following kinds of historical documents, most of which are also available online in the land title system. Some documents might be more accurately described as strategic documents issued by government departments in Māori, such as Māori language versions of statements of intent.

These documents taken as a whole provide an incredible opportunity to examine the evolution of an endangered language as it wrestles with the lexicon and conceptual world of the dominant language and that language’s culture. Therefore, the collated texts from the Corpus were examined to find how various words and phrases have been used to express Western legal ideas. Over the past two years we have been identifying those words and phrases; first, to come up with a useful lexicon of possible legal Māori terms, and then, to test and validate those lexicon terms in order to choose the terms that are now to form the base of the Legal Māori Dictionary itself. With the invaluable design, by Dave Moskovitz of ThinkTank Ltd, of an open-source, easy-to-use web-based text browser and dictionary writing system called Freelex, we are now compiling our dictionary entries.
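As a rough illustration of what a corpus-based first pass can look like, the sketch below counts word tokens across a handful of texts using only the Python standard library. It is not the project's actual tooling; the function names are hypothetical, and the two sample sentences are the usage examples quoted in the draft dictionary entry in this post. It also assumes macron vowels are stored as single precomposed characters.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lower-case a text and split it into word tokens."""
    # \w is Unicode-aware in Python 3, so precomposed macron vowels such as ā
    # stay inside their tokens instead of splitting words apart.
    return re.findall(r"\w+", text.lower())

def term_frequencies(texts: list[str]) -> Counter:
    """Count how often each token occurs across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Two sentences taken from the usage examples in the draft entry for taonga.
sample_texts = [
    "Ki te kitea kua kore te tangata e utu i ngā moni reti, e whai ture ana ki te hamene i a ia ki te muru i ōna taonga.",
    "Kua kitea te nui haere o ngā mahi o te koroni i runga i te maha o ngā taonga e utautaina ana ki tāwahi.",
]

frequencies = term_frequencies(sample_texts)
print(frequencies["taonga"], frequencies["ngā"])
```

Raw counts like these only suggest candidate terms. Deciding which candidates carry a legal sense still requires reading them in context, which is where the testing and validation described above come in.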

As mentioned above, our purpose has always been to create a dictionary of Māori language terms to express Western legal concepts. Customary Māori legal language had been explored in-depth in other scholarship. For example, customary Māori legal concepts have been investigated by the FRST funded work undertaken by Te Mātāhauāriki Institute based at Waikato University in developing a compendium of customary Māori legal terms: Te Mātāpunenga. Choosing to focus on the expression of Western legal ideas in Māori, however, exposed us to the considerable risk that English meanings and concepts would drive the content of our dictionary. Indeed we expected such English conceptual dominance. However, the pilot stages and subsequent corpus-based work showed that Māori customary legal vocabulary had a far stronger presence in the terms we were identifying than had been expected. In fact, many of the words in te reo Māori (the Māori language) that have been used to describe traditional Māori legal concepts are also terms within legal Māori terminology, communicating Western legal ideas. (Some examples are mana, roughly glossed as ‘authority’; tikanga, or the ‘correct way of doing things’; and rangatiratanga which can be equated to ‘chieftainship’.) The Legal Māori Project then must reflect two very important aspects of legal Māori vocabulary: customary legal meaning and Western legal meaning. A core set of customary legal terms that had acquired further Western legal senses over the past 180 years could in fact be identified within the lexicon of legal terms that were being derived from the corpus itself. In view of this insight, we decided that the idea of identifying a finite set of core customary legal terms could form part of a methodology that would enable Māori ideas and Māori legal thinking, alongside Western legal thinking, to take centre stage in our dictionary generation and formatting. 
The methodology used by the Legal Māori Project team is one that therefore pays careful attention to both the Western and customary law aspects of a significant, identifiable core of traditional Māori law terms. The team identified that if customary legal and western legal aspects of core terms are accounted for in the selection, formatting, and organisation of the dictionary entries, English glosses and English ideas are less likely to subvert Māori ideas and the Māori language basis of the dictionary as a whole. To provide a practical example of how we attempted to incorporate such prioritization in the design of the Legal Māori Dictionary, the following draft entry for taonga might be useful. It comes from the sample dictionary released in June 2010.

	taonga
	The customary usage of taonga refers to property or anything highly prized. The giving and receiving of taonga was an important part of recording and maintaining reciprocal relationships between groups. @TM Taonga
	1n <cust> valued property [K]i te kitea kua kore te tangata e utu i ngā moni reti, e whai ture ana ki te hamene i a ia ki te muru i ōna taonga[.] @S241886
	2n goods Kua kitea te nui haere o ngā mahi o te koroni i runga i te maha o ngā taonga e utautaina ana ki tāwahi[.] @S241891
	☼ Usually used in the context of personal property, but sometimes also used to refer to real property or goods traded on a commercial scale.

Many typical dictionary elements have been used in this draft entry. For example, distinct senses have been identified and numbered. The grammatical function of each sense is identified, and the primary usage (here referring to taonga being primarily a customary term) identified. It also includes a short English gloss for each sense and some further explanation in English of how the term is used in a technical way (preceded by ☼). Finally, the entry includes a usage example for each sense and short code references for each example, which will enable the user to find the original text. The opening sentence at the top of the entry will be shaded in its final printed form, and will thereby be a new addition to the formatting of our dictionary articles. We have labeled this feature the whakamaramatanga (‘clarification’) field, where a very brief explanation is given of the all-important customary context for the term with a reference to further reading for those readers wanting to find out more about the concept. The reference is to the Te Mātāpunenga compendium (to be published at roughly the same time as the Legal Māori Dictionary). These small additions to the traditional dictionary entry must be taken in conjunction with all the work carried out by the Legal Māori Project to date. Ultimately we hope that our experience in designing and producing our outputs, including the dictionary, might assist the designers of other specialist dictionaries or lexicons of indigenous languages to pay appropriate deference to the customary concepts of those languages, where possible and practicable.
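The parts of the entry described above can be pictured as a small data structure. The sketch below is one possible way to model them in Python; the class and field names are our own invention for illustration and do not describe how Freelex actually stores entries.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    number: int
    part_of_speech: str       # e.g. "n" for noun
    gloss: str                # short English gloss
    example: str              # usage example drawn from the corpus
    source_code: str          # short code pointing back to the original text
    customary: bool = False   # True when the sense is marked <cust>

@dataclass
class Entry:
    headword: str
    whakamaramatanga: str     # brief note on the customary context of the term
    further_reading: str      # reference for readers who want more detail
    senses: list[Sense] = field(default_factory=list)
    usage_note: str = ""      # the technical usage note preceded by ☼

    def glosses(self) -> list[str]:
        """Return the numbered glosses, e.g. '1n valued property'."""
        return [f"{s.number}{s.part_of_speech} {s.gloss}" for s in self.senses]

# The draft entry for taonga, with the example sentences abbreviated.
taonga = Entry(
    headword="taonga",
    whakamaramatanga="The customary usage of taonga refers to property or anything highly prized.",
    further_reading="@TM Taonga",
    senses=[
        Sense(1, "n", "valued property", "... ki te muru i ōna taonga.", "@S241886", customary=True),
        Sense(2, "n", "goods", "... ngā taonga e utautaina ana ki tāwahi.", "@S241891"),
    ],
    usage_note="Usually used in the context of personal property.",
)

print(taonga.glosses())
```

Keeping the customary-context note as a field on the entry itself, rather than on an individual sense, mirrors its placement at the top of the printed article.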

And, above all, just maybe our work will help Māori speakers to choose to use their own language in precisely those domains where they are simply not expected to, or in the view of some, supposed to. And when that happens, a Māori-speaking Crown doesn’t seem so difficult after all.

Thanks to Māori.org.nz for the Māori images used here.

After some years working in the New Zealand Department of Corrections and Māori broadcasting, Māmari completed an MA (Distinction) in Classical Studies, BA (Hons), and an LLB (Hons) at Victoria University. She then spent three and a half years at New Zealand’s largest law firm, Russell McVeagh, in Wellington, working in the Māori legal team in the Corporate Advisory Group. Māmari has been with the School of Law since January 2006 and, with Assistant Professor Mary Boyce of the University of Hawai’i, runs the Legal Māori Project. Her primary research interests are law and language, Māori and the New Zealand legal system, and social security law. Māmari is married to Maynard Gilgen and has two sons, Te Rangihuia (9) and Havelund (5), and a daughter, Jessica-Lee Ngātaiotehauauru, born in November 2009.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.

Semantic Enhancement of Legal Information… Are We Up for the Challenge? [Revised Repost]

Cross-language legal information retrieval, information retrieval, knowledge management, Legal knowledge representation, Legal ontologies, Legal semantic web, Linked Data, Linked Data and law, Multilingual legal information retrieval, Semantic Web and law

Jan 18, 2011

[Editor’s Note: We are republishing here, with some corrections, a post by Dr. Núria Casellas that appeared earlier on VoxPopuLII.]

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 80s, have not yet been left behind.

Similar claims may be found nowadays as, on the one hand, the amount of available unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. And, on the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.) have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

For example, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (which we believe to be contained in the document we are searching for), perhaps combined with Boolean operators, or supported by a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EuroVoc), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.
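The difference between purely syntactic matching and a synonym-enhanced search can be illustrated with a minimal Python sketch. The three documents and the synonym table are invented for illustration only; a real system would draw its synonyms from a thesaurus rather than from a hand-written dictionary.

```python
# Toy corpus: document id -> text (hypothetical examples).
DOCUMENTS = {
    "doc1": "The tenant must pay rent to the landlord each month.",
    "doc2": "The lessee shall compensate the lessor for any damage.",
    "doc3": "Copyright protects original works of authorship.",
}

# Hypothetical synonym table; a real system would use a thesaurus.
SYNONYMS = {
    "tenant": {"tenant", "lessee"},
    "landlord": {"landlord", "lessor"},
}


def tokenize(text):
    """Lowercase the text and split it into a set of bare word tokens."""
    return {word.strip(".,;:").lower() for word in text.split()}


def keyword_search(term):
    """Syntactic matching: return documents that contain the exact term."""
    return sorted(doc_id for doc_id, text in DOCUMENTS.items()
                  if term.lower() in tokenize(text))


def expanded_search(term):
    """Expand the term with its listed synonyms before matching."""
    variants = SYNONYMS.get(term.lower(), {term.lower()})
    return sorted(doc_id for doc_id, text in DOCUMENTS.items()
                  if variants & tokenize(text))


print(keyword_search("tenant"))   # exact term only
print(expanded_search("tenant"))  # also finds the "lessee" document
```

The keyword search misses the second document although it concerns the same legal relationship. The expanded search finds it, but only because someone listed the synonym beforehand; neither function has any notion of what a tenant is.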

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Thus, the Semantic Web is envisaged as an extension of the current Web, which now comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

These efforts also include the Web of Data (or Linked Data), which relies on the existence of standard formats (URIs, HTTP and RDF) to allow the access and query of interrelated datasets, which may be granted through a SPARQL endpoint (e.g., Govtrack.us, US census data, etc.). Sharing and connecting data on the Web in compliance with the Linked Data principles enables the exploitation of content from different Web data sources with the development of search, browse, and other mashup applications. (See the Linking Open Data cloud diagram by Cyganiak and Jentzsch below.) [Editor’s Note: Legislation.gov.uk also applies Linked Data principles to legal information, as John Sheridan explains in his recent post.]
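To give a feel for what querying interrelated datasets means, the following Python sketch joins two small sets of triples through a shared URI. It is a toy stand-in for the kind of graph pattern that a SPARQL endpoint evaluates, not a SPARQL implementation; every URI under example.org and both datasets are hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace

# Two separately published datasets that share URIs for people.
LEGISLATION = [
    (EX + "bill/1", EX + "vocab/sponsor", EX + "person/a"),
    (EX + "bill/2", EX + "vocab/sponsor", EX + "person/b"),
]
PEOPLE = [
    (EX + "person/a", EX + "vocab/represents", EX + "district/ny-1"),
    (EX + "person/b", EX + "vocab/represents", EX + "district/ca-2"),
]


def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; terms starting with '?' are variables."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return new


def query(patterns, triples):
    """Evaluate a list of triple patterns by joining them one after another."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [extended
                     for bindings in solutions
                     for triple in triples
                     if (extended := match(pattern, triple, bindings)) is not None]
    return solutions


# Which bills are sponsored by the person representing district ny-1?
results = query(
    [("?bill", EX + "vocab/sponsor", "?person"),
     ("?person", EX + "vocab/represents", EX + "district/ny-1")],
    LEGISLATION + PEOPLE,
)
print([solution["?bill"] for solution in results])
```

The join works only because both datasets use the same URI for the same person, which is the point of the Linked Data principles.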

Thus, to allow semantics to be added to the current Web, new languages and tools (ontologies) were needed, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts are formalized as classes and defined with axioms, enriched with the description of attributes or constraints, and properties.

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake). In this stack, higher layers depend on lower layers (and the latter are inherited from the original Web). These languages include XML (eXtensible Markup Language), a general-purpose markup language usually used to add structure to documents, and the so-called ontology languages: RDF/RDFS (Resource Description Framework/Schema), OWL, and OWL 2 (Web Ontology Language). While the RDF language offers simple descriptive information about the resources on the Web, encoded in sets of triples of subject (a resource), predicate (a property or relation), and object (a resource or a value), RDFS allows the description of classes of resources and of hierarchies among classes and properties. OWL offers an even more expressive language to define structured ontologies (e.g., class disjointness, union or equivalence, etc.).
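As a rough illustration of the layering, the Python sketch below stores a handful of subject, predicate, object triples and derives the classes of a resource by following rdfs:subClassOf links. The ex: names are hypothetical, and the two functions cover only a small fragment of what an RDFS reasoner does.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Each triple is (subject, predicate, object); the ex: terms are hypothetical.
TRIPLES = {
    ("ex:Statute", SUBCLASS, "ex:LegalDocument"),
    ("ex:LegalDocument", SUBCLASS, "ex:Document"),
    ("ex:Act42", RDF_TYPE, "ex:Statute"),
    ("ex:Act42", "dc:title", "An Act on Data Protection"),
}


def superclasses(cls, triples):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(resource, triples):
    """Return the asserted classes of a resource plus those inherited from superclasses."""
    asserted = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return inferred


print(sorted(types_of("ex:Act42", TRIPLES)))
```

Only one type is stated for ex:Act42; the other two follow from the class hierarchy. A search for legal documents can therefore return a statute even though no triple says so directly.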

Moreover, a specification to support the conversion of existing thesauri, taxonomies or subject headings into RDF triples has recently been published: the SKOS, Simple Knowledge Organization System standard. These specifications may be exploited in Linked Data efforts, such as the New York Times vocabularies. Also, EuroVoc, the multilingual thesaurus for activities of the EU is, for example, now available in this format.
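The kind of conversion that SKOS supports might look like the following Python sketch, which turns two thesaurus entries into triples using the SKOS properties skos:prefLabel and skos:broader. The entries, their identifiers, and the example.org base URI are invented; they are not taken from EuroVoc or any other published vocabulary.

```python
BASE = "http://example.org/thesaurus/"  # hypothetical base URI

# Hypothetical thesaurus entries with labels in two languages.
THESAURUS = [
    {"id": "c1", "labels": {"en": "contract", "es": "contrato"}, "broader": None},
    {"id": "c2", "labels": {"en": "lease", "es": "arrendamiento"}, "broader": "c1"},
]


def to_skos(entries):
    """Convert thesaurus entries into (subject, predicate, object) triples."""
    triples = []
    for entry in entries:
        concept = BASE + entry["id"]
        triples.append((concept, "rdf:type", "skos:Concept"))
        for lang, label in sorted(entry["labels"].items()):
            # A language-tagged literal is modelled here as a (text, lang) pair.
            triples.append((concept, "skos:prefLabel", (label, lang)))
        if entry["broader"]:
            triples.append((concept, "skos:broader", BASE + entry["broader"]))
    return triples


def translate(label, source_lang, target_lang, triples):
    """Find the concept carrying a label and return its label in another language."""
    for s, p, o in triples:
        if p == "skos:prefLabel" and o == (label, source_lang):
            for s2, p2, o2 in triples:
                if s2 == s and p2 == "skos:prefLabel" and o2[1] == target_lang:
                    return o2[0]
    return None


skos_triples = to_skos(THESAURUS)
print(translate("lease", "en", "es", skos_triples))
```

Because labels in several languages hang off a single concept, a query typed in one language can be mapped to the concept and then to the terms used in another, which is the basic move behind thesaurus-based multilingual retrieval.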

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

OpenCyc: an open source version of the Cyc general ontology;
SUMO: the Suggested Upper Merged Ontology;
the upper ontologies PROTON (PROTo Ontology) and DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering);
the FRBRoo model (which represents bibliographic information);
the RDF representation of Dublin Core;
the Gene Ontology;
the FOAF (Friend of a Friend) ontology.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LKIF-Core Ontology, the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts. Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).
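As a very rough sketch of how such concepts might be encoded, the Python fragment below represents a norm as a deontic modality attached to an addressee and an action. The class names, field names, and sample norms are hypothetical and are not taken from any of the ontologies named above, which are far richer.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    """The three deontic operators mentioned in the text."""
    OBLIGATION = "obligation"
    PROHIBITION = "prohibition"
    PERMISSION = "permission"


@dataclass(frozen=True)
class Norm:
    modality: Modality
    addressee: str  # hypothetical role name standing in for a kind of legal person
    action: str     # hypothetical action name


# Invented sample norms, for illustration only.
NORMS = [
    Norm(Modality.OBLIGATION, "data_controller", "notify_breach"),
    Norm(Modality.PROHIBITION, "data_controller", "sell_personal_data"),
    Norm(Modality.PERMISSION, "data_subject", "request_access"),
]


def modalities_for(addressee, action, norms):
    """Return the deontic modalities that apply to an addressee performing an action."""
    return {n.modality for n in norms
            if n.addressee == addressee and n.action == action}


print(modalities_for("data_controller", "notify_breach", NORMS))
```

This is a plain lookup with no reasoning about conflicts or exceptions between norms, which is where formal legal ontologies and rule languages do their real work.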

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, privacy compliance, patents, cases (e.g., Legal Case OWL Ontology), judicial proceedings, legal systems, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of text mining techniques towards ontology learning from legal texts; while others concentrate on the analysis of legal theories and related materials to extract and formalize legal concepts. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology development and validation.

In this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judiciary to enhance search and retrieval capabilities of a Web-based frequently-asked-question system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Current research is focused on overcoming these problems through the establishment of gold standards in concept extraction and ontology learning from texts, and the idea of collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently an important topic of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, more particularized discussions are taking place in the legal domain: for example, the discussion of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, and the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, legal information and Linked Data possibilities, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor’s Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use every day, so it’s in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Tom Gruber, Where the Social Web Meets the Semantic Web (video);
Sandro Hawke, How the Semantic Web Works;
Kevin Hemenway, The Semantic Web: 1-2-3;
Jim Hendler et al., Introduction to the Semantic Web (video);
Ivan Herman, Introduction to the Semantic Web;
Brian Lowe, Introduction to Ontologies: Adding Meaning to Metadata;
Marek Obitko, Introduction to Ontologies and Semantic Web;
Sean B. Palmer, The Semantic Web: An Introduction;
Ioana Robu et al., An Introduction to the Semantic Web for Health Sciences Librarians;
Barry Smith, Ontology: An Introduction: Video: How to Build an Ontology;
University of Manchester, CO-ODE, Tutorial: A Practical Introduction to Ontologies and OWL;
Dr. Adam Z. Wyner, Legal Ontologies Spin a Semantic Web.]

Dr. Núria Casellas is a visiting researcher at the Legal Information Institute at Cornell University. She is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School (on leave). She has participated in several national and European-funded research projects regarding legal ontologies and legal knowledge management: these concern the acquisition of knowledge in judicial settings (IURISERVICE), modeling privacy compliance regulations (NEURONA), drafting legislation (DALOS), and the Legal Case Study of the Semantically Enabled Knowledge Technologies (SEKT VI Framework project), among others. Co-editor of the IDT Series, she holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD (“Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”).

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Robert Richards.

Semantic Enhancement of Legal Information… Are We Up for the Challenge?

Cross-language legal information retrieval, information retrieval, knowledge management, Legal knowledge representation, Legal ontologies, Legal semantic web, Multilingual legal information retrieval, Semantic Web and law 8 Responses »

Feb 15 2010

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 80s, have not yet been left behind.

Similar claims may be found nowadays as, on the one hand, the amount of available unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. And, on the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.) have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

Nowadays, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EUROVOC), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Thus, the Semantic Web (including Linked Data efforts or the Web of Data) is envisaged as an extension of the current Web, which now also comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

Towards that shift, new languages and tools (ontologies) were needed to allow semantics to be added to the current Web, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts formalized as classes (e.g., “Actor”) are defined with axioms, enriched with the description of attributes or constraints (for example, “cardinality”), and linked to other classes through properties (e.g., “possesses” or “is_possessed_by”).
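The elements just listed (classes, a cardinality constraint, and properties linking classes) can be sketched in a few lines of Python. The names "possesses" and "is_possessed_by" come from the text; the rule that an asset has at most one possessor, and the choice to reject a violation with an error, are assumptions made for illustration. An OWL reasoner works under an open-world assumption and would treat such a constraint differently, for example by inferring that two possessors denote the same individual.

```python
# Each property is paired with its inverse.
INVERSE = {"possesses": "is_possessed_by", "is_possessed_by": "possesses"}

# Hypothetical constraint: a thing is possessed by at most one actor.
MAX_CARDINALITY = {"is_possessed_by": 1}


class Graph:
    """A tiny store of (subject, property, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, prop, obj):
        """Assert a statement and its inverse, enforcing maximum cardinality."""
        new = {(subject, prop, obj)}
        if prop in INVERSE:
            new.add((obj, INVERSE[prop], subject))
        for s, p, o in new:
            limit = MAX_CARDINALITY.get(p)
            values = {x for (s2, p2, x) in self.triples if s2 == s and p2 == p}
            if limit is not None and len(values | {o}) > limit:
                raise ValueError(f"{s} would exceed cardinality {limit} for {p}")
        self.triples |= new


graph = Graph()
graph.add("actor1", "possesses", "asset1")
print(("asset1", "is_possessed_by", "actor1") in graph.triples)
```

Asserting that a second actor possesses the same asset raises a ValueError in this closed-world simplification.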

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake), in the sense that higher layers depend on lower layers (and the latter are inherited from the original Web). The languages include XML (eXtensible Markup Language), a general-purpose markup language usually used to add structure to documents, and the so-called ontology languages: RDF (Resource Description Framework), OWL, and OWL 2 (Web Ontology Language). Recently, a specification to support the conversion of existing thesauri, taxonomies or subject headings into RDF has been released (the SKOS, Simple Knowledge Organization System standard).

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

OpenCyc,
SUMO,
PROTON,
DOLCE,
the FRBRoo model (used in the above code and graph examples),
the RDF representation of Dublin Core,
the Gene Ontology,
the Wine Ontology, and
the SemanticBible.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts (the basis for the LKIF-Core Ontology). Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, cases, judicial proceedings, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of legal text mining and statistical analysis, in which ontologies are built by means of machine learning from legal texts; while others concentrate on the analysis of legal theories and related materials. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology validation.