Semantic Web and law » VoxPopuLII


Folksonomies & Law - Background issues and theoretical perspectives


Nov 27, 2014

§.1.- Foreword

«If folksonomies work for pictures (Flickr), books (Goodreads), questions and answers (Quora), basically everything else (Delicious), why shouldn’t they work for law?» (Serena Manzoli)

In a post on this blog, Serena Manzoli distinguishes three uses of taxonomies in law: (1) for researching legal documents, (2) for teaching law students, and (3) for the practical application of the law.

In regard to her first point, she notes (observation #1) that increasing the availability of legal resources is compelling a change of the whole information architecture, and – correctly, in my opinion – she raises some objections to the heuristic efficiency of folksonomies: (objection #1) they are too “flat” to constitute something useful for legal research and (objection #2) it is likely that non-expert users could “pollute” the set of tags. Notwithstanding these issues, she states (prediction #1) that folksonomies could be helpful to non-legal users.

On the second point, she notes (observation #2) that folksonomies could be beneficial for studying the law, because they could allow one to penetrate more easily into its conceptual frameworks; she also formulates the hypothesis (prediction #2) that this teaching method could shape a more flexible mindset in students.

In discussing the third point, she notes (observation #3) that different taxonomies entail different ways of applying the law, and she formulates the hypothesis (prediction #3) that, in a distant future in which folksonomies replace taxonomies, the result would be a whole new way to apply the law.

I appreciated Manzoli’s post and accepted with pleasure the invitation of Christine Kirchberger – to whom I am grateful – to share my views with the readers of this prestigious blog. Hereinafter I intend to focus on the theoretical profiles that aroused my curiosity. My position is partly different from that of Serena Manzoli.

§.2.- Introduction

In order to detect the issues stemming from folksonomies, I think it is relevant to give some preliminary clarifications.

In collective tagging systems, by tagging we can describe the content of an object – an image, a song or a document – label it using any lexical expression preceded by the “hashtag” (the symbol “#”), and share it with our friends and followers or recommend it to an audience of strangers.

Folksonomies (a blend of the words “taxonomy” and “folk”) are sets of categories resulting from the use of tags by users to describe online resources, allowing a “many to many” connection between tags, users and resources.

Basic pattern of a folksonomy
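The “many to many” connection can be made concrete with a small sketch. It assumes that a folksonomy is stored as a set of (user, tag, resource) assignments; the class `Folksonomy`, its methods, and the sample users and documents are hypothetical names chosen for illustration, not part of any existing tagging system.

```python
from collections import defaultdict


class Folksonomy:
    """Toy folksonomy: a set of (user, tag, resource) assignments."""

    def __init__(self):
        self.assignments = set()

    def add(self, user, tag, resource):
        # Normalise the tag so that "#Privacy" and "privacy" coincide.
        self.assignments.add((user, tag.lstrip("#").lower(), resource))

    def resources_for(self, tag):
        tag = tag.lstrip("#").lower()
        return {r for (_, t, r) in self.assignments if t == tag}

    def tags_for(self, resource):
        return {t for (_, t, r) in self.assignments if r == resource}

    def tag_counts(self, resource):
        # How many distinct users applied each tag to the resource.
        counts = defaultdict(int)
        for (_, t, r) in self.assignments:
            if r == resource:
                counts[t] += 1
        return dict(counts)


f = Folksonomy()
f.add("alice", "#privacy", "doc-1")
f.add("bob", "#Privacy", "doc-1")
f.add("bob", "#contract", "doc-2")
f.add("carol", "#contract", "doc-1")
```

Here one resource carries several tags, one tag reaches several resources, and the per-tag user counts give a rough picture of how tags aggregate around a resource.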

Thomas Vander Wal coined the word a decade ago – ten years is really a long time in ICTs – and these technologies, as reported by Serena Manzoli, have now been adopted in most of the social networks and e-commerce systems.

The main feature of folksonomies is that tags aggregate spontaneously in a semantic core; therefore, they are often associated with taxonomies or ontologies, although in these latter cases hierarchies and categories are established a priori, before the collection of data.

Simplifying, I can say that tags may describe three aspects of the resources, which I illustrate with a particular example (a picture of a flowerpot lit by the sun):

(1) The content of the resources (e.g. #flowers),

(2) The interaction with other specific resources and the environment in general (e.g. #sun or #summer),

(3) The effect that these resources have on users having access to them (e.g. #beautiful).

Since it seems to me that none of these aspects should be disregarded in an overall assessment of folksonomies, I will consider all of them.

With regard to law, they tend to match these three major issues:

(1) Law as a “content”. Users select legal documents among others available and choose those that seem most relevant. As a real interest is – normally – the driving criterion of the search, and as this typically is given by the need to solve a legal problem, I designate this profile with the expression «Quid juris?».

(2) Law as a “concept”. This problem emerges because a single legal document cannot be conceived separately from the context in which it appears, namely the relations it has with the legal system to which it belongs. Consequently, it becomes inevitable to ask what the law is, as a common feature of all legal documents. Recalling Immanuel Kant in the “Metaphysics of Morals”, here I use the expression «Quid jus?».

(3) Law as a “sentiment”. What emerges in folksonomies is a subjective attitude that regards the meaning to be attributed to the research of resources and that affects the way in which it is performed. To this I intend to refer using the expression «Cur jus?».

§.3.- Folksonomies, Law, and «Quid juris?»: legal information management and collective tagging systems

In this respect, I definitely agree with Serena Manzoli. Folksonomies seem to open very interesting perspectives in the field of legal information management; we must admit, however, that these technologies still have some limitations. For instance: precisely because the resources are tagged freely, it is difficult to use them to build taxonomies or ontologies; inexperienced users classify resources less efficiently than others, diluting the efforts of more skilled users and “polluting” well-established catalogs; conversely, even experienced users can make mistakes in the allocation of tags, worsening the quality of the information being shared.

Though in some cases these issues can be solved in several ways – e.g., the use of tags can be guided by tag recommendation, hence the distinction between broad and narrow folksonomies – and even if it can reasonably be expected that these tools will work even better in the future, for now we can say that folksonomies are useful only to integrate pre-existing classifications.

I may add, as an example, that an Italian law requires the creation of “user-created taxonomies (folksonomies)”: see the “Guidelines for websites of public administrations” of 29 July 2011, page 20. These guidelines have been issued pursuant to art. 4 of Directive 26th November 2009 n. 8, of the “Minister for Public Administration and Innovation”, according to the Legislative Decree of 7 March 2005, n. 82, “Digital Administration Code” (O.J. n. 112 of 16th May 2005, S.O. n. 93). It may be interesting to point out that in Italian law innovation in administrative bodies is promoted by a specific institution, the Agency for Digital Italy (“Agenzia per l’Italia Digitale”), which coordinates the actions in this field and sets standards for usability and accessibility. Folksonomies indeed fall into this latter category.

Following this path, a municipality (Turin) has recently set up a system of “social bookmarking” for the benefit of citizens called TaggaTO.

§.4.- Folksonomies, Law, and «Quid jus?»: the difference between the “map” and the “territory”

In this regard, my theoretical approach is different from that of Serena Manzoli. Here is the reason our findings are opposite.

Human beings are “tagging animals”, since labelling things is a natural habit. We can note it in common life: each of us, indeed, organizes his environment at home (we have jars with “salt” or “pepper” written on the caps) and at work (we use folders with “invoices” or “bank account” printed on the cover). The significance of tags is obvious if we consider using them with other people: they allow us to establish and share a common information framework. For the same reasons of convenience, tags have been included in most of the software applications we use (documents, e-mail, calendars) and, as said above, in many online services. To sum up, labels help us to build a representation of reality: they are tools for our knowledge.

In regard to reality and knowledge, it may be recalled that in the twentieth century there were two philosophical perspectives: the “continental tradition”, focused on the first (reality) and pretty much common in Europe, and the “analytic philosophy”, centered on the second (knowledge) and widespread in the USA, the UK and Scandinavia. More recently, this distinction has lost much of its heuristic value and we have seen the rise of a different approach, the “philosophy of information”, which, developing some theoretical aspects of cybernetics, proposes a synthesis of reality and knowledge in a unifying vision that originates from a naturalistic notion of “information”.

I will try to simplify, saying that if reality is a kind of “territory”, and if taxonomies (and in general ontologies) can be considered as a sort of representation of knowledge, then they can be considered as “maps”.

In light of these premises, I should explain what “sharing resources” and “shared knowledge” mean to me in folksonomies. Folksonomies are indeed a kind of “map”, but different from ontologies. To use a metaphor: ontologies could be seen as “maps” created by a single geographer overlapping the reliefs of many “territories”, and sold indiscriminately to travelers; folksonomies could be seen as “maps” that inhabitants of different territories help each other to draw by telephone or by text message. Both solutions have advantages and disadvantages: the former may be detailed but more difficult to consult, while the latter may be always updated but affected by inaccuracies. In this sense, folksonomies could be called “antifragile” – according to the brilliant metaphor of Nassim Nicholas Taleb – because their value improves with increased use, while ontologies could be seen as “fragile”, because of the linearity of the process of production and distribution.

Therefore, as the “map” is not the “territory”, reality does not change depending on the representation. Nevertheless, this does not mean that the “maps” are not helpful to travel to unknown “territories”, or to reach the destination faster even in “territories” that are well known (just like when driving a car with the aid of GPS).

On the application of folksonomies to the field of law, I shall say that, after all, legal science has always been a kind of “natural folksonomy”. Indeed, it has always been a widespread knowledge, ready to be practiced, open to discussion, and above all perfectly “antifragile”: new legal issues to be solved determine a further use of the systems, thus causing an increase in knowledge and therefore a greater accuracy in the description of the legal domain. In this regard, Serena Manzoli in her post also mentioned the Corpus Juris Civilis, which for centuries has been crucial in the Western legal culture. Scholars went to Italy from all over Europe to study it, at the beginning by noting a few elucidations in the margins of the text (glossatores), then commenting on what they had learned (commentatores), and using their legal competences to decide cases that were submitted to them as judges or to argue in trials as lawyers.

Modern tradition has rejected all of this, imposing a rationalistic and rigorous view of law. This approach – “fragile”, continuing with the paradigm of Nassim Nicholas Taleb – has spread in different directions, which, simplifying, I can reduce to three:

(1) Legal imperativism: law as embodied in the words of the sovereign.

Leviathan (Thomas Hobbes)

(2) Legal realism: law as embodied in the words of the judge.

Gavel

(3) Legal formalism: law as embodied in administrative procedures.

The Castle (Franz Kafka)

For too long we have been led to pretend to see only the “map” and to ignore the “territory”. In my opinion, the application of folksonomies to law can be very useful to overcome these prejudices emerging from the traditional legal positivism, and to revisit a concept of law that is a step closer to its origin and its nature. I wrote “a step closer” deliberately: I want to emphasize that the “map”, even if obtained through a participatory process, remains a representation of the “territory”, and to suggest that the vision known as the “philosophy of information” seems an attempt to overlay or replace the two terms – hence its “naturalism” – rather than to draw a “map” as similar as possible to the “territory”.

§.5.- Folksonomies, Law, and «Cur jus?»: the user in folksonomies, from “anybody” to “somebody”

This profile does not fall within the topics covered in Manzoli’s post, but I would like to take this opportunity to discuss it because it is the most intriguing to me.

Each of us arranges his resources according to the meaning that he intends to give his world. Think of how each of us arrays the resources containing information that he needs in his work: the books on the desk of a scholar, the files on the bench of a lawyer or a judge, the documents in the archive of a company. We place things around us depending on the problem we have to address: we use the surrounding space to help us find the solution.

With folksonomies, in general, we simply do the same in a context in which the concept of “space” is just a matter of abstraction.

What does this mean? We organize things, and thus we create “information”. Gregory Bateson in a very famous book, Steps to an Ecology of Mind – in which he wrote on “maps” and “territories”, too – stated that “information” is “the difference that makes the difference”. This definition, brilliant in its simplicity, raises the tremendous problem of the meaning of our existence and the freedom of will. This issue can be explained through an example given by a very interesting app called “Somebody”, recently released by the contemporary artist Miranda July.

The app works as follows: a message addressed to a given person is written and transmitted to another, who delivers it verbally. In other words, the actual recipient receives the message from an individual who is unknown to him. The point that fascinates me is this: someone suddenly comes up to tell you that you “make a difference,” that you are not “anybody” because you are “somebody” for “somebody.” Moreover, at the same time this same person, since he is addressing you, becomes “somebody,” because the sender of the message chose him among others, since he “meant something” to him.

For me, the meaning of this amazing app can be summed up in this simple equation:

“Being somebody” = “Mean something” = “Make a difference”

This formula means that each of us believes he is worth something (“being somebody”), that his life has a meaning (“mean something”), that his choices or actions can change something – even if slightly – in this world (“make a difference”).

Returning to Bateson, if it is important to each of us to “make a difference”, if we all want to be “somebody”, then how could we settle for recognizing ourselves as just an “organizing agent”? Self-consciousness is related to semantics and to the freedom of choice: whoever is not free at all does not create any “difference” in the world. Poetically, Miranda July makes people talk to each other, giving a meaning to humanity and a purpose to freedom: this is what “making a difference” means for humans.

In applying folksonomies to law, we should consider all this. It is true that folksonomies record the way in which each user arrays available legal documents, but the purpose for which this activity is carried out should be emphasized. Therefore, it should be clear that an efficient cataloguing of resources depends on several conditions: certainly that the user knows the law and remembers its ontologies, but also that he is focused on what he is doing. This means that the user needs to be well-motivated, in order to recognize the value of what he is doing, so as to give meaning to his activity.

§.6.- Conclusion

I believe that folksonomies can teach us a lot. In them we can find not only an extraordinary technical tool, but also – and most importantly – a reason to overcome the traditional legal positivism – which is “ontological” and therefore “fragile” – and thus rediscover the cooperation not only among experts, but also with non-experts, in the name of an “antifragile” shared legacy of knowledge that is called “law”.

All this will work – or at least, it will work better – if we remember that we are human beings.

Federico Costantini.

I hold a Master’s degree in Law and a Ph.D. in Philosophy of Law from the University of Padua (Italy).
Currently I am a Researcher in Philosophy of Law (Legal informatics) in the Department of Legal sciences at the University of Udine (Italy).
My study aims to bridge philosophy, computer science and law, focusing on the strife between human nature and new technologies. Recently I have been investigating the theoretical implications of ICTs on «social ontology», the concept of law as an instrument of social control as emerging from the «peer to peer economy», the use of folksonomies in legal information management and the theoretical aspects of Digital evidence.
I teach Legal Informatics in the Faculty of Law of Udine. In my lectures on cyberlaw, which I have studied since 2000, I bring out the critical profiles of the “Information Society” from the discussion of the most recent jurisprudence.
I am also a Lawyer. I am registered in the Bar Association of Udine (Italy) in a special section (full time academic researchers and professors).
My full profile can be visited on www.linkedin.com.
My complete list of publications can be found on https://air.uniud.it.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Legal Prosumers: How Can Government Leverage User-Generated Content?


Nov 17, 2011

Prosumption: shifting the barriers between information producers and consumers

One of the major revolutions of the Internet era has been the shifting of the frontiers between producers and consumers [1]. Prosumption refers to the emergence of a new category of actors who not only consume but also contribute to content creation and sharing. Under the umbrella of Web 2.0, many sites indeed enable users to share multimedia content, data, experiences [2], views and opinions on different issues, and even to act cooperatively to solve global problems [3]. Web 2.0 has become a fertile terrain for the proliferation of valuable user data enabling user profiling, opinion mining, trend and crisis detection, and collective problem solving [4].

The private sector has long understood the potentialities of user data and has used them for analysing customer preferences and satisfaction, for finding sales opportunities, for developing marketing strategies, and as a driver for innovation. Recently, corporations have relied on Web platforms for gathering new ideas from clients on the improvement or the development of new products and services (see for instance Dell’s Ideastorm; salesforce’s IdeaExchange; and My Starbucks Idea). Similarly, Lego’s Mindstorms encourages users to share their robot-building projects online, so that the design becomes public knowledge and can be freely reused by Lego (and anyone else), as indicated by the Terms of Service. Furthermore, companies have recently been mining social network data to foresee future action of the Occupy Wall Street movement.

Even scientists have caught up and adopted collaborative methods that enable the participation of laymen in scientific projects [5].

Now, how far has government gone in taking up this opportunity?

Some recent initiatives indicate that the public sector is aware of the potential of the “wisdom of crowds.” In the domain of public health, MedWatcher is a mobile application that allows the general public to submit information about any experienced drug side effects directly to the US Food and Drug Administration. In other cases, governments have asked for general input and ideas from citizens, such as the brainstorming session organized by the Obama government, the wiki launched by the New Zealand Police to get suggestions from citizens for the drafting of a new policing act to be presented to the parliament, or the Website of the Department of Transport and Main Roads of the State of Queensland, which encourages citizens to share their stories related to road tragedies.

Even in so crucial a task as the drafting of a constitution, government has relied on citizens’ input through crowdsourcing [6]. And more recently several other initiatives have fostered crowdsourcing for constitutional reform in Morocco and in Egypt.

It is thus undeniable that we are witnessing an accelerated redefinition of the frontiers between experts and non-experts, scientists and non-scientists, doctors and patients, public officers and citizens, professional journalists and street reporters. The ‘Net has provided the infrastructure and the platforms for enabling collaborative work. Network connection is hardly a problem anymore. The problem is data analysis.

In other words: how to make sense of the flood of data produced and distributed by heterogeneous users? And more importantly, how to make sense of user-generated data in the light of more institutional sets of data (e.g., scientific, medical, legal)? The efficient use of crowdsourced data in public decision making requires building an informational flow between user experiences and institutional datasets.

Similarly, enhancing user access to public data has to do with matching user case descriptions with institutional data repositories (“What are my rights and obligations in this case?”; “Which public office can help me?”; “What is the delay in the resolution of my case?”; “How many cases like mine have there been in this area in the last month?”).

From the point of view of data processing, we are clearly facing a problem of semantic mapping and data structuring. The challenge is thus to overcome the flood of isolated information while avoiding excessive management costs. There is still a long way to go before tools for content aggregation and semantic mapping are generally available. This is why private firms and governments still mostly rely on the manual processing of user input.

The new producers of legally relevant content: a taxonomy

Before digging deeper into the challenges of efficiently managing crowdsourced data, let us take a closer look at the types of user-generated data flowing through the Internet that have some kind of legal or institutional flavour.

One type of user data emerges spontaneously from citizens’ online activity, and can take the form of:

citizens’ forums

platforms gathering open public data and comments over them (see for instance data-publica)

legal expert blogs (blawgs)

or the journalistic coverage of the legal system.

User data can as well be prompted by institutions as a result of participatory governance initiatives, such as:

crowdsourcing (targeting a specific issue or proposal by government as an open brainstorming session)

comments and questions addressed by citizens to institutions through institutional Websites or through e-mail contact.

This variety of media supports and knowledge producers gives rise to a plurality of textual genres, semantically rich but difficult to manage given their heterogeneity and quick evolution.

Managing crowdsourcing

The goal of crowdsourcing in an institutional context is to extract and aggregate content relevant for the management of public issues and for public decision making. Knowledge management strategies vary considerably depending on the ways in which user data have been generated. We can think of three possible strategies for managing the flood of user data:

Pre-structuring: prompting the citizen narrative in a strategic way

A possible solution is to elicit user input in a structured way; that is to say, to impose some constraints on user input. This is the solution adopted by IdeaScale, a software application that was used by the Open Government Dialogue initiative of the Obama Administration. In IdeaScale, users are asked to check whether their idea has already been covered by other users, and alternatively to add a new idea. They are also invited to vote for the best ideas, so that it is the community itself that rates and thus indirectly filters the users’ input.
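The pre-structuring idea can be sketched in a few lines. This is not IdeaScale’s implementation or API: the class `IdeaBoard`, the similarity threshold of 0.8, and the choice to count a near-duplicate submission as a vote for the existing idea are all assumptions made for illustration. Duplicate detection uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


class IdeaBoard:
    """Toy idea board: near-duplicates are merged, votes rank the ideas."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off
        self.votes = {}             # idea text -> number of votes

    def find_similar(self, text):
        # Return an existing idea that closely resembles the new text, if any.
        for idea in self.votes:
            ratio = SequenceMatcher(None, idea.lower(), text.lower()).ratio()
            if ratio >= self.threshold:
                return idea
        return None

    def submit(self, text):
        # Assumption: a near-duplicate counts as a vote for the existing idea.
        existing = self.find_similar(text)
        if existing is not None:
            self.votes[existing] += 1
            return existing
        self.votes[text] = 1
        return text

    def ranking(self):
        # Community rating acts as an indirect filter: most voted first.
        return sorted(self.votes, key=self.votes.get, reverse=True)


board = IdeaBoard()
board.submit("Publish all court decisions online")
board.submit("Publish all court decisions online.")
board.submit("Simplify tax forms")
```

String similarity only catches near-identical wording; two ideas that say the same thing in different words would still need a human check or a semantic comparison.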

The MIT Deliberatorium, a technology aimed at supporting large-scale online deliberation, follows a similar strategy. Users are expected to follow a series of rules to enable the correct creation of a knowledge map of the discussion. Each post should be limited to a single idea, it should not be redundant, and it should be linked to a suitable part of the knowledge map. Furthermore, posts are validated by moderators, who should ensure that new posts follow the rules of the system. Other systems that implement the same idea are featurelist and Debategraph [7].

While these systems enhance the creation and visualization of structured argument maps and promote community engagement through rating systems, they present a series of limitations. The most important of these is the fact that human intervention is needed to manually check the correct structure of the posts. Semantic technologies can play an important role in bridging this gap.

Semantic analysis through ontologies and terminologies

Ontology-driven analysis of user-generated text implies finding a way to bridge Semantic Web data structures, such as formal ontologies expressed in RDF or OWL, with unstructured implicit ontologies emerging from user-generated content. Sometimes these emergent lightweight ontologies take the form of unstructured lists of terms used for tagging online content by users. Accordingly, some works have dealt with this issue, especially in the field of social tagging of Web resources in online communities. More concretely, different works have proposed models for making compatible the so-called top-down metadata structures (ontologies) with bottom-up tagging mechanisms (folksonomies).

The possibilities range from transforming folksonomies into lightly formalized semantic resources (Lux and Dösinger, 2007; Mika, 2005) to mapping folksonomy tags to the concepts and the instances of available formal ontologies (Specia and Motta, 2007; Passant, 2007). As the basis of these works we find the notion of emergent semantics (Mika, 2005), which questions the autonomy of engineered ontologies and emphasizes the value of meaning emerging from distributed communities working collaboratively through the Web.

We have recently worked on several case studies in which we have proposed a mapping between legal and lay terminologies. We followed the approach proposed by Passant (2007) and enriched the available ontologies with the terminology appearing in lay corpora. For this purpose, OWL classes were complemented with a has_lexicalization property linking them to lay terms.
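As a rough illustration of what such an enrichment amounts to, the sketch below keeps the link between classes and lay terms in a plain Python dictionary rather than in an OWL file. The class names (`Seller`, `ConsumerRight`), the lay terms, and the helper functions are hypothetical; only the property name `has_lexicalization` comes from the text.

```python
from collections import defaultdict

# class name -> lay terms attached through has_lexicalization
has_lexicalization = defaultdict(set)


def add_lexicalization(ontology_class, lay_term):
    # Attach a lay term to an ontology class (case-insensitive).
    has_lexicalization[ontology_class].add(lay_term.lower())


def classes_for(lay_term):
    # Reverse lookup: which classes does a lay term point to?
    term = lay_term.lower()
    return {c for c, terms in has_lexicalization.items() if term in terms}


# Illustrative data only
add_lexicalization("Seller", "negoziante")
add_lexicalization("Seller", "shopkeeper")
add_lexicalization("ConsumerRight", "money back")
```

The reverse lookup is the useful part in practice: a query phrased in lay language can be routed to the ontology class that anchors the corresponding legal knowledge.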

The first case study that we conducted belongs to the domain of consumer justice, and was framed in the ONTOMEDIA project. We proposed to reuse the available Mediation-Core Ontology (MCO) and Consumer Mediation Ontology (COM) as anchors to legal, institutional, and expert knowledge, and therefore as entry points for the queries posed by consumers in common-sense language.

The user corpus contained around 10,000 consumer questions and 20,000 complaints addressed from 2007 to 2010 to the Catalan Consumer Agency. We applied a traditional terminology extraction methodology to identify candidate terms, which were subsequently validated by legal experts. We then manually mapped the lay terms to the ontological classes. The relations used for mapping lay terms with ontological classes are mostly has_lexicalisation and has_instance.
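The text does not say which extraction method was used. One common baseline, shown here purely as an assumption, is to rank frequent word n-grams that do not begin or end with a stopword and hand the resulting list to experts for validation. The stopword list, the function `candidate_terms`, and the two sample questions are invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "my", "is", "i", "and", "for", "in"}


def candidate_terms(texts, max_len=2, min_freq=2):
    """Return (term, frequency) pairs for n-grams seen at least min_freq times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(t, c) for t, c in counts.most_common() if c >= min_freq]


questions = [
    "The phone line is broken and the seller refuses a refund",
    "Can I get a refund for my phone line contract",
]
candidates = candidate_terms(questions)
```

On this toy input the surviving candidates are "phone", "line", "refund" and "phone line"; deciding which of them are genuine domain terms remains the experts’ task.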

A second case study in the domain of consumer law was carried out with Italian corpora. In this case domain terminology was extracted from a normative corpus (the Code of Italian Consumer law) and from a lay corpus (around 4000 consumers’ questions).

In order to further explore the particularities of each corpus respecting the semantic coverage of the domain, terms were gathered together into a common taxonomic structure [8]. This task was performed with the aid of domain experts. When confronted with the two lists of terms, both laypersons and technical experts would link most of the validated lay terms to the technical list of terms through one of the following relations:

Subclass: the lay term denotes a particular type of legal concept. This is the most frequent case. For instance, in the class objects, telefono cellulare (cell phone) and linea telefonica (phone line) are subclasses of the legal terms prodotto (product) and servizio (service), respectively. Similarly, in the class actors agente immobiliare (estate agent) can be seen as subclass of venditore (seller). In other cases, the linguistic structures extracted from the consumers’ corpus denote conflictual situations in which the obligations have not been fulfilled by the seller and therefore the consumer is entitled to certain rights, such as diritto alla sostituzione (entitlement to a replacement). These types of phrases are subclasses of more general legal concepts such as consumer right.

Instance: the lay term denotes a concrete instance of a legal concept. In some cases, terms extracted from the consumer corpus are named entities that denote particular individuals, such as Vodafone, an instance of a domain actor, a seller.

Equivalent: a legal term is used in lay discourse. For instance, contratto (contract) or diritto di recessione (withdrawal right).

Lexicalisation: the lay term is a lexical variant of the legal concept. This is the case for instance of negoziante, used instead of the legal term venditore (seller) or professionista (professional).
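The four relations can be written down as a small typed mapping table. The sketch below encodes only the examples given above; the names `Relation`, `MAPPINGS`, and `lay_terms_for` are hypothetical, and a real resource would store these links in the ontology itself.

```python
from enum import Enum


class Relation(Enum):
    SUBCLASS = "subclass"
    INSTANCE = "instance"
    EQUIVALENT = "equivalent"
    LEXICALISATION = "lexicalisation"


# (lay term, relation, legal term), taken from the examples in the text
MAPPINGS = [
    ("telefono cellulare", Relation.SUBCLASS, "prodotto"),
    ("linea telefonica", Relation.SUBCLASS, "servizio"),
    ("agente immobiliare", Relation.SUBCLASS, "venditore"),
    ("Vodafone", Relation.INSTANCE, "venditore"),
    ("contratto", Relation.EQUIVALENT, "contratto"),
    ("negoziante", Relation.LEXICALISATION, "venditore"),
]


def lay_terms_for(legal_term, relation=None):
    # All lay terms linked to a legal term, optionally filtered by relation.
    return [lay for lay, rel, legal in MAPPINGS
            if legal == legal_term and (relation is None or rel == relation)]
```

For instance, `lay_terms_for("venditore")` gathers the estate agent, the named company, and the lexical variant under the single legal concept of seller.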

The distribution of normative and lay terms per taxonomic level shows that, whereas normative terms populate mostly the upper levels of the taxonomy [9], deeper levels in the hierarchy are almost exclusively represented by lay terms.

Term distribution per taxonomic level
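A distribution of this kind can be computed by measuring the depth of each term in the taxonomy and counting terms per level and per source. The toy taxonomy below is hypothetical (the root `oggetto` in particular is an assumed label for the class of objects) and does not reproduce the figures of the study.

```python
from collections import Counter

# child -> parent; None marks the root. Hypothetical toy taxonomy.
PARENT = {
    "oggetto": None,
    "prodotto": "oggetto",
    "servizio": "oggetto",
    "telefono cellulare": "prodotto",
    "linea telefonica": "servizio",
}

# Which corpus each term was extracted from (assumed for the example).
SOURCE = {
    "oggetto": "normative",
    "prodotto": "normative",
    "servizio": "normative",
    "telefono cellulare": "lay",
    "linea telefonica": "lay",
}


def depth(term):
    # Level 0 is the root; each step up the parent chain adds one level.
    level = 0
    while PARENT[term] is not None:
        term = PARENT[term]
        level += 1
    return level


# (level, source) -> number of terms
distribution = Counter((depth(t), SOURCE[t]) for t in PARENT)
```

Even in this tiny example the pattern described in the text appears: normative terms sit at levels 0 and 1, lay terms at level 2.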

The result of this type of approach is a set of terminological-ontological resources that provide some insights on the nature of laypersons’ cognition of the law, such as the fact that citizens’ domain knowledge is mainly factual and therefore populates deeper levels of the taxonomy. Moreover, such resources can be used for the further processing of user input. However, this strategy presents some limitations as well. First, it is mainly driven by domain conceptual systems, which, in a way, might limit the potentialities of user-generated corpora. Second, it is not necessarily scalable. In other words, these terminological-ontological resources have to be rebuilt for each legal subdomain (such as consumer law, private law, or criminal law), and it is thus difficult to foresee mechanisms for performing an automated mapping between lay terms and legal terms.

Beyond domain ontologies: information extraction approaches

One of the most important limitations of ontology-driven approaches is the lack of scalability. In order to overcome this problem, a possible strategy is to rely on informational structures that occur generally in user-generated content. These informational structures go beyond domain conceptual models and identify mostly discursive, emotional, or event structures.

Discursive structures formalise the way users typically describe a legal case. It is possible to identify stereotypical situations appearing in the description of legal cases by citizens (e.g., the nature of the problem, the conflict resolution strategies, etc.). The core of those situations is usually a predicate, so it is possible to formalize them as frame structures containing different frame elements. We followed such an approach for the mapping of the Spanish corpus of consumers’ questions to the classes of the domain ontology (Fernández-Barrera and Casanovas, 2011). The same technique was applied for mapping a set of citizens’ complaints in the domain of acoustic nuisances to a legal domain ontology (Bourcier and Fernández-Barrera, 2011). By describing general structures of citizen description of legal cases we ensure scalability.
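To give a feel for what a frame structure looks like, here is a minimal sketch that treats a frame as a predicate pattern whose named slots are the frame elements. The “Refusal” frame, its regular expression, and the function `extract` are hypothetical and far simpler than the frames used in the cited studies; a real system would rely on linguistic analysis rather than a single regex.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Frame:
    name: str
    pattern: str  # regex whose named groups are the frame elements
    elements: dict = field(default_factory=dict)


# Hypothetical frame for a "refusal" situation: who refuses to do what.
REFUSAL = (
    r"(?P<agent>the \w+) refuses to (?P<action>\w+) "
    r"(?P<object>the \w+|my \w+)"
)


def extract(text, name, pattern):
    # Fill the frame elements from the first match, or return None.
    match = re.search(pattern, text.lower())
    if match is None:
        return None
    return Frame(name=name, pattern=pattern, elements=match.groupdict())


frame = extract("The seller refuses to replace my phone.", "Refusal", REFUSAL)
```

Because the frame describes a situation type rather than a domain concept, the same pattern could in principle be reused across consumer law, nuisance complaints, or other subdomains, which is the scalability argument made above.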

Emotional structures are extracted by current algorithms for opinion- and sentiment mining. User data in the legal domain often contain an important number of subjective elements (especially in the case of complaints and feedback on public services) that could be effectively mined and used in public decision making.

Finally, event structures, which have been deeply explored so far, could be useful for information extraction from user complaints and feedback, or for automatic classification into specific types of queries according to the described situation.

Crowdsourcing in e-government: next steps (and precautions?)

Legal prosumers’ input currently outstrips the capacity of government for extracting meaningful content in a cost-efficient way. Some developments are under way, among which are argument-mapping technologies and semantic matching between legal and lay corpora. The scalability of these methodologies is the main obstacle to overcome, in order to enable the matching of user data with open public data in several domains.

However, as technologies for the extraction of meaningful content from user-generated data develop and are used in public decision making, a series of issues will have to be dealt with. For instance, should the system developer bear responsibility for the erroneous or biased analysis of data? Ethical questions arise as well: May governments legitimately analyse any type of user-generated content? Content-analysis systems might be used for trend- and crisis detection; but what if they are also used for restricting freedoms?

The “wisdom of crowds” can certainly be valuable in public decision making, but the fact that citizens’ online behaviour can be observed and analysed by governments without citizens’ acknowledgement poses serious ethical issues.

Thus, technical development in this domain will have to be coupled with the definition of ethical guidelines and standards, maybe in the form of a system of quality labels for content-analysis systems.

[Editor’s Note: For earlier VoxPopuLII commentary on the creation of legal ontologies, see Núria Casellas, Semantic Enhancement of Legal Information… Are We Up for the Challenge? For earlier VoxPopuLII commentary on Natural Language Processing and legal Semantic Web technology, see Adam Wyner, Weaving the Legal Semantic Web with Natural Language Processing. For earlier VoxPopuLII posts on user-generated content, crowdsourcing, and legal information, see Matt Baca and Olin Parker, Collaborative, Open Democracy with LexPop; Olivier Charbonneau, Collaboration and Open Access to Law; Nick Holmes, Accessible Law; and Staffan Malmgren, Crowdsourcing Legal Commentary.]

[1] The idea of prosumption actually existed long before the Internet, as highlighted by Ritzer and Jurgenson (2010): the customer of a fast food restaurant is to some extent also the producer of the meal, since he is expected to be his own waiter, and so is the driver who pumps his own gasoline at the filling station.

[2] The experience project enables registered users to share life experiences, and it contained around 7 million stories as of January 2011: http://www.experienceproject.com/index.php.

[3] For instance, the United Nations Volunteers Online platform (http://www.onlinevolunteering.org/en/vol/index.html) helps volunteers to cooperate virtually with non-governmental organizations and other volunteers around the world.

[4] See for instance the experiment run by mathematician Gowers on his blog: he posted a problem and asked a large number of mathematicians to work collaboratively to solve it. They eventually succeeded faster than if they had worked in isolation: http://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/.

[5] The Galaxy Zoo project asks volunteers to classify images of galaxies according to their shapes: http://www.galaxyzoo.org/how_to_take_part. See as well Cornell’s projects Nestwatch (http://watch.birds.cornell.edu/nest/home/index) and FeederWatch (http://www.birds.cornell.edu/pfw/Overview/whatispfw.htm), which invite people to introduce their observation data into a Website platform.

[6] http://www.participedia.net/wiki/Icelandic_Constitutional_Council_2011.

[7] See the description of Debategraph in Marta Poblet’s post, Argument mapping: visualizing large-scale deliberations (http://serendipolis.wordpress.com/2011/10/01/argument-mapping-visualizing-large-scale-deliberations-3/).

[8] Terms have been organised in the form of a tree having as root nodes nine semantic classes previously identified. Terms have been added as branches and sub-branches, depending on their degree of abstraction.

[9] It should be noted that legal terms are mostly situated at the second level of the hierarchy rather than the first one. This is natural if we take into account the nature of the normative corpus (the Italian consumer code), which contains mostly domain specific concepts (for instance, withdrawal right) instead of general legal abstract categories (such as right and obligation).

REFERENCES

Bourcier, D., and Fernández-Barrera, M. (2011). A frame-based representation of citizen’s queries for the Web 2.0. A case study on noise nuisances. E-challenges conference, Florence 2011.

Fernández-Barrera, M., and Casanovas, P. (2011). From user needs to expert knowledge: Mapping laymen queries with ontologies in the domain of consumer mediation. AICOL Workshop, Frankfurt 2011.

Lux, M., and Dösinger, G. (2007). From folksonomies to ontologies: Employing wisdom of the crowds to serve learning purposes. International Journal of Knowledge and Learning (IJKL), 3(4/5): 515-528.

Mika, P. (2005). Ontologies are us: A unified model of social networks and semantics. In Proc. of Int. Semantic Web Conf., volume 3729 of LNCS, pp. 522-536. Springer.

Passant, A. (2007). Using ontologies to strengthen folksonomies and enrich information retrieval in Weblogs. In Int. Conf. on Weblogs and Social Media, 2007.

Poblet, M., Casellas, N., Torralba, S., and Casanovas, P. (2009). Modeling expert knowledge in the mediation domain: A Mediation Core Ontology, in N. Casellas et al. (Eds.), LOAIT 2009. 3rd Workshop on Legal Ontologies and Artificial Intelligence Techniques joint with 2nd Workshop on Semantic Processing of Legal Texts. Barcelona, IDT Series n. 2.

Ritzer, G., and Jurgenson, N. (2010). Production, consumption, prosumption: The nature of capitalism in the age of the digital “prosumer.” In Journal of Consumer Culture 10: 13-36.

Specia, L., and Motta, E. (2007). Integrating folksonomies with the Semantic Web. Proc. Euro. Semantic Web Conf., 2007.

Meritxell Fernández-Barrera is a researcher at the Cersa (Centre d’Études et de Recherches de Sciences Administratives et Politiques), CNRS, Université Paris 2. She works on the application of natural language processing (NLP) to legal discourse and legal communication, and on the potentialities of Web 2.0 for participatory democracy.

VoxPopuLII is edited by Judith Pratt. Editor-in-Chief is Robert Richards, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer in the Cornell LII Lawyer Directory.

The MetaLex Document Server

Legal identifiers, Legal metadata, Legal semantic web, Legal text processing, Legal XML, Legislative information systems, Regulatory information systems, Semantic Web and law

Oct 25, 2011

In this post I describe the process and requirements that eventually led to the MetaLex Document Server, a server that hosts all versions of Dutch national statutes and regulations published since May 2011, both as CEN MetaLex, and as Linked Data. Before I set out to do so, however, I would like to emphasize that, although the development of the server and its contents was a one-man job, the road to make it possible surely was not solitary. A couple of people I’d like to mention here are Alexander Boer, Radboud Winkels, and Tom van Engers of the Leibniz Center for Law, together with whom I have worked over the past ten years to develop, test, and publish the ideas that underlie CEN MetaLex. Also, the team around legislation.gov.uk clearly has done a lot of great and inspiring work in this area.

So, what happened? Over the course of last spring, I was involved in several small-scale projects that shared a specific need: version-aware identifiers for all parts of legislative texts. The first of these was a report for the Swiss Federal Chancellery on possible technological solutions for a regulation drafting system to be used by the Swiss government. Second to arrive on my desk was a project for the Dutch Tax and Customs Administration (Belastingdienst), in which we were asked to develop a concept-extraction toolkit that would allow them to make explicit where concepts are defined, where they are reused, and how they relate to other concepts (e.g., from an external thesaurus). The purpose of this project was to investigate whether we could replace with technology what is currently a manual process of turning legislation into business processes that fuel citizen- and business-oriented services. The Belastingdienst needs this to better cope with the yearly changes to tax regulation issued by the Ministry of Finance. The Dutch Immigration and Naturalisation Service (IND) faces exactly the same problem of discovering which parts of its business processes are affected by each legislative modification. Updates to legislation require continuous, significant investment in IT re-engineering.

The root of the problem

But don’t modern European governments already have elaborate facilities for supporting this workflow? I’m afraid not.

Currently, regulation drafting is a process of sending around Word documents, copy-and-pasting from older texts, “version hell,” signing by a Minister, and sending the enacted regulations off to a publisher, who will then turn them into some XML format to feed a publishing platform to generate HTML, PDF, and paper versions of the texts. This process is not designed with a content management perspective, and most if not all metadata is thrown away in the process.

Part of the problem is one of organisational change: convincing legislative drafters to use a more structured approach in their daily work. The Dutch Ministry of Security and Justice is currently developing a legislative editing environment (similar to the MetaVex editor developed at the University of Amsterdam), but it will take a while before this is adopted in practice.

Requirements

To develop a chain of tools for managing legal information, both as text and as knowledge models, we need to address a number of key requirements:

An integrated legislative drafting and editing environment that supports advanced version and provenance tracking (e.g., version tracking of successive changes to draft texts). Provenance information is very important for eliciting the procedure that led to an official version (both pre- and post-publication), as well as its underlying motivation.
A format in which these texts are stored that is flexible enough to allow both editing and publication to various formats (such as PDF and HTML).
The ability to persistently identify every element of a legal text. Versioning of texts, references, and metadata requires identifiers that reflect the different versions of these resources. The various parts of a text should be versioned independently, allowing for transitory regimes.

A versioning mechanism should distinguish between a regulation text as it exists at a particular time, and the final regulation. The IFLA Functional Requirements for Bibliographic Records (FRBR) (Saur, 1998) makes the following distinctions: the work as a “distinguishable intellectual or artistic creation” (e.g., the constitution); the expression as the “intellectual or artistic form that a work takes each time it is realised” (e.g., “The Constitution of July 15th, 2008”); the manifestation as the “physical embodiment of an expression of a work” (e.g., a PDF version of “The Constitution of July 15th, 2008”); and the item as a “single exemplar of a manifestation” (e.g., the PDF version of “The Constitution of July 15th, 2008” residing on my USB stick).

These identifiers should be dereferenceable to the element they describe, or a description of the element’s metadata.
Metadata and annotations should be traceable to the most detailed part of a text, as well as to its version, when needed. The same requirement holds for references between texts, allowing for fine-grained analysis of interdependencies between texts.
It is furthermore a requirement that these identifiers be transparent and follow a prescribed naming convention. This allows third parties to construct valid identifiers without having to first query a name service.
The metadata itself should be made accessible in a standard format as well.

Making do with what we’ve got

As we don’t have any time to waste, and have neither the organisational infrastructure, nor the funds, to use or develop any other (richer) information source, we need to make do with what’s currently available. How hard is it to build a chain of tools that meets at least part of these requirements? And, what information does the Dutch government already provide on which we can build the services that it itself so dearly needs?

Wetten.overheid.nl is the de facto source for legislative information in The Netherlands. Users can perform a full text search through the titles and text of all statutes and regulations of the Kingdom of the Netherlands. They can search for a specific article, as well as for the version of a text as it stood at a specified date. Wetten.overheid.nl also provides an API for retrieving XML manifestations of statutes and regulations.

Identifiers

What about identifiers? Wetten.nl supports deeplinks to particular versions of statutes and regulations, but is not very consistent about it. For example:

http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005416/article=6/date=2005-01-14

and

http://wetten.overheid.nl/BWBR0005416/TitelII/HoofdstukII/Artikel6/geldigheidsdatum_14-01-2005

both point to article 6 of the Municipal law, as it was in effect on January 14th, 2005. A third mechanism for identifying regulations is the Juriconnect standard for referring to parts of statutes and regulations. XML documents hosted by Wetten.nl use these identifiers to specify citations between statutes and regulations. For instance, the Juriconnect identifier for article 6 of the Municipal law is:

1.0:c:BWBR0005416&artikel=6&g=2005-01-14

But… the standard does not prescribe a mechanism for dereferencing an identifier to the actual text of (part of) a statute or regulation.
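Because dereferencing is left to whoever consumes the identifier, a client has to do the translation itself. The sketch below splits a Juriconnect-style identifier into its parts and rewrites it into the first deeplink form shown above. The field labels (`version`, `kind`) are my own names, the reading of `g` as the validity date is inferred from the examples, and whether the deeplink pattern generalises beyond this example is an assumption.

```python
def parse_juriconnect(identifier):
    """Split an identifier such as '1.0:c:BWBR0005416&artikel=6&g=2005-01-14'."""
    head, _, query = identifier.partition("&")
    version, kind, bwb_id = head.split(":")
    # Remaining parts are key=value pairs separated by '&'.
    params = dict(pair.split("=", 1) for pair in query.split("&") if pair)
    return {"version": version, "kind": kind, "bwb_id": bwb_id, "params": params}


def to_deeplink(identifier):
    """Rewrite the identifier into the cgi-bin deeplink pattern quoted above.

    Assumes the identifier carries both an 'artikel' and a 'g' (date) parameter.
    """
    parsed = parse_juriconnect(identifier)
    params = parsed["params"]
    return (
        "http://wetten.overheid.nl/cgi-bin/deeplink/law1/"
        f"bwbid={parsed['bwb_id']}/article={params['artikel']}/date={params['g']}"
    )
```

For instance, `to_deeplink("1.0:c:BWBR0005416&artikel=6&g=2005-01-14")` reproduces the first deeplink given above for article 6 of the Municipal law.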

The BWB XML service and format

XML manifestations of statutes and regulations are retrievable through an API on top of the “Basiswettenbestand” (BWB) content management system. This REST Web service only provides the latest version of an entire statute or regulation. The BWB XML document returned is stripped of all version history: it does not even contain the version date of the text itself.

An index of all BWB identifiers, with basic attributes such as official and abbreviated titles, enactment and publication dates, retroactivity, etc. is available as a zipped XML dump or a SOAP service. Unfortunately, the XML file is corrupt, and the date of the latest change to a statute or regulation reported in the index is not really the date of the latest modification, but of the latest update of the statute or regulation in the CMS. See the picture above.

The BWB uses its own XML schema for storing statutes and regulations; this schema does not allow for intermixing with any third-party elements or attributes, ruling out obvious extensions for rich annotations such as RDFa. And, BWB XML elements do not carry any identifiers.

A more general XML format: CEN MetaLex

CEN MetaLex is a jurisdiction-independent XML format for representing, publishing, and interchanging legal texts. It was developed to allow traceability of legal knowledge representations to their original source. MetaLex elements are purely structural. The standard keeps syntactic structure strictly separate from the meaning of elements by specifying, for each element, a name and a content model. What this essentially does is to pave the way for a semantic description of the types of content of elements in an XML document. The standard prescribes the existence of a naming convention for minting URI-based identifiers for all structural elements of a legal document. MetaLex explicitly encourages the use of RDFa attributes on its elements, and provides special metadata-elements for serialising additional RDF triples that cannot be expressed on structural elements themselves. MetaLex includes an ontology, which defines an event model for legislative modifications. The legislation.gov.uk portal has adopted the MetaLex event model for representing modifications.

Getting from BWB to CEN MetaLex

The MetaLex schema is designed to be independent of jurisdiction, which means that it should be possible to map each legacy XML element to a MetaLex element in an unambiguous fashion. Fortunately, we were able to define a straightforward 1:n mapping between BWB and MetaLex (see below) by a semi-automatic conversion of the BWB XML DTD.

The transformation of legacy XML to MetaLex and RDF is implemented in the MetaLex converter, an open source Python script available from GitHub. Conversion occurs in four stages:

mapping legacy elements to MetaLex elements,
minting identifiers for newly created elements,
producing metadata for these elements, and
serialising to appropriate formats.

Step 1: Mapping

For the transformation of BWB XML files, the converter is sequentially fed with all BWB XML files and identifiers listed in the BWB ID index. Based on the mapping table, the converter traverses the DOM tree of the source document, and synchronously builds a DOM tree for the target document. In cases where the MetaLex schema doesn’t quite “fit,” the converter has to make additional repairs.
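The traversal described here can be sketched in a few lines of Python. This is not the converter's actual code: the mapping table below is a made-up excerpt, the legacy element names are illustrative, and recording the legacy name in a `name` attribute is an assumption about how the target document keeps track of its source.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a mapping table: legacy element name -> MetaLex element name.
MAPPING = {
    "hoofdstuk": "hcontainer",
    "artikel": "hcontainer",
    "kop": "htitle",
    "al": "block",
}


def convert(source):
    """Walk the source tree and build the target tree in step with it."""
    # An unmapped element raises KeyError; these are the cases that would
    # need the additional repairs mentioned above.
    target = ET.Element(MAPPING[source.tag], {"name": source.tag})
    target.text = source.text
    target.tail = source.tail
    for child in source:
        target.append(convert(child))
    return target


legacy = ET.fromstring("<artikel><kop>Artikel 1</kop><al>Tekst</al></artikel>")
converted = convert(legacy)
```

Here `converted` is an `hcontainer` element holding an `htitle` and a `block`, each still carrying the name of the legacy element it came from.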

We evaluated the ability of MetaLex to map onto the BWB XML by running the converter on 300 randomly selected BWB identifiers. The artikel element accounts for 72% of all corrections, and corresponds to 68% of all htitle substitutions (5% of total). This means that only a very small part of BWB XML does not directly fit onto the MetaLex schema. The main cause for incompatibility is the restriction in MetaLex that hcontainer elements are not allowed to contain block elements (and that’s perhaps something to consider for the MetaLex workshop).

Because of the limitations of the API, version information, citation titles, and other metadata are retrieved through a custom-built scraper of the information pages on the wetten.nl Website.

Step 2: Minting Identifiers

For every element in the document we create transparent URL-like URIs for the work, expression and manifestation levels, and two opaque URIs for the expression and item levels in the FRBR specification.

For transparent URIs, we use a naming scheme that is based on the URIs used at legislation.gov.uk, with slight adaptations to allow for the Dutch situation. In short, work level identifiers are based on the standard BWB identifier, followed by a hierarchical path to an element in the source, e.g., “chapter/1/article/1”. These URIs are extended to expression URIs by appending version and language information. Similarly, manifestation URIs are extensions of expression URIs that specify format information such as XML, RDF, etc. Juriconnect references in the source BWB XML are automatically translated to this naming scheme.
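As a rough illustration of how such transparent URIs nest, consider the sketch below. The base URI, the order of language and version, and the `data.<format>` suffix are assumptions made for the example; the authoritative layout is whatever the server itself publishes.

```python
BASE = "http://doc.metalex.eu/id"  # assumed base path, for illustration only


def work_uri(bwb_id, path):
    """Work level: BWB identifier followed by a hierarchical path to the element."""
    return "/".join([BASE, bwb_id] + [str(step) for step in path])


def expression_uri(work, language, version_date):
    """Expression level: the work URI extended with language and version."""
    return f"{work}/{language}/{version_date}"


def manifestation_uri(expression, fmt):
    """Manifestation level: the expression URI extended with a format."""
    return f"{expression}/data.{fmt}"


work = work_uri("BWBR0005416", ["chapter", 1, "article", 1])
expression = expression_uri(work, "nl", "2011-05-01")
manifestation = manifestation_uri(expression, "xml")
```

Because each level is a plain extension of the one above it, a third party can construct or shorten a URI by string manipulation alone, which is what the transparency requirement asks for.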

The opaque version URI is needed to distinguish different versions of a text. The current Web service does not provide access to all versions of statutes and regulations (only to the latest), let alone at a level of granularity lower than entire statutes or regulations. We therefore need some way of constructing a version history by regularly checking for new versions, and comparing them to those we looked at before. By including in the opaque URI a unique SHA1 hash of the textual content of an XML element, and simultaneously maintaining a link between the opaque URI and the transparent identifier, different expressions of a work can be automatically distinguished through time. This is needed to work around issues with identifiers based on numbers: the insertion of a new element can change the position (and therefore the identifier) of other elements without a change in the content of the elements.
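The hashing idea can be sketched as follows. The whitespace normalisation and the way the hash is appended to the work URI are my assumptions, not a description of the converter; the only part taken from the text is that the opaque identifier is derived from a SHA1 hash of the element's textual content.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_hash(element):
    """SHA1 over the element's text content, with whitespace collapsed (assumption)."""
    text = " ".join("".join(element.itertext()).split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def opaque_uri(work, element):
    """Hypothetical layout: the work URI followed by the content hash."""
    return f"{work}/{content_hash(element)}"


old = ET.fromstring("<artikel><al>In elke gemeente is een raad.</al></artikel>")
moved = ET.fromstring("<artikel>\n  <al>In elke gemeente is een raad.</al>\n</artikel>")
changed = ET.fromstring("<artikel><al>In elke gemeente is een college.</al></artikel>")
```

With these inputs, `old` and `moved` hash to the same value while `changed` does not. An element that merely shifts position after an insertion therefore keeps its opaque identifier, whereas any change to its text produces a new one.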

By this method, globally persistent URIs of every element in a legal text can be consistently generated for both current and future versions of the text. By simultaneously generating an opaque and a transparent expression level URI, identification of these text versions does not have to rely on numbering.

Step 3: Producing Metadata

The MetaLex converter produces three types of metadata. First, legacy metadata from attributes in the source XML is directly translated to RDF triples. Second, metadata describing the structural and identity relations between elements is created. This includes typing resources according to the MetaLex ontology, e.g., as ml:BibliographicExpression; creating ml:realizes relations between expressions and works; and creating owl:sameAs relations between opaque and transparent expression URIs. The official title, abbreviation, and publication date of statutes and regulations are represented using the dcterms:title, dcterms:alternative and dcterms:valid properties.
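A small sketch may help to picture the structural and identity triples. It uses plain Python tuples with prefixed names instead of an RDF library, and the choice to attach the title and abbreviation to the work (rather than to the expression) is an assumption made for the example.

```python
def describe_expression(work, transparent_expr, opaque_expr, title, abbreviation=None):
    """Return (subject, predicate, object) triples for one expression of a work."""
    triples = [
        (opaque_expr, "rdf:type", "ml:BibliographicExpression"),
        (opaque_expr, "ml:realizes", work),
        # Identity link between the two expression-level identifiers.
        (opaque_expr, "owl:sameAs", transparent_expr),
        (work, "dcterms:title", title),
    ]
    if abbreviation:
        triples.append((work, "dcterms:alternative", abbreviation))
    return triples


# Placeholder URIs and literals, for illustration only.
triples = describe_expression(
    work="ex:work",
    transparent_expr="ex:work/nl/2011-05-01",
    opaque_expr="ex:work/3f2a",
    title="Example Act",
    abbreviation="EA",
)
```

The `owl:sameAs` triple is what lets a query that starts from the human-readable URI reach everything recorded against the hash-based one, and vice versa.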

Events and Processes

Event information plays a central role in determining what version of a regulation was valid when. Making explicit which events and modifying processes contributed to an expression of a regulation provides for a flexible and extensible model. The MDS uses the MetaLex ontology for legislative modification events, the Simple Event Model (SEM) and the W3C Time Ontology for an abstract description of events and event types, and the Open Provenance Model Vocabulary (OPMV) for describing processes and provenance information. These vocabularies can be combined in a compatible fashion, allowing for maximal reuse of event and process descriptions by third parties that may not necessarily commit to the MetaLex ontology.

Step 4: Serialization

The MetaLex converter supports three formats for serialising a legal text to a manifestation: the MetaLex format itself, viewable in a browser by linking a CSS stylesheet; RDFa, Turtle and RDF/XML serializations of the RDF metadata; and a citation graph. The converter can automatically upload RDF to a triple store through either the Sesame API, or SPARQL updates. The citation graph is exported as a “.net” network file, for further analysis in social network software tools such as Pajek and Gephi. We are exploring ways to use these networks for determining the importance of articles (in-degree) and the dependency of legislation on certain articles (betweenness centrality), and for analysing the correlation between legislation and case law.
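For readers unfamiliar with the network file format, the following sketch writes a list of citations as a Pajek-style file and counts incoming citations per article. It is a simplified illustration, not the converter's exporter, and the article labels in the example are invented.

```python
from collections import Counter


def to_pajek(citations):
    """Serialise (citing, cited) pairs as a Pajek network: vertices, then arcs."""
    nodes = sorted({node for edge in citations for node in edge})
    index = {node: i for i, node in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[node]} "{node}"' for node in nodes]
    lines.append("*Arcs")
    lines += [f"{index[source]} {index[target]}" for source, target in citations]
    return "\n".join(lines) + "\n"


def in_degree(citations):
    """Number of incoming citations per cited article."""
    return Counter(target for _, target in citations)


citations = [("art-1", "art-3"), ("art-2", "art-3"), ("art-3", "art-4")]
network = to_pajek(citations)
ranking = in_degree(citations).most_common()
```

In this toy graph `art-3` is cited twice and comes first in `ranking`. Betweenness centrality needs shortest-path computations and is better left to the network tools mentioned above.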

Publication

The results of this procedure are published through the MetaLex Document Server (MDS). The server follows the Cool URIs specification, and implements HTTP-based redirects for work- and expression-level URIs to corresponding manifestations based on the HTTP Accept header. Requests for an HTML MIME type are redirected to a Marbles HTML rendering of a Symmetric Concise Bounded Description (SCBD) of the RDF resource. Similarly, requests for RDF content return the SCBD itself; supported formats are RDF/XML and Turtle. A request for XML will return a snippet of MetaLex for the specified part of a statute or regulation.
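The redirect logic amounts to ordinary content negotiation. The sketch below is a simplified stand-in for what a server of this kind might do: the format suffixes, the `data.<format>` location pattern, and the HTML default are assumptions, and wildcard media ranges are not handled specially.

```python
# MIME types the text says are served, mapped to hypothetical format suffixes.
FORMATS = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "text/turtle": "ttl",
    "application/xml": "xml",
}


def preferred_format(accept, default="html"):
    """Pick the supported format with the highest q-value in an Accept header."""
    ranked = []
    for position, part in enumerate(accept.split(",")):
        fields = [field.strip() for field in part.split(";")]
        mime, quality = fields[0].lower(), 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            ranked.append((-quality, position, mime))
    for _, _, mime in sorted(ranked):
        if mime in FORMATS:
            return FORMATS[mime]
    return default


def redirect(uri, accept):
    """Return a (status, location) pair; 303 See Other as in the Cool URIs note."""
    return 303, f"{uri}/data.{preferred_format(accept)}"
```

A request with `Accept: text/turtle;q=0.9, text/html;q=0.5` would thus be redirected to the Turtle manifestation, while a browser sending `text/html` first ends up at the HTML rendering.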

The MDS provides two convenient methods for retrieving manifestations of a statute or regulation. Appending “/latest” to a work URI will redirect to the latest expression present in the triple store. Appending an arbitrary ISO date will return the last expression published before that date if no direct match is available. Lastly, the MDS offers a simple search interface for finding statutes and regulations based on the title and version date.
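The two lookup rules are easy to state in code. The sketch below assumes the version dates of a work are already known as a list; how the server actually obtains them from the triple store is not covered here.

```python
from bisect import bisect_right
from datetime import date


def resolve(versions, requested=None):
    """Return the version date to serve, or None if nothing qualifies.

    With no requested date this mimics "/latest"; otherwise it returns the
    exact match if present, else the last version published before that date.
    """
    ordered = sorted(versions)
    if not ordered:
        return None
    if requested is None:
        return ordered[-1]
    position = bisect_right(ordered, requested)
    return ordered[position - 1] if position else None


versions = [date(2011, 5, 1), date(2011, 7, 1), date(2011, 9, 15)]
latest = resolve(versions)
exact = resolve(versions, date(2011, 7, 1))
fallback = resolve(versions, date(2011, 8, 20))
too_early = resolve(versions, date(2010, 1, 1))
```

Here `fallback` is the version of 1 July 2011, since no version was published on 20 August, and `too_early` is `None` because no version precedes the requested date.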

Results and Take Up

We have been running the converter on a daily basis, on all versions of statutes and regulations made available through the wetten.nl portal since May 2011. This has resulted in a current total of 29,120 document versions: 28 thousand versions in the first run, the rest accumulated through time. For these document versions, we now store 119 million triples of RDF metadata in a 4Store triple store. Compared to the size of legislation.gov.uk (1.9 billion triples, since the 1200s), this is a modest number, but at the current growth rate we will soon need to look for alternative (more professional) solutions. Check the http://doc.metalex.eu Website for the latest numbers.

I am happy to say, also, that this work has not gone unnoticed. The IND was particularly enthused by the versioning mechanism, and is in the process of adopting the MDS approach as their internal content management system. Similarly, the ability to link concept descriptions to reliably versioned parts of legislation has been an eye opener for the Belastingdienst. We are also in touch with several people at ICTU, the organisation behind Wetten.nl, to help them improve their services.

Dr. Rinke Hoekstra is a postdoctoral researcher at the Leibniz Center for Law at the University of Amsterdam. He is the developer of the MetaLex Document Server, the principal author of the LKIF Core ontology of basic legal concepts, and one of the initial developers of the MetaLex XML format for legal sources.


Is it Time for Law Libraries to Collaborate on Description for Their Own Institutions’ Legal Scholarship?

Legal knowledge representation, Legal metadata, Legal ontologies, Legal publishing, Legal semantic web, Linked Data and law, Semantic annotation of legal texts, Semantic Web and law

Sep 30, 2011

Over the past couple of years, there has been a great deal of discussion, particularly in relation to the Durham Statement [1], about technical standards and preservation issues for law reviews that publish openly and exclusively online. Other colleagues have already blogged or written more formally about the lack of metadata generated in the production of law reviews, and about problems in indexing open access law journal literature. [2] In a previous VoxPopuLII post, Dr. Núria Casellas discussed the significance of semantic enhancement and how it affects how we should be thinking about providing access to legal information. [3] So I would like to marry the two discussions, by formally asking my academic law librarian colleagues whether the time has come for us to work together to develop an ontology [4], a substantive knowledge system that could be used by our law schools’ legal journals in “marking up” content for consumption on the Internet.

This ontology should be applied not only to what we think of as traditional law journal content, but also to “related” content — such as companion blogs, video, data, etc. This related content will inevitably grow, as what we think of as a “journal” evolves. Indeed, a typical “journal” will most likely look very different years from now than it does today. As members of the institutions that publish one of the major forms of literature in law, and as members of organizations that possess significant legal metadata and subject expertise, law librarians are uniquely positioned to facilitate the discoverability and utility of law reviews published on the Web. Such a project also has the potential to support additional projects such as new metrics and ways in which to look at scholarship.

If this system were indeed widely adopted, it could facilitate a type of access to law journal content that has not been accomplished with existing, centralized means of access, such as Google Scholar, the ABA’s project, and commercial databases. Ideally, I would like to see our community develop a cooperative that could provide a hosted technical infrastructure to be used by institutions that lack the financial or technical resources to invest in a major repository service or open source solutions. While this idea seems “utopian” at this point, I think that our community could realistically pursue standards and language for the more “substantive” aspects of the metadata, even if we are unable to agree on the ultimate solutions for preservation or platforms that “serve up” the content. Such an ontology could also serve as a precursor to even more ambitious, collaborative projects to make legal information more accessible and discoverable.

What do you mean by an ontology? Don’t you just mean a taxonomy or “shared vocabulary”?

I frame the idea as an “ontology project” because publishing on the Web has increasingly become about structured, open, Linked Data and marking up content for the Semantic Web [4]. As the significance of Linked Data grows, it is important for us to think about access to legal scholarship in terms of knowledge systems that contemplate access to information in those terms. The use of structured data/schemas in publishing law reviews would be optimized by human knowledge/expertise for the expression of ideas and language to be applied in that data. We need to think about subject access beyond the standard, familiar hierarchical “subjects” that we have come to use in our existing taxonomies, indexing, and classification systems. [5] We should be thinking about these “subjects” in a way that shows deeper interrelationships between concepts and “types” and that “interacts” and is “interoperable” with other systems. The ontology approach contemplates access to information from a variety of perspectives in relational and situational ways.

Law reviews could be published in a way that incorporates a particular ontology that could also be mapped to other ontologies. Linking ontologies in this way would yield useful connections across systems, bodies of knowledge, and perspectives, including multilingual thesauri, interdisciplinary knowledge, and practice-oriented and “pro se” consumer perspectives. Thinking about the project as an “ontology” also brings to mind three other important features of the system: (1) the “philosophical” definition of the term “ontology”; (2) the significance of “language” and subject expertise; and (3) the flexibility that would allow us to build something dynamic and responsive to the ever-changing nature of law. Such a system would reflect the approach to legal information advocated in Dr. Casellas’s piece. [6]

Don’t legal indexes already do this? Why reinvent the wheel?

When some of these issues were raised last October at the workshop entitled “Implementing the Durham Statement: Best Practices for Open Access Law Journals”, someone asked why we should “reinvent the wheel” when other longstanding systems (e.g., law journal indexes) are already doing this. Most of these longstanding systems are based on paid subscription models and are not open in a way that facilitates rapid response to evolving developments in the law, or use by those who consume legal information. More importantly, this project would really be about facilitating publishing and improving access to online content by providing a quality, substantive, open knowledge structure for journals to use for marking up content and building access into publication. This project would not be an attempt to displace or “usurp” indexes which focus on access to content from the “outside” perspective of the publication itself (and which are increasingly concerned with marketable enhancements like full-text access, search features, user interface, Web 2.0 functionality, etc.).

The “wheels” we might be accused of reinventing also include “federated searching” and Web-scale discovery systems being purchased by libraries, but I think similar arguments about cost, perspective on the content, and scope would still apply. Development of our project and adoption of Web-scale discovery systems are not mutually exclusive. Web-scale discovery systems could potentially integrate and map to our system. In any event, the point is not to “throw out” existing systems, but to create an additional knowledge structure that is open and potentially informed by and interoperable with other existing systems.

How would we proceed?

There are many approaches to ontology development [7], including derivation from or text-mining of legal texts [8], top-down development by humans, and building upon or extracting from existing ontologies. [9] Any of these methods (or a combination thereof) could work in the case of developing an ontological structure that could be applied to law review content.

A “top–down” approach based on the knowledge of individuals could start with librarians. But it should also involve working with law school faculty and scholars having expertise in particular subject areas, as well as with authors of law journal articles, and editors of law journals themselves (particularly those focusing on specialized legal topics). [10] Each law school has faculty and librarians who possess specialized legal subject knowledge — as well as collections in particular areas of law — that could enrich the project. In addition to contributing their substantive knowledge, librarians would have an opportunity to develop a language and a system that reflect how they think about and look for information.

Other colleagues have already suggested greater engagement with law school faculty, for purposes of learning about how faculty conduct and think about research. [11] A project like this would give us the opportunity to engage our stakeholders respecting how they think about, contextualize, and relate topics. (Perhaps we could learn more about the way law library stakeholders think about information by presenting them with samplings of articles, and inquiring as to how the stakeholders would “expect” to find those articles.) Instead of forcing those knowledgeable about their field to learn the taxonomy and structure we have been given by traditional systems, we would be harvesting the expertise of those subject specialists in order to create richer metadata that contemplates their habits and knowledge. Faculty, authors, and journal editors with subject expertise coupled with law librarians could potentially provide a very sophisticated, dynamic, and responsive system.

We should also consider looking at existing ontologies and other systems, including Library of Congress and other popular and relevant systems used in law. There are several ontologies related to law that could inform the project and that could also potentially be mapped. [12] Systems to be consulted (and mapped) could include ontologies designed for primary law and local knowledge management in legal settings, as well as ontologies in subject areas outside the law. Also, some law schools might have their own local systems that could inform ours.

Finally, while we would probably want to avoid using text mining as the only method, the project should also contemplate doing some mining and extraction from law journal literature itself. Such an approach might be particularly helpful in grappling with older legal concepts and appreciating the use of certain terms/language over time.

Whatever method(s) we select, we have a wealth of inspiration from other projects in legal informatics and from projects in other disciplines (particularly in the sciences) that strive to provide naming conventions within disciplines, and map knowledge across systems through coordinated efforts. Although it is a much more ambitious project than ours, John Wilbanks’ Neurocommons project provides us with a model of how such a project could garner participation and grow, particularly if we were to coordinate with other projects and ontologies being developed. [13]

If we build it, who will use it?

If we do develop such a system, who would actually apply it in publishing law reviews? Hopefully, libraries will take the lead and realize that this is a role that they themselves should be fulfilling. While many libraries facilitate repositories and other platforms for publishing law journals and provide training and reference/research support for cite checking and preemption, many do not provide markup and metadata work on the articles themselves. In a recent survey by Benjamin Keele and me in relation to a paper we have been writing, only 1 in 57 respondents reported doing any work on article metadata for their journals. [14] Librarians are already cataloging books and spending time grappling with metadata development and changes in the ways in which we describe our cataloged resources (RDA, FRBR, etc.). Further, librarians today spend a lot of time and money purchasing or building “repositories” or other platforms for their law journals. Greater support of metadata development for our own institutional output (beyond provision of simplistic taxonomies) is a natural outgrowth of such activities. As other librarians have commented, providing institutional repositories is not sufficient. [15]

Such activity could contemplate new roles for catalogers. A recent NISO Webinar on the impact of Linked Data on library cataloging suggested that library catalogers will be less focused on creating “records” and more concerned with “graphs”. The presenters commented that catalogers will enhance the increasing amount of minimal metadata coming directly from publishers, and provide access to original and local content. [16] Some libraries have already integrated metadata work into their workflows for cataloged resources, and it is possible for a law library to integrate journal work into its technical services workflow. [17] From a reference perspective, librarians are already often exposed to journal content in the early stages of publication through support of preemption checking, student note/comment research help, and cite checking support. Librarians are thereby in a good position to understand the “aboutness” of the content. Such involvement could provide law librarians with a natural progression toward being more involved in helping journals “mark up” their content. It is an opportunity for us to embrace more wholeheartedly the role of law libraries as publishers and knowledge managers. Many of our colleagues in the open law movement, in knowledge management in legal practice, and in other disciplines have made forays into this area. [18]

Some would probably argue that libraries do not have sufficient staff to get involved in law journal publishing activities, particularly in markup. In addition, some institutions have entire offices outside the library that support publication activities. Even if libraries feel that they are not in a position to manage the workflow of the application of this knowledge system, the most important contribution librarians can make lies in expertise or intellectual input. Application of this ontology could also be performed by law students themselves or other law school staff. Further, authors themselves are potential users and providers of metadata. In many other disciplines, especially the sciences, more authors are using author add-in tools and other software programs to help mark up their manuscripts for publication. Specialized tools could be developed to facilitate authors’ adding metadata to their own law review articles. (Many law authors are already used to contributing keywords to SSRN papers.)

“But…”: Obstacles and opportunities

As I write this piece, I anticipate comments such as, “That would be too big of an undertaking,” and: “Is that really our role?” While libraries are feeling the pressure of more limited resources and time, I would argue that this project would synergize with libraries’ existing interactions with our primary users (faculty and students) and could be built into other outreach activities. In the end, it could actually help to create an organic system responsive to users’ needs. Pursuing this project in tandem with other coordinated activities to facilitate open access law journals, law librarians would join many of our university library colleagues in thinking of ourselves in the role of producer/publisher and in providing new opportunities for our library staff (both technical services and reference/subject specialists).

I envision a host of other issues and problems (too many to enumerate in this posting) that might arise in relation to a project such as this, but I consider none of them “insurmountable.” Below is a sample of some issues that come to mind:

Coordination/governance: Who would control the project? Who would be the final arbiter of what is adopted? Past discussions of the Durham Statement have suggested the possibility of an organization providing support for journals that tried to “comply” with the Durham Statement. [19] Such an organization might consider taking on a project such as this. Perhaps leadership for this project could evolve in some way out of institutional and personal relationships, such as those that have evolved for collection development, [20] or possibly through some coordinated efforts of American Association of Law Libraries (AALL) Special Interest Sections (particularly ALL-SIS, TS-SIS, and RIPS-SIS). If our own institutions are not willing to support such a project, individual librarians on their own (myself included!) might be willing to contribute time and energy to the project. We are also fortunate to have a supportive community of technologists in the open law and knowledge management fields, who could serve as potential partners. The important aspects of the project are that it should be owned by an entity with diverse representation and interests, and that it should be established as something that will be free.

Target content and scope: Would we be framing subjects as they tend to exist in U.S. law review literature? While the structure would be designed for use by law reviews, if it were kept open and without restrictions, it could potentially be adopted by peer-reviewed journals and mapped to other indexing systems, either through Web-scale discovery or other systems. How do we frame an ontology that contemplates incorporation of multiple legal systems and relation to multiple languages? How do we deal with the translation issues that may arise? How would the ontology map to other systems and multilingual thesauri? Should we be contemplating ontologies in other disciplines that have addressed these issues?

Could it be for naught? One might ask, “If we build it, will they come?” Even if provided with such a system (as well as other best practices and support), would law reviews actually adopt it? Even if they do not apply such a system to their data structures, the substantive system that evolves could also be applied from the “outside” by third parties if the content itself is open. While one could argue that this would truly be “reinventing the wheel” in duplicating the efforts of existing indexing systems, one could argue alternatively that the scope, nature, and openness of the resulting system would offer a unique contribution to the indexing environment and would at least provide an additional alternative to the existing systems.

Technical questions: Which particular tools should we use to work collaboratively? What machine-readable formats would we contemplate using? How would we deal technically with systemic changes to the ontology and its application? There is a long list of tools and formats suitable for this project, and of methods for dealing with changes to metadata resources such as ontologies.

How would we contemplate application of this ontology in existing publishing platforms? What tools would we contemplate journals using to mark up documents with metadata from the ontology? Many of the repositories and platforms libraries are currently using permit enhancement of metadata with keywords, user-generated tags, or existing basic subject categories. But existing repositories and platforms do not necessarily facilitate markup that is optimized for the Semantic Web.
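
As one illustration of what such markup might look like, here is a minimal sketch in Python that builds a JSON-LD description of an article using schema.org terms, with the article's subjects expressed as links to concepts in the ontology. JSON-LD is only one possible format, chosen here for illustration. The ontology base URI, the function name, and the sample article are all hypothetical.

```python
import json

# Hypothetical base URI for the shared ontology; not a real service.
ONTOLOGY_BASE = "http://example.org/law-review-ontology/"


def article_jsonld(title, authors, journal, concept_ids):
    """Build a schema.org-style JSON-LD description of one article.

    concept_ids are local names of concepts in the (hypothetical)
    ontology; each becomes a full URI in the "about" list.
    """
    return {
        "@context": "http://schema.org",
        "@type": "ScholarlyArticle",
        "headline": title,
        "author": [{"@type": "Person", "name": name} for name in authors],
        "isPartOf": {"@type": "Periodical", "name": journal},
        "about": [{"@id": ONTOLOGY_BASE + cid} for cid in concept_ids],
    }


# Invented sample data, for illustration only.
record = article_jsonld(
    title="An Invented Article Title",
    authors=["A. Author"],
    journal="Example Law Review",
    concept_ids=["OpenAccess", "LegalPublishing"],
)
print(json.dumps(record, indent=2))
```

Because the subjects are URIs rather than free-text keywords, a third party could in principle follow them back to the ontology or map them to another system.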

Who is the audience? Who is looking for such an ontology? If the language and concepts are at least in part based on the needs and knowledge of our faculty and students, do we develop something tailored to their use instead of developing something that serves broader norms? How could we take into consideration how others (pro se’s, court personnel, etc.) might be looking for information, and map or relate our ontology to other systems that incorporate those users’ perspectives? How could we develop an ontology that contemplates relating to primary law?

Rights issues: Are there rights issues involved in adaptations or derivations of others’ ontologies? How would we want to handle rights issues/licenses respecting the ontology that we develop? [21] Hopefully, the answer is freely and openly!

So what do you think?

Hopefully, this post will spur a discussion that could be continued on this blog or in another forum. In any event, law libraries should be rethinking their roles in the production of law review metadata. Law libraries should be considering how the evolution of the Semantic Web and cataloging standards might impact how they provide support for their own institution’s journals.

NOTES

This post is based in part on two draft papers: Benjamin Keele and Michelle Pearse, How Law Libraries Can Help Law Journals Publish Better (poster session presented during the 2011 AALL Annual Meeting in Philadelphia, PA on July 23-26, 2011), and Michelle Pearse, Whither the Future of Law Journal Indexing?.

[1] Richard A. Danner, Kelly Leong, and Wayne Miller, The Durham Statement Two Years Later: Open Access in the Law School Journal Environment, 103 Law Library Journal 39, 52 (2011), http://scholarship.law.duke.edu/faculty_scholarship/2358/; Implementing the Durham Statement: Best Practices for Open Access Law Journals Conference, http://www.law.duke.edu/libtech/openaccess/conference2010 (October 22, 2010).

[2] Tom Boone, Librarians Key to Open Access Electronic Law Reviews, http://tomboone.com/library-laws/2009/09/librarians-key-open-access-electronic-law-reviews (September 3, 2009); Sarah Glassmeyer, Getting to Durham Compliance, SarahGlassmeyer(Dot)Com, http://sarahglassmeyer.com/?p=442 (April 26, 2010); Edward T. Hart, Indexing Open Access Law Journals…or Maybe Not, 38 International Journal of Legal Information 19 (2010), http://scholarship.law.cornell.edu/ijli/vol38/iss1/5/.

[3] Dr. Nuria Casellas, Semantic Enhancement of Legal Information: Are We Up for the Challenge?, VoxPopuLII, http://blog.law.cornell.edu/voxpop/2010/02/15/semantic-enhancement-of-legal-information%e2%80%a6-are-we-up-for-the-challenge/ (February 15, 2010).

[4] Some resources related to this topic appear at http://schema.org and http://linkeddata.org. Some argue that the Semantic Web might already be ill-fated: Janna Quitney Anderson and Lee Rainie, The Fate of the Semantic Web, http://www.pewinternet.org/~/media//Files/Reports/2010/PIP-Future-of-the-Internet-Semantic-web.pdf (Pew Research Center 2010). Tom Gruber defines an ontology as “a specification of a conceptualization”: http://www-ksl.stanford.edu/kst/what-is-an-ontology.html; Tom Gruber, in the Encyclopedia of Database Systems, Ling Liu and M. Tamer Özsu (Eds.), Springer-Verlag, 2009 http://tomgruber.org/writing/ontology-definition-2007.htm and http://semanticweb.org/wiki/Ontology. Joost Breuker and colleagues elaborate: “The term ‘ontology’ may have different meanings: (i) philosophical discipline; (ii) informal conceptual system; (iii) a formal semantic account; (iv) a specification of a conceptualization; (v) a representation of a conceptual system via logical theory, (vi) the vocabulary used by a logical theory, (vii) a meta-level specification of a logical theory.” J. Breuker et al., “The Flood, the Channels and the Dykes,” in Joost Breuker, Pompeu Casanovas, Michael C.A. Klein and Enrico Francesconi, eds., Law, Ontologies and the Semantic Web: Channeling the Legal Information Flood (IOS Press 2009), at 11. Adam Wyner defines “ontology” in the following way: “An ontology represents a common vocabulary and organization of information that explicitly, formally, and generally specifies a conceptualization of a given domain. Ontologies are related to knowledge management (cf. Rusanow’s ‘Knowledge Management and the Smarter Lawyer’) and taxonomies (cf. Sherwin’s article ‘Legal Taxonomies’). But an ontology is a more specific, explicit and formal representation of knowledge than provided by KM [knowledge management]; and it is richer and more flexible than a taxonomy….In making an ontology, one turns tacit expert knowledge into explicit representations that can be shared, tested and modified by people as well as processed by a computer.” Dr. Adam Z. Wyner, “Legal Concepts Spin a Semantic Web”, Law Technology News, http://www.law.com/jsp/lawtechnologynews10/PubArticleLTN.jsp?id=1202431256007&slreturn=1 (June 8, 2009). Dr. Núria Casellas gives a good explanation of the Semantic Web and ontologies: Dr. Núria Casellas, Semantic Enhancement of Legal Information: Are We Up for the Challenge? http://blog.law.cornell.edu/voxpop/2010/02/15/semantic-enhancement-of-legal-information%e2%80%a6-are-we-up-for-the-challenge/ (February 15, 2010).

[5] Christopher A. Welty and Jessica Jenkins, Formal Ontology for Subject, 31 Data & Knowledge Engineering 155 (1999) (also available at http://www.cs.vassar.edu/~weltyc/papers/subjects/subject.html); Hope A. Olson, The Power to Name: Locating the Limits of Subject Representation in Libraries (Kluwer Academic Publishers, 2002); Knowledge Representation with Ontologies: Present Challenges – Future Possibilities, 65 International Journal of Human-Computer Studies 563 (2007), doi: 10.1016/j.ijhcs.2007.04.003.

[6] “In the subfield of computer science and information science known as Knowledge Representation, the term ‘ontology’ refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts are formalized as classes and defined with axioms, enriched with description of attributes or constraints, and properties.” Dr. Núria Casellas, Semantic Enhancement of Legal Information: Are We Up for the Challenge? http://blog.law.cornell.edu/voxpop/2010/02/15/semantic-enhancement-of-legal-information%e2%80%a6-are-we-up-for-the-challenge/. See also Dr. Adam Z. Wyner, “Legal Concepts Spin a Semantic Web”, Law Technology News, http://www.law.com/jsp/lawtechnologynews10/PubArticleLTN.jsp?id=1202431256007&slreturn=1 (June 8, 2009) (suggesting Web-based collaborative ontology development where legal professionals contribute to a free, open ontology for law); Dr. Adam Z. Wyner, Weaving the Legal Semantic Web with Natural Language Processing, VoxPopuLII, http://blog.law.cornell.edu/voxpop/2010/05/17/weaving-the-legal-semantic-web-with-natural-language-processing/ (May 17, 2010).

[7] Bill Cope, Mary Kalantzis and Liam Magee, Towards a Semantic Web: Connecting Knowledge in Academic Research (Chandos Publishing 2011), at 72 (noting several studies on investigating approaches and software); A Holistic Approach to Collaborative Ontology Development Based on Change Management, 9 Web Semantics: Science, Services and Agents on the World Wide Web 299 (2011), doi:10.1016/j.websem.2011.06.007; “Ontologies can be designed by means of methods such as…encompassing top-down expertise elicitation from humans, bottom-up learning from documents, and middle-out application of design patterns, which can be specialized from domain-independent ontologies, extracted from best practices, existing ontologies or other knowledge sources, as well as learnt from conceptual invariances found in experts’ documents.” Aldo Gangemi, “Introducing Pattern-Based Design for Legal Ontologies,” in Joost Breuker, Pompeu Casanovas, Michel C.A. Klein and Enrico Francesconi, eds., Law, Ontologies and the Semantic Web: Channelling the Legal Information Flood (IOS Press, 2009), at 53.

[8] Enrico Francesconi, Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language (Springer 2008).

[9] “Creating and developing ontologies requires domain expertise and the ability to capture this knowledge in a clean conceptual model.” Roberta Cuel, Olga Morozova, Markus Rohde, Elena Simperl, Katharina Siorpaes, Oksana Tokarchuk, Torben Wiedenhoefer, and Fahri Yetim, Motivation Mechanisms for Participation in Human-Driven Semantic Content Creation, 1 International Journal of Knowledge Engineering and Data Mining 331 (2011), doi: 10.1504/IJKEDM.2011.040653.

[10] This approach of working with faculty and other scholars from the legal academy would be similar to the “socio-legal” approach referenced by Dr. Casellas in her post regarding her Institute of Law and Technology project. Dr. Adam Z. Wyner has also advocated web-based collaborative ontology development where legal professionals contribute to a free, open ontology for law. Dr. Adam Z. Wyner, “Legal Concepts Spin a Semantic Web”, Law Technology News, http://www.law.com/jsp/lawtechnologynews10/PubArticleLTN.jsp?id=1202431256007&slreturn=1 (June 8, 2009).

[11] Stephanie Davidson, Way Beyond Legal Research: Understanding the Research Habits of Legal Scholars, 102 Law Library Journal 561 (2010), http://www.aallnet.org/main-menu/Publications/llj/LLJ-Archives/Vol-102/publljv102n04/2010-32.pdf; Richard A. Danner, Supporting Scholarship: Thoughts on the Role of the Academic Librarian, 39 Journal of Law & Education 365-386 (2010), http://scholarship.law.duke.edu/faculty_scholarship/2071/ .

[12] Robert Richards, Legal Information Systems & Legal Informatics Resources: Knowledge Representation: Legal (Selected) http://www.personal.psu.edu/rcr5122/Ontologies.html; Robert Richards, Legal Information Systems & Legal Informatics Resources: General Resources for Application to Law, http://www.personal.psu.edu/rcr5122/OntologiesGeneral.html; Joost Breuker, Pompeu Casanovas, Michael C.A. Klein, and Enrico Francesconi, eds., Law, Ontologies and the Semantic Web: Channeling the Legal Information Flood (IOS Press 2009), at 12 (table of 23 ontologies).

[13] Alan Ruttenberg et al., Life Sciences on the Semantic Web: The Neurocommons and Beyond. Briefings in Bioinformatics, 10(2): 193-204 (2009), doi: 10.1093/bib/bbp004 (“The NeuroCommons project seeks to make all scientific research materials – research articles, knowledge bases, research data, physical materials – as available and as usable as they can be. We do this by fostering practices that render information in a form that promotes uniform access by computational agents – sometimes called ‘interoperability’. We want knowledge sources to combine easily and meaningfully, enabling semantically precise queries that span multiple information sources.”).

[14] Benjamin Keele and Michelle Pearse, How Law Libraries Can Help Law Journals Publish Better (poster session presented during the 2011 AALL Annual Meeting in Philadelphia, PA on July 23-26, 2011, http://scholarship.law.wm.edu/libpubs/25/).

[15] Tom Boone, Librarians Key to Open Access Electronic Law Reviews, http://tomboone.com/library-laws/2009/09/librarians-key-open-access-electronic-law-reviews (September 3, 2009).

[16] NISO/DCMI, International Bibliographic Standards, Linked Data and the Impact on Library Cataloging (Webinar), http://www.niso.org/news/events/2011/dcmi/linked (August 24, 2011).

[17] Valeri Craigle, Legal Scholarship in the Digital Domain: A Technical Roadmap for Implementing the Durham Statement, Technical Services Law Librarian, at 1 (December 2010), http://www.library.illinois.edu/archives/e-records/aall/8501591a/news/TSLLdecember2010.pdf.

[18] See Dr. Adam Z. Wyner, “Legal Concepts Spin a Semantic Web,” Law Technology News, http://www.law.com/jsp/lawtechnologynews10/PubArticleLTN.jsp?id=1202431256007&slreturn=1 (June 8, 2009) (suggesting Web-based collaborative ontology development where legal professionals contribute to a free, open ontology for law).

[19] Wayne Miller, A Foundational Proposal for Making the Durham Statement Real, http://scholarship.law.duke.edu/faculty_scholarship/2325/ (suggesting founding an organization “whose mission is to guarantee the ongoing viability and availability of all publications that adhere to the Durham Statement’s call to action, hereinafter called the Durham Statement Foundation.”); Richard A. Danner, Kelly Leong and Wayne Miller, The Durham Statement Two Years Later: Open Access in the Law School Journal Environment, 103 Law Library Journal 39, 52 (2011), http://scholarship.law.duke.edu/faculty_scholarship/2358/ (noting that the Durham Statement “calls for law schools to end print publication in a planned and coordinated effort led by the legal education community”).

[20] Some examples include the Northeast Foreign Law Libraries Cooperative Group and “B2F2” (currently in the process of being renamed) with Boston area law librarians.

[21] John Wilbanks, “Licensing and Ontologies: Research from Creative Commons,” http://ontolog.cim3.net/file/work/IPR/OOR-IPR-01_IPR-landscape_2010-09-09/licensing-n-ontologies–JohnWilbanks-CC_20100909.pdf (September 9, 2010).

Michelle Pearse is the Research Librarian for Open Access Initiatives and Scholarly Communication at the Harvard Law School Library where she manages implementation of the law school’s open access policy for its faculty, and other projects related to scholarly communication and open access to legal information and scholarship. She is also involved in efforts to archive born-digital content for the collection, and provides research services to faculty and staff.

VoxPopuLII is edited by Judith Pratt. Editor-in-Chief is Robert Richards, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer in the Cornell LII Lawyer Directory.

Semantic Enhancement of Legal Information… Are We Up for the Challenge? [Revised Repost]

Cross-language legal information retrieval, information retrieval, knowledge management, Legal knowledge representation, Legal ontologies, Legal semantic web, Linked Data, Linked Data and law, Multilingual legal information retrieval, Semantic Web and law 1 Response »

Jan 18 2011

[Editor’s Note: We are republishing here, with some corrections, a post by Dr. Núria Casellas that appeared earlier on VoxPopuLII.]

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 1980s, have not yet been left behind.

Similar claims may be found nowadays. On the one hand, the amount of unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. On the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.), have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

For example, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EuroVoc), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.
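
The difference between purely syntactic matching and a synonym-enhanced search can be sketched in a few lines of Python. This is only a toy illustration: the three documents, the synonym list, and the function names are invented, and a real system would draw its synonyms from a maintained thesaurus.

```python
# Toy corpus: document id -> text. All content is invented for illustration.
DOCUMENTS = {
    "doc1": "The tenant must vacate the premises when the lease ends.",
    "doc2": "A lessee who holds over may owe double rent.",
    "doc3": "The court dismissed the copyright claim.",
}

# Hand-made synonym ring standing in for a thesaurus (an assumption).
SYNONYMS = {
    "tenant": {"tenant", "lessee", "renter"},
}


def tokens(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {word.strip(".,;:").lower() for word in text.split()}


def exact_search(term):
    """Syntactic matching: ids of documents containing the literal term."""
    return sorted(d for d, text in DOCUMENTS.items() if term in tokens(text))


def expanded_search(term):
    """Match the term or any synonym listed for it."""
    wanted = SYNONYMS.get(term, {term})
    return sorted(d for d, text in DOCUMENTS.items() if wanted & tokens(text))


print(exact_search("tenant"))     # ['doc1']
print(expanded_search("tenant"))  # ['doc1', 'doc2']
```

The exact search misses the second document even though it concerns the same legal role, because the word “tenant” never appears in it.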

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Thus, the Semantic Web is envisaged as an extension of the current Web, which now comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

These efforts also include the Web of Data (or Linked Data), which relies on standards (URIs, HTTP, and RDF) to allow interrelated datasets to be accessed and queried, for example through a SPARQL endpoint (e.g., Govtrack.us, US census data, etc.). Sharing and connecting data on the Web in compliance with the Linked Data principles enables the exploitation of content from different Web data sources with the development of search, browse, and other mashup applications. (See the Linking Open Data cloud diagram by Cyganiak and Jentzsch below.) [Editor’s Note: Legislation.gov.uk also applies Linked Data principles to legal information, as John Sheridan explains in his recent post.]

Thus, to allow semantics to be added to the current Web, new languages and tools (ontologies) were needed, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts are formalized as classes and defined with axioms, enriched with the description of attributes or constraints, and properties.

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake). In this stack, higher layers depend on lower layers (and the latter are inherited from the original Web). These languages include XML (eXtensible Markup Language), a markup language usually used to add structure to documents, and the so-called ontology languages: RDF/RDFS (Resource Description Framework/Schema), OWL, and OWL 2 (Web Ontology Language). While the RDF language offers simple descriptive information about the resources on the Web, encoded in sets of triples of subject (a resource), predicate (a property or relation), and object (a resource or a value), RDFS allows the description of sets (classes) of resources. OWL offers an even more expressive language to define structured ontologies (e.g., class disjointness, union, or equivalence).
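
The triple structure is easier to see in a small example. The sketch below models triples in plain Python and writes them out in an N-Triples-like form. The rdf:type, rdfs:subClassOf, and Dublin Core title URIs are the standard ones; the http://example.org/law/ namespace, the act, and the class names are hypothetical, and a real project would more likely use an RDF library than hand-rolled strings.

```python
from typing import NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
DC_TITLE = "http://purl.org/dc/terms/title"
EX = "http://example.org/law/"  # hypothetical namespace for this sketch


class Triple(NamedTuple):
    subject: str                  # a resource (URI)
    predicate: str                # a property or relation (URI)
    obj: str                      # a resource (URI) or a literal value
    obj_is_literal: bool = False


def to_ntriples(t):
    """Serialize one triple; literals get minimal quote/backslash escaping."""
    if t.obj_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = '"%s"' % escaped
    else:
        obj = "<%s>" % t.obj
    return "<%s> <%s> %s ." % (t.subject, t.predicate, obj)


GRAPH = [
    # RDF: simple descriptive statements about a resource.
    Triple(EX + "act/1", RDF_TYPE, EX + "Statute"),
    Triple(EX + "act/1", DC_TITLE, "An Example Act", True),
    # RDFS: a statement about a class of resources.
    Triple(EX + "Statute", RDFS_SUBCLASS_OF, EX + "LegalDocument"),
]

for triple in GRAPH:
    print(to_ntriples(triple))
```

The first two triples describe an individual resource, while the third describes a class, which is the step that RDFS adds on top of RDF.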

Moreover, a specification to support the conversion of existing thesauri, taxonomies, or subject headings into RDF triples has recently been published: the Simple Knowledge Organization System (SKOS) standard. These specifications may be exploited in Linked Data efforts, such as the New York Times vocabularies. EuroVoc, the multilingual thesaurus for activities of the EU, is also now available in this format.

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

OpenCyc: an open source version of the Cyc general ontology;
SUMO: the Suggested Upper Merged Ontology;
the upper ontologies PROTON (PROTo Ontology) and DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering);
the FRBRoo model (which represents bibliographic information);
the RDF representation of Dublin Core;
the Gene Ontology;
the FOAF (Friend of a Friend) ontology.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LKIF-Core Ontology, the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts. Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, privacy compliance, patents, cases (e.g., Legal Case OWL Ontology), judicial proceedings, legal systems, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of text mining techniques towards ontology learning from legal texts; while others concentrate on the analysis of legal theories and related materials to extract and formalize legal concepts. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology development and validation.

In this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judiciary to enhance search and retrieval capabilities of a Web-based frequently-asked-question system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Current research is focused on overcoming these problems through the establishment of gold standards in concept extraction and ontology learning from texts, and the idea of collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently an important topic of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, more specific discussions are taking place in the legal domain: for example, the discussion of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, and the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, legal information and Linked Data possibilities, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor’s Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use every day, so it’s in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Tom Gruber, Where the Social Web Meets the Semantic Web (video);
Sandro Hawke, How the Semantic Web Works;
Kevin Hemenway, The Semantic Web: 1-2-3;
Jim Hendler et al., Introduction to the Semantic Web (video);
Ivan Herman, Introduction to the Semantic Web;
Brian Lowe, Introduction to Ontologies: Adding Meaning to Metadata;
Marek Obitko, Introduction to Ontologies and Semantic Web;
Sean B. Palmer, The Semantic Web: An Introduction;
Ioana Robu et al., An Introduction to the Semantic Web for Health Sciences Librarians;
Barry Smith, Ontology: An Introduction: Video: How to Build an Ontology;
University of Manchester, CO-ODE, Tutorial: A Practical Introduction to Ontologies and OWL;
Dr. Adam Z. Wyner, Legal Ontologies Spin a Semantic Web.]

Dr. Núria Casellas is a visiting researcher at the Legal Information Institute at Cornell University. She is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School (on leave). She has participated in several national and European-funded research projects regarding legal ontologies and legal knowledge management: these concern the acquisition of knowledge in judicial settings (IURISERVICE), modeling privacy compliance regulations (NEURONA), drafting legislation (DALOS), and the Legal Case Study of the Semantically Enabled Knowledge Technologies (SEKT VI Framework project), among others. Co-editor of the IDT Series, she holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD (“Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”).

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Robert Richards.

Legislation.gov.uk

free access to law, Legal descriptive metadata, Legal identifiers, Legal knowledge representation, Legal metadata, Legal XML, Legislative information systems, Public access to legal information, Semantic Web and law 19 Responses »

Aug 15, 2010

The launch of legislation.gov.uk by The [UK] National Archives marks a step change in public access to a primary source of legal information for citizens in the UK. Legislation.gov.uk is extensive, covering the four jurisdictions that make up the United Kingdom (England, Scotland, Wales and Northern Ireland) and over 800 years of history.

John Sheridan, Head of e-Services and Strategy at The National Archives, writes:

First, some background

We had two objectives with legislation.gov.uk: to deliver a high quality public service for people who need to consult, cite, and use legislation on the Web; and to expose the UK’s Statute Book as data, for people to take, use, and re-use for whatever purpose or application they wish. In particular, our aim was to show how the statute book can contribute to the growing Web of data as well as to the Web of documents.

Legislation.gov.uk replaces two predecessor services the UK government set up to provide access to legislation. The first was created by Her Majesty’s Stationery Office (HMSO), later to become the Office of Public Sector Information (OPSI), which is responsible for the official publication of legislation, and the London, Belfast and Edinburgh Gazettes. The functions of HMSO have been operating from The National Archives since 2006, including the provision of public access to legislation online. HMSO started publishing new legislation on the Web in 1996. Where HMSO and later OPSI provided access to legislation as it was enacted or made, a second service was developed, to provide access to the UK Statute Law Database. This contains revised versions of primary legislation, showing how they have changed over time.

Browsing the many different types of legislation in the UK

As in the United States, most lawyers in the UK rely on pay-for commercial legal research services. The people using the government’s online legislation service are generally not lawyers, but are drawn from a much wider group of people who need to know, cite, or use legislation as part of their job. These can range from police officers, to head teachers, to citizens defending their rights. Our users are people who need to know what a statute says, and who go looking for it using Google. They then quickly find their way to legislation.gov.uk.

What do people think they are seeing?

Before starting work on legislation.gov.uk, we did some research into the users of both the OPSI service and the UK Statute Law Database service. This research showed that they were very well used (over 1.5 million unique visitors per month to www.opsi.gov.uk), but that most of the people accessing legislation on the Web were not clear about the status of the material they were looking at. Our research showed that many people using legislation online assume that what they are looking at is both current and in force, simply because it is on the Web and available from an official source. Often users were accessing the original or as-enacted version of a statute, not knowing that they should be looking at the revised version, or that a revised version even existed.

Intuitive presentation

Our job is to present legislative material in such a way that the context and status of the information are clear. Legislation is complicated to understand; for example, an Act may have multiple sections, each with a different commencement date, or the Act may have prospective provisions. With legislation.gov.uk we have tried to develop a user interface that makes the status of each Act clear, so people know whether the statute they are viewing is current and in force. The usability challenge is to align what people think they are seeing with what they are actually looking at. We have done this by presenting both an original (see, e.g., here) and a latest-available version (see, e.g., here) of each Act, and a toggle between the two.

For more advanced users there is a timeline (see, e.g., here) which can be turned on to see how the legislation has changed and to navigate through an Act at particular points in time, including future or prospective versions.

Point in time navigation and the timeline

Open data

On the surface, legislation.gov.uk is an attractive Website, providing simple and direct access to legislation; at legislation.gov.uk people can view whole Acts, or a particular section, in either HTML (see, e.g., here) or in a print version in PDF (see, e.g., here). To achieve this, under the hood two very different sources of data have been combined. The data model for the original (or as-enacted) versions of legislation is largely driven by the typographic layout of legislative documents. For revised legislation, the data model is largely driven by version control, the management of multiple versions of different segments of a statute at different points in time. Reconciling these two different data models was a prerequisite step to developing our system.

An ‘on the fly’ created PDF

We aimed to make legislation.gov.uk a source of open data from the outset. The case for open legal data has been made powerfully by people like Carl Malamud and the Law.Gov campaign. Our desire to make the statute book available as open data motivated a number of technology choices we made. For example, the legislation.gov.uk Website is built on top of an open Application Programming Interface (API). The same API is available for others to use to access the raw data.

Using the API

The simplest way to get hold of the underlying data on legislation.gov.uk is to go to a piece of legislation on the Website, either a whole item, or a part or section, and just append /data.xml or /data.rdf to the URL. So, the data for, say, Section 1 of the Communications Act 2003, which is at http://www.legislation.gov.uk/ukpga/2003/21/section/1, is available at http://www.legislation.gov.uk/ukpga/2003/21/section/1/data.xml. We have taken a similar approach with lists, both in browse and search results. When looking at any list of legislation on legislation.gov.uk, it is easy to view the data. Simply append /data.feed to return that list in ATOM. (See, e.g., here.)
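
For readers who want to script this, the suffix convention can be wrapped in a couple of small helpers. This is only a sketch in Python: the function names data_url and feed_url are made up for illustration, and the simple string append assumes a page URL without a query string (search-result URLs may need more care). It builds the URLs only and makes no network request.

```python
def data_url(page_url, fmt="xml"):
    """Return the data URL for a legislation page, as XML or RDF."""
    if fmt not in ("xml", "rdf"):
        raise ValueError("fmt must be 'xml' or 'rdf'")
    # Drop any trailing slash so we do not produce '//data.xml'.
    return page_url.rstrip("/") + "/data." + fmt


def feed_url(list_url):
    """Return the ATOM feed URL for a list of legislation."""
    return list_url.rstrip("/") + "/data.feed"


section = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
print(data_url(section))          # the XML for Section 1 of the Communications Act 2003
print(data_url(section, "rdf"))   # the RDF metadata for the same section
```

The resulting URLs can then be fetched with any HTTP client.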

Open standards have played an important role throughout the development of legislation.gov.uk. All the data is held in XML, using a native XML database. The application logic is similarly constructed using open standards, in XSLTs and XQueries. Data and application portability were key objectives. We made considerable use of open source software like Orbeon Forms, Squid, and Apache.

The XML conforms to the Crown Legislation Markup Language (CLML) and associated schema. More general interchange formats for legislation such as CEN MetaLex lack the expressive power we need for UK legislation, but could relatively easily be wrapped around the XML we are making available. We have sought to surface richer metadata about legislation using RDF, but we would welcome feedback from users of the XML data about whether a MetaLex wrapper would be useful. (Note: We have used the MetaLex vocabulary in our RDF along with FRBR, as discussed below.) Similarly, it should be relatively easy to add a wrapper for the OAI-PMH protocol on top of the API we have built. We are not yet clear who would make use of such a service, if we built one, or whether we should leave the creation of an OAI-PMH interface to others. It is another open issue where we would welcome some feedback.

Persistent URIs

A major influence on legislation.gov.uk was a blog posting by Rick Jelliffe for O’Reilly’s XML.com. Jelliffe writes about something he calls PRESTO. He describes this as a system for legislation and public information in which “all documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs.”

Persistent URIs to pieces of legislation are very important, as they are to sources of law more generally. Initiatives like LegisLink, which Joe Carmel has written about here, attempt to retrofit a reliable naming scheme for legislation onto existing document-based systems. The URN:LEX namespace aims to facilitate the process of creating URIs for legal sources independent of a document’s online availability, location, and access mode.

We wanted to create high quality, persistent URIs for UK legislation from the outset. There are a number of different ways one might assign an unequivocal identifier to a legislative document. We have decided to use HTTP URIs and see no particular advantage in using URNs over HTTP URIs and indeed some disadvantages with URNs. Most importantly, HTTP URIs are actionable names. The advantage is that there is a built-in, ready-made, widely deployed and cost-effective resolution mechanism for resolving the identifier to a document, and a document to a representation. Having said that, we would consider supporting URN:LEX URNs in addition to our own URI Set, and would greatly welcome feedback from the community on this issue, so please do comment if you have a view.

So, it follows, there are three types of URI for legislation on legislation.gov.uk, namely, identifier URIs, document URIs and representation URIs. Identifier URIs are for the ‘concept’ of a piece of legislation, how it was, how it is, and how it will be. (See, e.g., here.) Our use of these follows the Linked Data principles: the identifier URI is for a so-called non-information resource, something which can’t be conveyed in an electronic message. In other words, the URI is for the notion of a piece of legislation, rather than a particular rendition of it in a document. These URIs have been designed following the guidelines the UK Government has created for URI Sets, which our work helped to shape.

With legislation.gov.uk we support content negotiation, and follow the httpRange-14 resolution approach of responding to a request for the ‘non-information resource’ URI with a 303 response which redirects to a document URI.
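
The redirect behaviour can be modelled in a few lines. The Python sketch below is a toy model, not our server code: it assumes that identifier URIs are distinguished by an /id/ path prefix and that the document URI is the same path without that prefix, and it ignores content negotiation altogether. The function name respond is hypothetical.

```python
ID_PREFIX = "/id/"  # assumed marker for identifier (non-information resource) URIs


def respond(path):
    """Return (status, headers) for a request path in this toy model."""
    if path.startswith(ID_PREFIX):
        # The identifier names the legislation itself, which cannot be sent
        # over the wire, so answer 303 See Other and point to a document URI.
        return 303, {"Location": "/" + path[len(ID_PREFIX):]}
    # Anything else is treated as a document URI that can be served directly.
    return 200, {}


print(respond("/id/ukpga/2003/21/section/1"))
print(respond("/ukpga/2003/21/section/1"))
```

A client that follows redirects ends up at a document it can read, while the identifier URI remains available as a stable name for the legislation itself.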

Our document URIs refer to particular documents on the Web, for example the current, in-force version of a particular section of an Act. (See, e.g., here.) Crucially there are also point-in-time URIs for documents, which show how that Act stood on a particular date (/yyyy-mm-dd) (see, e.g., here), or how it was when originally made (/enacted) (see, e.g., here). For any document we can return different representations or formats: a Web page on legislation.gov.uk, the underlying XML, a PDF, an HTML snippet, or even some RDF metadata. We recommend that people cite UK legislation in HTML by pointing to the identifier URI and by using the rel="cite" attribute in the anchor tag.
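
As a sketch of how these conventions might be used from code, the Python helpers below build a point-in-time or as-enacted URI from a document URI, and an HTML citation link with rel="cite". The function names are hypothetical, and the identifier URI in the example assumes an /id/ path prefix.

```python
from datetime import date
from html import escape


def version_url(document_url, version=None):
    """Return the URI for the current version, a dated version, or 'enacted'."""
    base = document_url.rstrip("/")
    if version is None:
        return base                             # latest available version
    if isinstance(version, date):
        return base + "/" + version.isoformat() # the /yyyy-mm-dd form
    if version == "enacted":
        return base + "/enacted"                # as originally made
    raise ValueError("version must be None, a date, or 'enacted'")


def cite_link(identifier_url, text):
    """Return an HTML anchor that cites legislation by its identifier URI."""
    return '<a rel="cite" href="%s">%s</a>' % (escape(identifier_url), escape(text))


doc = "http://www.legislation.gov.uk/ukpga/2003/21/section/1"
print(version_url(doc, date(2010, 8, 15)))
print(version_url(doc, "enacted"))
print(cite_link("http://www.legislation.gov.uk/id/ukpga/2003/21/section/1",
                "Communications Act 2003, s. 1"))
```

Whether a given section actually has a version for a particular date is a matter for the service; the helper only constructs the address.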

Of course, we quickly discovered, it is one thing to suggest a design approach like PRESTO, and quite another to actually implement it. Jeni Tennison, who, working as a consultant to The Stationery Office, devised the URI Set for legislation (and much else about the legislation.gov.uk system), has blogged about the limitations of PRESTO and XPath-based URLs. I hope Jeni will find the time to blog some more about legislation.gov.uk, as there are many stories to be told.

One of the earliest pieces of design work we did for legislation.gov.uk was the URI Set. We wanted to follow PRESTO principles, but also account for changes over time, and for some of the peculiarities of UK legislation, in particular different geographic extents. (See, e.g., here.) PRESTO thinking is very evident on legislation.gov.uk; just look at the URLs as you move through the site.

Linked Data

We were also keen that the UK’s Statute Book make a contribution to the growing Web of Linked Data. The UK government is working hard to publish government data using Linked Data standards as part of work on data.gov.uk. The idea of the Web of Linked Data is to connect related information across the Web based on its meaning. In practice this means creating names for things (by ‘thing’ I mean anything: people, places, ideas) using HTTP, and when someone requests some information about that thing, returning data about it, ideally using RDF.

Legislation can make an important contribution to the Web of Linked Data. First, many important concepts and ideas are formally defined by statute. For example, there are 27 types of school in the UK and each one has a statutory definition. (See, e.g., here and here.) What it means to be a private limited company is again defined by statute, as are the UK’s eight data protection principles. One of our objectives with legislation.gov.uk is to enable people creating vocabularies and ontologies to exploit these definitions. This can be done, for example, by using the skos:definition property, to link terms in a vocabulary to the statute. The idea is to ease the process of rooting the Semantic Web in legally defined concepts. Part of the value of this linking is that it enables automatic checking to determine whether a part of the statute book has been repealed, in which case the related concept no longer exists. Crucially, legislation.gov.uk gives accurate information about when a section is repealed, by what piece of legislation, and when that repeal comes into force.

At the moment, the RDF from legislation.gov.uk is limited to largely bibliographic information. We have made use of the Functional Requirements for Bibliographic Records (FRBR) and the MetaLex vocabularies, primarily to relate the different types of resource we are making available. FRBR has the notion of a work, expressions of that work, manifestations of those expressions, and items. Similarly, MetaLex has the concepts of a BibliographicWork and BibliographicExpression. In the context of legislation.gov.uk, the identifier URIs relate to the work. Different versions of the legislation (current, original, different points in time, or prospective) relate to different expressions. The different formats (HTML, HTML Snippets, XML, and PDF) relate to the different manifestations. We have also made extensive use of Dublin Core Terms, for example to reflect that different versions apply to geographic extents. This is important as, for example, the same section of a statute may have been amended in one way as it applies in Scotland and in another way for England and Wales. We think FRBR, MetaLex, and Dublin Core Terms have all worked well, individually and in combination, for relating the different types of resource that we are making available.
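
The work, expression, and manifestation layering can be pictured as a small data structure. The following Python sketch is illustrative only: the class and field names are not taken from the FRBR or MetaLex vocabularies, and the identifier, version labels, and extents are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifestation:
    fmt: str                      # e.g. "html", "xml", "pdf"


@dataclass
class Expression:
    version: str                  # e.g. "enacted" or a point-in-time label
    extent: List[str]             # jurisdictions this version applies to
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    identifier: str               # plays the role of the identifier URI
    expressions: List[Expression] = field(default_factory=list)

    def versions_for(self, jurisdiction):
        """List the versions of this work that apply in a jurisdiction."""
        return [e.version for e in self.expressions if jurisdiction in e.extent]


WORK = Work(
    identifier="example-act/section/1",  # hypothetical
    expressions=[
        Expression("enacted", ["England", "Wales", "Scotland"],
                   [Manifestation("xml"), Manifestation("pdf")]),
        Expression("current-england-wales", ["England", "Wales"],
                   [Manifestation("html")]),
        Expression("current-scotland", ["Scotland"], [Manifestation("html")]),
    ],
)

print(WORK.versions_for("Scotland"))
```

The example mirrors the situation described above, in which one section has been amended differently for Scotland than for England and Wales: a single work, with separate current expressions for each extent.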

One challenge we have is with changes to legislation that have yet to be applied to the data by the editorial team. Since we know what these effects are, we have also tried to represent this in RDF. We have used the MetaLex vocabulary to do this, but the result is complicated to interpret, and thus we suspect difficult for users of the data. MetaLex does not aid the elegant expression of amendment information (such as: statute A is changed by statute B, but only when commencement order C brings that change into force). We will be developing our own light-weight ontology for expressing some of these relationships, with the primary focus on ease of querying our data, rather than creating an ontology with the expressive power to be a cross-jurisdictional model.

It should then be possible to align this ontology with others post hoc. Our current use of RDF — and the potential to do more — is another issue where we would welcome feedback from the community.
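To make the amendment example concrete (statute A is changed by statute B, but only when commencement order C brings that change into force), here is a minimal sketch of the kind of easily queried record such a light-weight model might hold. The `Effect` class, its field names, and the labels are all hypothetical; this is not the ontology the team describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Effect:
    affected: str                # statute A, the legislation being changed
    affecting: str               # statute B, the legislation making the change
    commenced_by: Optional[str]  # commencement order C, or None if not needed


def effects_in_force(effects, orders_made):
    """Return the effects whose commencement order (if any) has been made."""
    return [
        e for e in effects
        if e.commenced_by is None or e.commenced_by in orders_made
    ]


# Placeholder labels only; these are not real legislation identifiers.
effects = [
    Effect("statute-A", "statute-B", commenced_by="order-C"),
    Effect("statute-A", "statute-D", commenced_by=None),
]

print(effects_in_force(effects, orders_made=set()))
print(effects_in_force(effects, orders_made={"order-C"}))
```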

Early adopters

People have already started to make use of the legislation.gov.uk URIs to support their Linked Data. One example is a project by ESD Toolkit. They have created a SKOS vocabulary for all the different types of service that Local Authorities need to provide. They have linked this vocabulary to the powers and duties placed on Local Authorities in the legislation, using legislation.gov.uk identifier URIs. They have also used the API to pull back some of the text of the relevant statutes.

The future

We think there is huge potential over the next few years in the development of “accountable systems”. These are systems that are explicitly aware of statutory and other legal requirements and are able to process information explicitly in a way that complies with the (ever-changing) law. Here the legislation URIs can help enormously, either for people seeking to develop such accountable systems or any time someone wants to integrate an external system with the official source for statutory information. If the API is used in this way, we will need to consider carefully whether, and if so, how, the data is authenticated. We are not currently supplying digitally signed versions of UK legislation (unlike the GPO in the US) but we will be supporting the use of HTTPS, to provide a reasonable level of secure access to the data. However, if the data starts to be increasingly used in a new generation of accountable systems, we may need to address authenticity, with a view to increasing the guarantees we can make over the data.

There is much more we can do with legislation as data. Parts of the statute book are surprisingly well structured. For example, every year there are one or more Appropriation Acts. These typically contain a schedule with a table listing each government department, the amount allocated to it by Parliament for the year, and what that department’s objectives are (see, e.g., here). It wouldn’t take much to create an XSLT just for these tables in the Appropriation Acts, working from the XML provided by the API, to extract this data from all the Appropriation Acts and publish it as Linked Data. There are many other examples of almost-structured data in legislation, waiting to be freed by developers, now that they have easy access to the underlying source.
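As a rough sketch of that extraction step, the snippet below pulls department and amount pairs out of a table. It uses Python's standard library rather than the XSLT suggested above, and the element names (`schedule`, `row`, `department`, `amount`) and figures are invented for illustration; they are not the legislation.gov.uk XML schema or real allocations.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and figures: NOT the real legislation.gov.uk schema.
SAMPLE = """
<schedule>
  <table>
    <row><department>Department A</department><amount>1000</amount></row>
    <row><department>Department B</department><amount>2500</amount></row>
  </table>
</schedule>
"""


def extract_allocations(xml_text):
    """Map each department name to the amount listed in its table row."""
    root = ET.fromstring(xml_text)
    return {
        row.findtext("department"): int(row.findtext("amount"))
        for row in root.iter("row")
    }


allocations = extract_allocations(SAMPLE)
print(allocations)
```

A real version would need to follow the actual schema and cope with the formatting of the amounts, which this sketch assumes are plain integers.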

We see this as a start. There is still much to do if we are to realise the potential of the statute book as a public source of data. We are aiming to improve the modelling and the quantity of RDF data we make available about legislation, but it’s what others will do with the data that is really interesting. Now the UK has opened its statute book as Linked Data, we are keen to share our work with other governments, and to engage with academics in the legal informatics community and others with an interest in exploiting this rich source of information.

John Sheridan is Head of e-Services and Strategy at The [UK] National Archives, where he leads the team responsible for legislation.gov.uk. He is a specialist in official publishing on the Web, and in using Linked Data standards for government information. He also co-chairs the W3C eGovernment Interest Group.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Weaving the Legal Semantic Web with Natural Language Processing

Annotation of legal texts, information retrieval, Legal knowledge representation, Legal semantic web, Legal text mining, Legal text processing, Legal XML, natural language processing, Semantic annotation of legal texts, Semantic Web and law

May 17, 2010

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, before we can make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, or Japanese), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these is a large, complex, and heterogeneous topic of research; in this short post, we can only hope to touch on a fragment, and one heavily biased towards our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appear to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.
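The two requirements just described (every tag is closed, and tags nest without crossing) are what an XML parser checks as well-formedness. As a small illustration, Python's standard-library parser accepts the nested fragment and rejects the crossed one; the helper name `is_well_formed` is ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


nested = (
    "<fullname><firstname>Jane</firstname>"
    "<lastname>Smith</lastname></fullname>"
)
# <lastname> opens inside <firstname> but closes outside it: tags cross over.
crossed = (
    "<fullname><firstname>Jane<lastname>"
    "</firstname>Smith</lastname></fullname>"
)

print(is_well_formed(nested))
print(is_well_formed(crossed))
```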

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.
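The query and the inference can be sketched in a few lines. The snippet below is only a stand-in: it uses Python's standard-library XML parser in place of a real query language, and an ordinary function in place of a RuleML or SWRL rule. The helper names are ours, and the rule is hard-wired to the plaintiff as the client.

```python
import xml.etree.ElementTree as ET

CASE = """
<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>
           <lastname>Hill</lastname></fullname>
    </plaintiff>
    <attorney>
      <fullname><firstname>Jane</firstname>
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>
"""


def full_name(role_element):
    """Join the first and last name found under a role element."""
    name = role_element.find("fullname")
    return f"{name.findtext('firstname')} {name.findtext('lastname')}"


def privileged_pairs(case_xml):
    """Rule stand-in: an attorney may not divulge the client's secrets."""
    root = ET.fromstring(case_xml)
    pairs = []
    for rel in root.iter("legalrelationship"):
        attorney = rel.find("attorney")
        client = rel.find("plaintiff")
        if attorney is not None and client is not None:
            pairs.append((full_name(attorney), full_name(client)))
    return pairs


# Query: who is the attorney in the case?
root = ET.fromstring(CASE)
attorneys = [full_name(a) for a in root.iter("attorney")]
print("Attorney in the case:", attorneys)

# Inference: apply the privilege rule to each attorney/client pair.
for attorney, client in privileged_pairs(CASE):
    print(f"{attorney} cannot divulge {client}'s secrets")
```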

Though it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts (hundreds of thousands if not millions of documents), which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at some time in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com .

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as:

grammatical constructions (passive or active sentence forms, quotation, reference to other individuals, and so on),
grammatical relations among terms (e.g., whether an individual is the agent or target of some action),
ontological relations (e.g., classes and subclasses of experts, or the relationships among courts in the judicial hierarchy),
relationships among elements (e.g., who works for what organization), or
high-level patterns such as legal arguments (e.g., expert testimony) and fact patterns (e.g., culpable intent).

People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Finally, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

Generation – converting information in a database into natural language.
Understanding – converting natural language into a machine-readable form.
Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There are a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are somewhat of a black box in that we cannot really know the rules that are learned and use them further.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

Sentence segmentation – dividing text into sentences.
Tokenisation – words identified by spaces between them.
Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
Named entity recognition – the entities in the text such as organisations, people, and places.
Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.
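The cascade idea, where each component annotates the text and hands its result to the next, can be illustrated with a toy pipeline. This is not GATE and does not use its API; the functions, the regular expressions, and the small role list are assumptions made for the sketch, and only three simple steps are shown.

```python
import re

# Hypothetical look-up list; a real system would use much larger lists.
ROLE_LIST = {"plaintiff", "defendant", "attorney"}


def split_sentences(text):
    """Sentence segmentation: naive split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Tokenisation: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag_roles(tokens):
    """Look-up step: label tokens that appear in the role list."""
    return [
        (tok, "ROLE" if tok.lower() in ROLE_LIST else "O")
        for tok in tokens
    ]


def pipeline(text):
    """Run the components in order, each feeding the next."""
    return [tag_roles(tokenise(s)) for s in split_sentences(text)]


result = pipeline(
    "Harris Hill, plaintiff. Jane Smith, attorney for the plaintiff."
)
for sentence in result:
    print(sentence)
```

The naive sentence splitter would break on abbreviations such as "Inc.", which is one reason real components rely on richer lists and rules.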

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

Case citation.
Names of parties.
Roles of parties, meaning plaintiff or defendant.
Type of court.
Names of judges.
Names of attorneys.
Roles of attorneys, meaning the side they represent.
Final decision.
Cases cited.
Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)
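One of the simpler items in this list, the roles of parties, can be sketched as a single list-and-rule step that turns plain text into XML annotation. The pattern and the role list below are hypothetical and deliberately naive (two capitalised words, a comma, then a listed role); they are not the rules used in the study.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical role list and rule: "<First> <Last>, <role>".
ROLES = ("plaintiff", "defendant", "attorney")
RULE = re.compile(r"([A-Z][a-z]+) ([A-Z][a-z]+), (%s)\b" % "|".join(ROLES))


def annotate(text):
    """Wrap each matched name in role and name tags inside a <legalcase>."""
    case = ET.Element("legalcase")
    for first, last, role in RULE.findall(text):
        role_el = ET.SubElement(case, role)
        name_el = ET.SubElement(role_el, "fullname")
        ET.SubElement(name_el, "firstname").text = first
        ET.SubElement(name_el, "lastname").text = last
    return ET.tostring(case, encoding="unicode")


annotated = annotate(
    "Harris Hill, plaintiff. Jane Smith, attorney for the plaintiff."
)
print(annotated)
```

The rule records that Jane Smith is an attorney but not whom she represents; capturing the "attorney for" relationship would take a further rule.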

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above:

[Figure: Annotation of a Criminal Case]

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

Semantic Enhancement of Legal Information… Are We Up for the Challenge?

Cross-language legal information retrieval, information retrieval, knowledge management, Legal knowledge representation, Legal ontologies, Legal semantic web, Multilingual legal information retrieval, Semantic Web and law

Feb 15, 2010

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 80s, have not yet been left behind.

Similar claims may be found nowadays. On the one hand, the amount of unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. On the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, and the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright), have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

Nowadays, in the search and retrieval area, we still perform most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EUROVOC), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.
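The difference between exact term matching and a synonym-enhanced search can be shown with a toy example. The two documents and the single thesaurus entry below are invented for illustration; they are not drawn from EUROVOC or any real database.

```python
# Invented mini-corpus and thesaurus entry, for illustration only.
DOCUMENTS = {
    "doc1": "the plaintiff filed a complaint",
    "doc2": "the claimant lodged an appeal",
}
SYNONYMS = {"plaintiff": {"plaintiff", "claimant"}}


def exact_search(term, documents):
    """Syntactic matching: return documents containing the exact term."""
    return sorted(
        doc_id for doc_id, text in documents.items()
        if term in text.split()
    )


def synonym_search(term, documents, synonyms):
    """Expand the term with its synonyms before matching."""
    terms = synonyms.get(term, {term})
    return sorted(
        doc_id for doc_id, text in documents.items()
        if terms & set(text.split())
    )


print(exact_search("plaintiff", DOCUMENTS))
print(synonym_search("plaintiff", DOCUMENTS, SYNONYMS))
```

Even the expanded search still matches strings; it has no notion that a plaintiff is a role played by a party in a case, which is the gap that semantic approaches aim to close.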

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Thus, the Semantic Web (including Linked Data efforts or the Web of Data) is envisaged as an extension of the current Web, which now also comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

Towards that shift, new languages and tools (ontologies) were needed to allow semantics to be added to the current Web, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts formalized as classes (e.g., “Actor”) are defined with axioms, enriched with the description of attributes or constraints (for example, “cardinality”), and linked to other classes through properties (e.g., “possesses” or “is_possessed_by”).
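As a very rough illustration of classes linked by properties, the sketch below stores a small class hierarchy and one property, and lets a subclass use the properties declared for its superclasses. It is plain Python rather than an ontology language such as OWL, it ignores axioms and cardinality constraints, and apart from “Actor” and “possesses” the names are hypothetical.

```python
# Hypothetical hierarchy: each class maps to its direct superclass.
SUBCLASS_OF = {"LegalPerson": "Actor", "Company": "LegalPerson"}

# (domain class, property, range class); "Asset" is an invented class.
PROPERTIES = [("Actor", "possesses", "Asset")]


def ancestors(cls):
    """Walk up the hierarchy and list every superclass of cls."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def properties_of(cls):
    """Properties declared for cls or for any of its superclasses."""
    lineage = [cls] + ancestors(cls)
    return [
        (prop, target)
        for domain, prop, target in PROPERTIES
        if domain in lineage
    ]


print(ancestors("Company"))
print(properties_of("Company"))
```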

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake), in the sense that higher layers depend on lower layers (and the latter are inherited from the original Web). The languages include XML (eXtensible Markup Language), a markup language usually used to add structure to documents, and the so-called ontology languages: RDF (Resource Description Framework), OWL, and OWL 2 (Web Ontology Language). Recently, a specification to support the conversion of existing thesauri, taxonomies, or subject headings into RDF has been released (the SKOS, Simple Knowledge Organization System, standard).

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

OpenCyc,
SUMO,
PROTON,
DOLCE,
the FRBRoo model (used in the above code and graph examples),
the RDF representation of Dublin Core,
the Gene Ontology,
the Wine Ontology, and
the SemanticBible.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts Blue Scene (the basis for the LKIF-Core Ontology). Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, cases, judicial proceedings, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of legal text mining and statistical analysis, in which ontologies are built by means of machine learning from legal texts; while others concentrate on the analysis of legal theories and related materials. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology validation.