
Lessons Gained from Parliamentary Information Visualization (PIV)

The emerging field of Parliamentary Informatics (PI) has opened up new terrain in research on the scope, usefulness and contribution of Informatics to parliamentary openness and legislative transparency. This becomes particularly interesting when visualizations are used as the preferred method for presenting and interpreting parliamentary activity that appears complicated or incomprehensible to the public. In fact, this is one of the core issues discussed not only at PI scientific conferences but also at parliamentary fora and meetings.

Parliamentary Information Visualization (PIV) is an interesting topic, and not only because visualizations are, in most cases, inviting and impressive to the human eye and brain. The main reason is that visual representations reveal different aspects of an issue in a systematic way, ranging from simple parliamentary information (such as voting records) to the profound socio-political issues that lie behind shapes and colors. This article explores some of the aspects related to the visualization of parliamentary information and activity.

Untangling the mystery behind visualized parliamentary information
Recent research on 19 PIV initiatives, presented at CeDEM 2014, has shown that visualizing parliamentary information is a complicated task. Its success depends on several factors: the availability of data and information, the choice of the visualization method, the intention behind the visualizations, and their effectiveness when these technologies are tied to a citizen engagement project.

To begin with, what has impressed us most during our research is the kind of information that was visualized. Characteristics, personal data and performance of Members of Parliament (MPs)/Members of European Parliament (MEPs), as well as political groups and member-states, are the elements most commonly visualized. On the other hand, particular legislative proposals, actions of MPs/MEPs through parliamentary control procedures and texts of legislation are less often visualized, which is, to some extent, understandable due to the complexity of visually depicting long legislative documents and the changes that accompany them.

Gregor Aisch – The Making of a Law visualization

However, visually representing a legislative text and its amendments might reveal important aspects of a bill, such as the time of submission, the specific modifications that have been made, and the articles and paragraphs added to or deleted from the text.

Another interesting aspect is the visualization method used. There is a variety of methods deployed even for the visualization of the same category of Parliamentary Informatics. Robert Kosara notes characteristically: “The seemingly simple choice between a bar and a line chart has implications on how we perceive the data”.

In the same line of thought, in a recent design camp of the Law Factory project, two designer groups independently combined data on law-making processes with an array of visualization methods, in order to bring forward different points of view on the same phenomenon. Indeed, one-method-fits-all approaches cannot be applied when it comes to parliamentary information visualization. A phenomenon can be visualized both quantitatively and qualitatively, and each method can bring different results. Visualizations can therefore serve plain information provision or further exploration, depending on the aspirations of the designer. Enabling user information and exploration are, to some extent, the primary challenges set by PIV designers. However, not all visualization methods permit the same degree of exploration. Sometimes, in-depth exploration is supported by providing further background information that helps end users navigate, comprehend and interpret the visualization.

Beyond information and exploration
Surely, a visualization of MPs’ votes, characteristics, particular legislative proposals or texts of legislation can better inform citizens. But is it enough to make them empowered? The key to this question is interaction. Interaction, whether in the sense of human-computer interaction or human-to-human interaction in a physical or digital context, always refers to a two-way process and effect. Schrage notes succinctly: “Don’t view visualization as a medium that substitutes pictures for words but as interfaces to human interactions that create new opportunities for new value creation.”

When it comes to knowledge gained through this exploration, it is understandable that knowledge is useless if it is not shared. This is a crucial challenge faced by visualization designers, because the creation of platforms that host visualizations and enable further exchange of views and dialogue between users can facilitate citizen engagement. Additionally, information sharing or information provision through an array of contemporary and traditional means (open data, social media, printing, e-mail etc.) can render PIV initiatives more complete and inclusive.

An issue of information representation, or information trustworthiness?
Beyond the technological and functional aspects of parliamentary information visualization, it is interesting to have a look at information management and the relationship between parliaments and Parliamentary Monitoring Organizations (PMOs). As a relevant survey also shows, PMOs serve as hubs for presenting or monitoring the work of elected representatives, and cover a wide range of activities concerning parliamentary development. This, however, might not always be easily accepted by parliaments or MPs, since it may give elected representatives the feeling of being surveilled by these organizations.

To further explain this, questions such as who owns vs. who holds parliamentary information, where and when this information is published, and to what ends, raise deeper issues of information originality, reliability of information sources and trustworthiness of both the information and its owners. For parliaments and politicians, in particular, parliamentary information monitoring and visualization initiatives may be seen as a way to surveil their work and performance, whereas for PMOs themselves these initiatives can be seen as tools for pushing towards transparency of parliamentary work and accountability of elected representatives. This discussion is quite multi-faceted, and goes beyond the scope of this post. What should be kept in mind, however, is that establishing collaboration between politicians/parliaments and civil society requires time, effort, trust and common understanding from all the parties involved. Under these conditions, PIV and PMO initiatives can serve as hubs that bring parliaments and citizens closer, with a view to forming a more trusted relationship.

Towards transparency?

Most PIV initiatives provide information in a way compliant with the principles of the Declaration on Parliamentary Openness. Openness is a necessary condition for transparency. But, then, what is transparency? Is it possible to come up with a definition that accommodates the whole essence of this concept?

Transparent labyrinth by Robert Morris, Nelson-Atkins Museum of Art, Kansas City (Dezeen)

In this quest, it is important to consider that neither openness nor transparency can exist without access to information (ATI). Consequently, availability and accessibility of parliamentary information are fundamental prerequisites in order to apply any technology that will hopefully contribute to inform, empower and help citizens participate in public decision-making.
Apart from that, it is important to look back at the definition, essence and legal nature of Freedom of Information (FOI) and Right to Information (RTI) provisions, as these are stated in the constitution of each country. A closer consideration of the similarities and differences between the terms “Freedom” and “Right”, whose meanings we usually take for granted, can provide important insight into the dependencies between them. Clarifying the meaning and function of these terms in a socio-political system can be a helpful start towards unraveling the notion of transparency.

Still, one thing is for sure: being informed and educated about our rights as citizens, as well as about how to exercise them, is a necessity nowadays. Educated citizens are able not only to comprehend the information available, but also to search further, participate and have their say in decision-making. The Right to Know initiative in Australia, based on the Alaveteli open-source platform, is an example of such an effort. The PIV initiatives researched thus far have shown that citizen engagement is a hard-to-reach goal, which requires constant commitment and striving through a variety of tools and actions. In the long run, the full potential and effectiveness of these constantly evolving initiatives remain to be seen. In this context, legislative transparency remains in itself an open issue with many interesting aspects yet to be explored.

 

The links provided in the post are indicative examples and do not intend to promote initiatives or written materials for commercial or advertising purposes.

 

Aspasia Papaloi is a civil servant in the IT and New Technologies Directorate of the Hellenic Parliament, a PhD Candidate at the Faculty of Communication and Media Studies of the University of Athens and a research fellow of the Laboratory of New Technologies in Communication, Education and the Mass Media, contributing as a Teaching Assistant. She holds an MA with specialization in ICT Management from the University of the Aegean in Rhodes and a Bachelor of Arts in German Language and Literature (Germanistik) from the Aristotle University of Thessaloniki. Her research area involves e-Parliaments with a special focus on visualization for the achievement of transparency.

Dimitris Gouscos is Assistant Professor with the Faculty of Communication and Media Studies of the University of Athens and a research fellow of the Laboratory of New Technologies in Communication, Education and the Mass Media, where he contributes to the co-ordination of two research groups on Digital Media for Learning and Digital Media for Participation. His research interests revolve around applications of digital communication in open governance, participatory media, interactive storytelling and playful learning. More information is available at http://www.media.uoa.gr/~gouscos.

 

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

“To be blunt, there is just too much stuff.” (Robert C. Berring, 1994 [1])

Law is an information profession where legal professionals take on the role of intermediaries towards their clients. Today, those legal professionals routinely use online legal research services like Westlaw and LexisNexis to gain electronic access to legislative, judicial and scholarly legal documents.

Put simply, legal service providers make legal documents available online and enable users to search these text collections in order to find documents relevant to their information needs. For quite some time the main focus of providers has been the addition of more and more documents to their online collections. In contrast to other areas, like Web search, where an increase in the number of available documents has been accompanied by major changes in the search technology employed, the search systems used in online legal research services have changed little since the early days of computer-assisted legal research (CALR).

It is my belief, however, that the search technology employed in CALR systems will have to change dramatically in the coming years. The future of online legal research services will increasingly depend on the systems’ ability to create useful result lists in response to users’ queries. The continuing need to make additional texts available will only speed up the change. Electronic availability of a sufficient number of potentially relevant texts is no longer the main issue; quick findability of a few highly relevant documents among hundreds or even thousands of other potentially relevant ones is.

To reach that goal, from a search system’s perspective, relevance ranking is key. In a constantly growing number of situations – just as Professor Berring stated almost 20 years ago (see above) – even carefully chosen keywords bring back “too much stuff”. Successful ranking, that is, the ordering of search results according to their estimated relevance, becomes the main issue. A system’s ability to correctly assess the relevance of texts for every single individual user, and for every single one of their queries, will quickly become – or has arguably already become in most cases – the next holy grail of computer-assisted legal research.

Until a few years back providers could still successfully argue that search systems should not be blamed for the lack of “theoretically, maybe, sometimes feasible” relevance-ranking capabilities, but rather that users had to be blamed for their lack of search skills. I do not often hear that line of argumentation any longer, which certainly does not have to do with any improvement in the (Boolean) search skills of end users. Representatives of service providers do not dare to follow that line of argumentation any longer, I think, because every single day every one of them uses Google by punching in vague, short queries and still mostly gets back sufficiently relevant top results. Why should this not work in CALR systems?

Indeed. Why, one might ask, is there not more Web search technology in contemporary computer-assisted legal research? According to another often-stressed argument of system providers, computer-assisted legal research is certainly different from Web search. In Web search we typically do not care about low recall as long as this guarantees high precision, while in CALR trading off recall for precision is problematic. But even with those clear differences, I have, for example, not heard a single plausible argument why the cornerstone of modern Web search, link analysis, should not be successfully used in every single CALR system out there.
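As a rough illustration of what citation-based link analysis could look like in a CALR setting, the sketch below computes PageRank over a toy network of case citations. It is only a minimal example under invented data; the case names, the citation edges and the idea of blending the resulting score with keyword relevance are assumptions for illustration, not a description of any existing system.

```python
# A minimal sketch of citation-based link analysis for case law, in the
# spirit of PageRank; the case names and citation edges are invented.
import networkx as nx

# A directed edge (a, b) means "opinion a cites opinion b".
citations = [
    ("case_A", "case_C"), ("case_B", "case_C"),
    ("case_D", "case_C"), ("case_C", "case_E"),
    ("case_D", "case_E"),
]

graph = nx.DiGraph(citations)

# PageRank rewards opinions that are cited by other well-cited opinions;
# such a score could be blended with keyword relevance when ranking results.
authority = nx.pagerank(graph, alpha=0.85)

for case, score in sorted(authority.items(), key=lambda kv: -kv[1]):
    print(f"{case}: {score:.3f}")
```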

These statements certainly are blunt and provocative generalizations. Erich Schweighofer, for example, showed as early as 1999 (pre-mainstream Web), in his aptly named piece “The Revolution in Legal Information Retrieval or: The Empire Strikes Back”, that there had in fact been technological changes in legal information retrieval. And there have also been free CALR systems like PreCYdent that fully employed citation-analysis techniques in computer-assisted legal research and thereby demonstrated – even if they did not manage to stay profitable – “one of the most innovative SE [search engine] algorithms“, according to experts.

An exhaustive and objective discussion of the various factors that contribute to the slow pace of technological change in computer-assisted legal research can be offered neither by myself alone nor in this short post. There is not (yet) more “Google” in CALR for a whole mix of reasons, including system providers’ fear of being held liable for query modifications which might (theoretically) lead to wrong expert advice, and the lack of pressure from potential and existing customers to use more modern search technology.

What I want to highlight, however, is one more general explanation which is seldom put forward explicitly. What slows down technological innovation in online legal research, in my opinion, is also the interest of the whole legal profession in holding on to a conception of “legal relevance” that is immune to any kind of computer algorithm. A successfully deployed, Web search-like ranking algorithm in CALR would, after all, not only produce comfortable, highly relevant search results, but would also reveal certain truths about legal research: the search for documents of high “legal relevance” to a specific factual or legal situation is, in most cases, a process that follows clear rules. Many legal research routines follow clear, pre-defined patterns which could be translated into algorithms. The legal profession will have to accept that truth at some point, and will therefore have to define and communicate “legal relevance” much less mystically and more pragmatically.

Again, at this point, one might ask “Why?” I am certain that if the legal profession, that is, legal professionals and their CALR service providers, does not include up-to-date search technology in its CALR systems, someone else will at some point do so, without much involvement of legal professionals. To be blunt: at this point Google can still serve as an example for our systems; at some point soon it might simply set an example instead of them.

Anton Geist is Law Librarian at WU (Vienna University of Economics and Business) University Library. He holds law degrees from the University of Vienna (2006) and the University of Edinburgh (2010). He is grateful for feedback and discussions and can be contacted at home@antongeist.com.

[1] Berring, Robert C. (1994), Collapse of the Structure of the Legal Research Universe: The Imperative of Digital Information, 69 Wash. L. Rev. 9.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.

[Editor’s Note] For topic-related VoxPopuLII posts please see: Núria Casellas, Semantic Enhancement of legal information … Are we up for the challenge?; Marcie Baranich, HeinOnline Takes a New Approach to Legal Research With Subject Specific Research Platforms; Elisabetta Fersini, The JUMAS Experience: Extracting Knowledge From Judicial Multimedia Digital Libraries; João Lima, et.al, LexML Brazil Project; Joe Carmel, LegisLink.Org: Simplified Human-Readable URLs for Legislative Citations; Robert Richards, Context and Legal Informatics Research; John Sheridan, Legislation.gov.uk

Prosumption: shifting the barriers between information producers and consumers

One of the major revolutions of the Internet era has been the shifting of the frontiers between producers and consumers [1]. Prosumption refers to the emergence of a new category of actors who not only consume but also contribute to content creation and sharing. Under the umbrella of Web 2.0, many sites indeed enable users to share multimedia content, data, experiences [2], views and opinions on different issues, and even to act cooperatively to solve global problems [3]. Web 2.0 has become a fertile terrain for the proliferation of valuable user data enabling user profiling, opinion mining, trend and crisis detection, and collective problem solving [4].

The private sector has long understood the potential of user data and has used them for analysing customer preferences and satisfaction, for finding sales opportunities, for developing marketing strategies, and as a driver for innovation. Recently, corporations have relied on Web platforms for gathering new ideas from clients on the improvement of existing products and services or the development of new ones (see for instance Dell’s Ideastorm; salesforce’s IdeaExchange; and My Starbucks Idea). Similarly, Lego’s Mindstorms encourages users to share their robot-building projects online, by which the design becomes public knowledge and can be freely reused by Lego (and anyone else), as indicated by the Terms of Service. Furthermore, companies have recently been mining social network data to foresee future actions of the Occupy Wall Street movement.

Even scientists have caught up and adopted collaborative methods that enable the participation of laymen in scientific projects [5].

Now, how far has government gone in taking up this opportunity?

Some recent initiatives indicate that the public sector is aware of the potential of the “wisdom of crowds.” In the domain of public health, MedWatcher is a mobile application that allows the general public to submit information about any experienced drug side effects directly to the US Food and Drug Administration. In other cases, governments have asked for general input and ideas from citizens, such as the brainstorming session organized by the Obama administration, the wiki launched by the New Zealand Police to get suggestions from citizens for the drafting of a new policing act to be presented to the parliament, or the Website of the Department of Transport and Main Roads of the State of Queensland, which encourages citizens to share their stories related to road tragedies.

Even in so crucial a task as the drafting of a constitution, governments have relied on citizens’ input through crowdsourcing [6]. More recently, several other initiatives have fostered crowdsourcing for constitutional reform in Morocco and in Egypt.

It is thus undeniable that we are witnessing an accelerated redefinition of the frontiers between experts and non-experts, scientists and non-scientists, doctors and patients, public officers and citizens, professional journalists and street reporters. The ‘Net has provided the infrastructure and the platforms for enabling collaborative work. Network connection is hardly a problem anymore. The problem is data analysis.

In other words: how to make sense of the flood of data produced and distributed by heterogeneous users? And more importantly, how to make sense of user-generated data in the light of more institutional sets of data (e.g., scientific, medical, legal)? The efficient use of crowdsourced data in public decision making requires building an informational flow between user experiences and institutional datasets.

Similarly, enhancing user access to public data has to do with matching user case descriptions with institutional data repositories (“What are my rights and obligations in this case?”; “Which public office can help me”?; “What is the delay in the resolution of my case?”; “How many cases like mine have there been in this area in the last month?”).

From the point of view of data processing, we are clearly facing a problem of semantic mapping and data structuring. The challenge is thus to overcome the flood of isolated information while avoiding excessive management costs. There is still a long way to go before tools for content aggregation and semantic mapping are generally available. This is why private firms and governments still mostly rely on the manual processing of user input.

The new producers of legally relevant content: a taxonomy

Before digging deeper into the challenges of efficiently managing crowdsourced data, let us take a closer look at the types of user-generated data flowing through the Internet that have some kind of legal or institutional flavour.

One type of user data emerges spontaneously from citizens’ online activity, and can take the form of:

  • citizens’ forums
  • platforms gathering open public data and comments on them (see for instance data-publica)
  • legal expert blogs (blawgs)
  • or the journalistic coverage of the legal system.

User data can as well be prompted by institutions as a result of participatory governance initiatives, such as:

  • crowdsourcing (targeting a specific issue or proposal by government as an open brainstorming session)
  • comments and questions addressed by citizens to institutions through institutional Websites or through e-mail contact.

This variety of media supports and knowledge producers gives rise to a plurality of textual genres, semantically rich but difficult to manage given their heterogeneity and quick evolution.

Managing crowdsourcing

The goal of crowdsourcing in an institutional context is to extract and aggregate content relevant for the management of public issues and for public decision making. Knowledge management strategies vary considerably depending on the ways in which user data have been generated. We can think of three possible strategies for managing the flood of user data:

Pre-structuring: prompting the citizen narrative in a strategic way

A possible solution is to elicit user input in a structured way; that is to say, to impose some constraints on user input. This is the solution adopted by IdeaScale, a software application that was used by the Open Government Dialogue initiative of the Obama Administration. In IdeaScale, users are asked to check whether their idea has already been covered by other users, and alternatively to add a new idea. They are also invited to vote for the best ideas, so that it is the community itself that rates and thus indirectly filters the users’ input.

The MIT Deliberatorium, a technology aimed at supporting large-scale online deliberation, follows a similar strategy. Users are expected to follow a series of rules to enable the correct creation of a knowledge map of the discussion. Each post should be limited to a single idea, it should not be redundant, and it should be linked to a suitable part of the knowledge map. Furthermore, posts are validated by moderators, who should ensure that new posts follow the rules of the system. Other systems that implement the same idea are featurelist and Debategraph [7].

While these systems enhance the creation and visualization of structured argument maps and promote community engagement through rating systems, they present a series of limitations. The most important of these is the fact that human intervention is needed to manually check the correct structure of the posts. Semantic technologies can play an important role in bridging this gap.
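To make the idea of rule-checked, pre-structured input a bit more concrete, here is a toy sketch of what machine-assisted enforcement of Deliberatorium-style posting rules might look like. The heuristics and thresholds are invented for illustration and are not how the Deliberatorium, IdeaScale or Debategraph actually work.

```python
# A toy sketch of machine-assisted checking of Deliberatorium-style posting
# rules (one idea per post, no near-duplicates, attachment to the map).
# The heuristics and thresholds are invented for illustration only.
from difflib import SequenceMatcher
from typing import List, Optional

def check_post(text: str, parent_node: Optional[str], existing_posts: List[str]) -> List[str]:
    problems = []
    # Rule 1: a post should carry a single idea; many sentences hint that it does not.
    if text.count(".") + text.count("?") + text.count("!") > 2:
        problems.append("post may contain more than one idea")
    # Rule 2: a post should not repeat an existing contribution.
    for other in existing_posts:
        if SequenceMatcher(None, text.lower(), other.lower()).ratio() > 0.8:
            problems.append("post looks redundant with an existing one")
            break
    # Rule 3: a post must be linked to a node of the knowledge map.
    if not parent_node:
        problems.append("post is not attached to the knowledge map")
    return problems

print(check_post("Extend the consultation period.", "idea:participation", ["Open the data."]))
```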

Semantic analysis through ontologies and terminologies

Ontology-driven analysis of user-generated text implies finding a way to bridge Semantic Web data structures, such as formal ontologies expressed in RDF or OWL, with unstructured implicit ontologies emerging from user-generated content. Sometimes these emergent lightweight ontologies take the form of unstructured lists of terms used for tagging online content by users. Accordingly, some works have dealt with this issue, especially in the field of social tagging of Web resources in online communities. More concretely, different works have proposed models for making compatible the so-called top-down metadata structures (ontologies) with bottom-up tagging mechanisms (folksonomies).

The possibilities range from transforming folksonomies into lightly formalized semantic resources (Lux and Dsinger, 2007; Mika, 2005) to mapping folksonomy tags to the concepts and the instances of available formal ontologies (Specia and Motta, 2007; Passant, 2007). As the basis of these works we find the notion of emergent semantics (Mika, 2005), which questions the autonomy of engineered ontologies and emphasizes the value of meaning emerging from distributed communities working collaboratively through the Web.

We have recently worked on several case studies in which we have proposed a mapping between legal and lay terminologies. We followed the approach proposed by Passant (2007) and enriched the available ontologies with the terminology appearing in lay corpora. For this purpose, OWL classes were complemented with a has_lexicalization property linking them to lay terms.
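As a minimal illustration of this kind of enrichment, the sketch below uses rdflib to attach lay terms to an OWL class through a has_lexicalization annotation property. The namespace, class and terms are placeholders, not the actual ONTOMEDIA or MCO/COM identifiers.

```python
# A minimal rdflib sketch of the enrichment described above: an OWL class
# from a domain ontology is linked to lay terms via a has_lexicalization
# annotation property. The namespace and URIs are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/consumer-mediation#")

g = Graph()
g.bind("ex", EX)

# A domain class and the annotation property used for lay lexicalizations.
g.add((EX.Seller, RDF.type, OWL.Class))
g.add((EX.has_lexicalization, RDF.type, OWL.AnnotationProperty))

# Lay terms gathered from the consumer corpus, attached to the class.
g.add((EX.Seller, EX.has_lexicalization, Literal("venditore", lang="it")))
g.add((EX.Seller, EX.has_lexicalization, Literal("negoziante", lang="it")))

print(g.serialize(format="turtle"))
```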

The first case study that we conducted belongs to the domain of consumer justice, and was framed in the ONTOMEDIA project. We proposed to reuse the available Mediation-Core Ontology (MCO) and Consumer Mediation Ontology (COM) as anchors to legal, institutional, and expert knowledge, and therefore as entry points for the queries posed by consumers in common-sense language.

The user corpus contained around 10,000 consumer questions and 20,000 complaints addressed from 2007 to 2010 to the Catalan Consumer Agency. We applied a traditional terminology extraction methodology to identify candidate terms, which were subsequently validated by legal experts. We then manually mapped the lay terms to the ontological classes. The relations used for mapping lay terms with ontological classes are mostly has_lexicalisation and has_instance.
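For readers unfamiliar with terminology extraction, the following bare-bones sketch shows the general idea: frequent word n-grams in the user corpus are collected as candidate terms and handed to experts for validation. Real pipelines add part-of-speech patterns and statistical termhood measures; the sample questions below are invented.

```python
# A bare-bones sketch of candidate term extraction: count frequent word
# n-grams in a user corpus and keep the recurring ones for expert validation.
import re
from collections import Counter

questions = [
    "il mio telefono cellulare non funziona",
    "ho un problema con la linea telefonica",
    "il telefono cellulare e arrivato rotto",
]

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

counts = Counter()
for q in questions:
    tokens = re.findall(r"\w+", q.lower())
    for n in (1, 2, 3):
        counts.update(" ".join(g) for g in ngrams(tokens, n))

# Candidate terms: n-grams seen more than once, to be validated by experts.
candidates = [term for term, c in counts.most_common() if c > 1]
print(candidates)
```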

A second case study in the domain of consumer law was carried out with Italian corpora. In this case domain terminology was extracted from a normative corpus (the Code of Italian Consumer law) and from a lay corpus (around 4000 consumers’ questions).

In order to further explore the particularities of each corpus with respect to the semantic coverage of the domain, terms were gathered together into a common taxonomic structure [8]. This task was performed with the aid of domain experts. When confronted with the two lists of terms, both laypersons and technical experts would link most of the validated lay terms to the technical list of terms through one of the following relations (a small sketch of how such mappings might be recorded follows the list):

  • Subclass: the lay term denotes a particular type of legal concept. This is the most frequent case. For instance, in the class objects, telefono cellulare (cell phone) and linea telefonica (phone line) are subclasses of the legal terms prodotto (product) and servizio (service), respectively. Similarly, in the class actors agente immobiliare (estate agent) can be seen as subclass of venditore (seller). In other cases, the linguistic structures extracted from the consumers’ corpus denote conflictual situations in which the obligations have not been fulfilled by the seller and therefore the consumer is entitled to certain rights, such as diritto alla sostituzione (entitlement to a replacement). These types of phrases are subclasses of more general legal concepts such as consumer right.
  • Instance: the lay term denotes a concrete instance of a legal concept. In some cases, terms extracted from the consumer corpus are named entities that denote particular individuals, such as Vodafone, an instance of a domain actor, a seller.
  • Equivalent: a legal term is used in lay discourse. For instance, contratto (contract) or diritto di recessione (withdrawal right).
  • Lexicalisation: the lay term is a lexical variant of the legal concept. This is the case for instance of negoziante, used instead of the legal term venditore (seller) or professionista (professional).
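As announced above, here is a simple sketch of how such lay-to-legal mappings might be recorded before being loaded into an ontology. The relation names mirror the four cases in the list and the term pairs come from the examples just given; the helper function is hypothetical.

```python
# A small sketch of recording lay-to-legal term mappings as plain data;
# the relation labels follow the four cases described in the list above.
LAY_TO_LEGAL = [
    ("telefono cellulare", "subclass",       "prodotto"),
    ("linea telefonica",   "subclass",       "servizio"),
    ("agente immobiliare", "subclass",       "venditore"),
    ("Vodafone",           "instance",       "venditore"),
    ("contratto",          "equivalent",     "contratto"),
    ("negoziante",         "lexicalisation", "venditore"),
]

def legal_anchors(lay_term: str) -> list:
    """Return the legal terms a lay term is mapped to, whatever the relation."""
    return [legal for lay, _, legal in LAY_TO_LEGAL if lay == lay_term]

print(legal_anchors("negoziante"))  # ['venditore']
```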

The distribution of normative and lay terms per taxonomic level shows that, whereas normative terms populate mostly the upper levels of the taxonomy [9], deeper levels in the hierarchy are almost exclusively represented by lay terms.

Term distribution per taxonomic level

The result of this type of approach is a set of terminological-ontological resources that provide some insight into the nature of laypersons’ cognition of the law, such as the fact that citizens’ domain knowledge is mainly factual and therefore populates the deeper levels of the taxonomy. Moreover, such resources can be used for the further processing of user input. However, this strategy presents some limitations as well. First, it is mainly driven by domain conceptual systems, which might limit the potential of user-generated corpora. Second, such resources are not necessarily scalable. In other words, these terminological-ontological resources have to be rebuilt for each legal subdomain (such as consumer law, private law, or criminal law), and it is thus difficult to foresee mechanisms for performing an automated mapping between lay terms and legal terms.

Beyond domain ontologies: information extraction approaches

One of the most important limitations of ontology-driven approaches is the lack of scalability. In order to overcome this problem, a possible strategy is to rely on informational structures that occur generally in user-generated content. These informational structures go beyond domain conceptual models and identify mostly discursive, emotional, or event structures.

Discursive structures formalise the way users typically describe a legal case. It is possible to identify stereotypical situations appearing in the descriptions of legal cases by citizens (i.e., the nature of the problem, the conflict resolution strategies, etc.). The core of these situations is usually a predicate, so it is possible to formalize them as frame structures containing different frame elements. We followed such an approach for the mapping of the Spanish corpus of consumers’ questions to the classes of the domain ontology (Fernández-Barrera and Casanovas, 2011). The same technique was applied for mapping a set of citizens’ complaints in the domain of acoustic nuisances to a legal domain ontology (Bourcier and Fernández-Barrera, 2011). By describing general structures of citizens’ descriptions of legal cases we ensure scalability.
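As a hedged illustration of what such a frame structure might look like as a data structure, consider the sketch below. The frame name, its elements and the filled example are invented for illustration and do not reproduce the frames used in the cited studies.

```python
# A sketch of a frame structure for citizen case descriptions; the frame
# and its elements are illustrative, not those of the cited studies.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComplaintFrame:
    predicate: str                            # the core predicate, e.g. "complain"
    complainant: Optional[str] = None         # who reports the problem
    problem: Optional[str] = None             # nature of the problem
    object_of_complaint: Optional[str] = None # what the complaint is about
    resolution_sought: Optional[str] = None   # conflict resolution strategy

frame = ComplaintFrame(
    predicate="complain",
    complainant="consumer",
    problem="acoustic nuisance at night",
    resolution_sought="inspection by the municipality",
)
print(frame)
```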

Emotional structures are extracted by current algorithms for opinion and sentiment mining. User data in the legal domain often contain a substantial number of subjective elements (especially in the case of complaints and feedback on public services) that could be effectively mined and used in public decision making.

Finally, event structures, which have been deeply explored so far, could be useful for information extraction from user complaints and feedback, or for automatic classification into specific types of queries according to the described situation.

Crowdsourcing in e-government: next steps (and precautions?)

Legal prosumers’ input currently outstrips the capacity of government to extract meaningful content in a cost-efficient way. Some developments are under way, among them argument-mapping technologies and semantic matching between legal and lay corpora. The scalability of these methodologies is the main obstacle to overcome, in order to enable the matching of user data with open public data in several domains.

However, as technologies for the extraction of meaningful content from user-generated data develop and are used in public-decision making, a series of issues will have to be dealt with. For instance, should the system developer bear responsibility for the erroneous or biased analysis of data? Ethical questions arise as well: May governments legitimately analyse any type of user-generated content? Content-analysis systems might be used for trend- and crisis detection; but what if they are also used for restricting freedoms?

The “wisdom of crowds” can certainly be valuable in public decision making, but the fact that citizens’ online behaviour can be observed and analysed by governments without citizens’ acknowledgement poses serious ethical issues.

Thus, technical development in this domain will have to be coupled with the definition of ethical guidelines and standards, maybe in the form of a system of quality labels for content-analysis systems.

[Editor’s Note: For earlier VoxPopuLII commentary on the creation of legal ontologies, see Núria Casellas, Semantic Enhancement of Legal Information… Are We Up for the Challenge? For earlier VoxPopuLII commentary on Natural Language Processing and legal Semantic Web technology, see Adam Wyner, Weaving the Legal Semantic Web with Natural Language Processing. For earlier VoxPopuLII posts on user-generated content, crowdsourcing, and legal information, see Matt Baca and Olin Parker, Collaborative, Open Democracy with LexPop; Olivier Charbonneau, Collaboration and Open Access to Law; Nick Holmes, Accessible Law; and Staffan Malmgren, Crowdsourcing Legal Commentary.]


[1] The idea of prosumption existed actually long before the Internet, as highlighted by Ritzer and Jurgenson (2010): the consumer of a fast food restaurant is to some extent as well the producer of the meal since he is expected to be his own waiter, and so is the driver who pumps his own gasoline at the filling station.

[2] The experience project enables registered users to share life experiences, and it contained around 7 million stories as of January 2011: http://www.experienceproject.com/index.php.

[3] For instance, the United Nations Volunteers Online platform (http://www.onlinevolunteering.org/en/vol/index.html) helps volunteers to cooperate virtually with non-governmental organizations and other volunteers around the world.

[4] See for instance the experiment run by mathematician Gowers on his blog: he posted a problem and asked a large number of mathematicians to work collaboratively to solve it. They eventually succeeded faster than if they had worked in isolation: http://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/.

[5] The Galaxy Zoo project asks volunteers to classify images of galaxies according to their shapes: http://www.galaxyzoo.org/how_to_take_part. See as well Cornell’s projects Nestwatch (http://watch.birds.cornell.edu/nest/home/index) and FeederWatch (http://www.birds.cornell.edu/pfw/Overview/whatispfw.htm), which invite people to introduce their observation data into a Website platform.

[6] http://www.participedia.net/wiki/Icelandic_Constitutional_Council_2011.

[7] See the description of Debategraph in Marta Poblet’s post, Argument mapping: visualizing large-scale deliberations (http://serendipolis.wordpress.com/2011/10/01/argument-mapping-visualizing-large-scale-deliberations-3/).

[8] Terms have been organised in the form of a tree having as root nodes nine semantic classes previously identified. Terms have been added as branches and sub-branches, depending on their degree of abstraction.

[9] It should be noted that legal terms are mostly situated at the second level of the hierarchy rather than the first one. This is natural if we take into account the nature of the normative corpus (the Italian consumer code), which contains mostly domain specific concepts (for instance, withdrawal right) instead of general legal abstract categories (such as right and obligation).

REFERENCES

Bourcier, D., and Fernández-Barrera, M. (2011). A frame-based representation of citizen’s queries for the Web 2.0. A case study on noise nuisances. E-challenges conference, Florence 2011.

Fernández-Barrera, M., and Casanovas, P. (2011). From user needs to expert knowledge: Mapping laymen queries with ontologies in the domain of consumer mediation. AICOL Workshop, Frankfurt 2011.

Lux, M., and Dsinger, G. (2007). From folksonomies to ontologies: Employing wisdom of the crowds to serve learning purposes. International Journal of Knowledge and Learning (IJKL), 3(4/5): 515-528.

Mika, P. (2005). Ontologies are us: A unified model of social networks and semantics. In Proc. of Int. Semantic Web Conf., volume 3729 of LNCS, pp. 522-536. Springer.

Passant, A. (2007). Using ontologies to strengthen folksonomies and enrich information retrieval in Weblogs. In Int. Conf. on Weblogs and Social Media, 2007.

Poblet, M., Casellas, N., Torralba, S., and Casanovas, P. (2009). Modeling expert knowledge in the mediation domain: A Mediation Core Ontology, in N. Casellas et al. (Eds.), LOAIT- 2009. 3rd Workshop on Legal Ontologies and Artificial Intelligence Techniques joint with 2nd Workshop on Semantic Processing of Legal Texts. Barcelona, IDT Series n. 2.

Ritzer, G., and Jurgenson, N. (2010). Production, consumption, prosumption: The nature of capitalism in the age of the digital “prosumer.” In Journal of Consumer Culture 10: 13-36.

Specia, L., and Motta, E. (2007). Integrating folksonomies with the Semantic Web. Proc. Euro. Semantic Web Conf., 2007.

Meritxell Fernández-Barrera is a researcher at CERSA (Centre d’Études et de Recherches de Sciences Administratives et Politiques), CNRS, Université Paris 2. She works on the application of natural language processing (NLP) to legal discourse and legal communication, and on the potential of Web 2.0 for participatory democracy.

VoxPopuLII is edited by Judith Pratt. Editor-in-Chief is Robert Richards, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer in the Cornell LII Lawyer Directory.

[Editor’s Note: We are republishing here, with some corrections, a post by Dr. Núria Casellas that appeared earlier on VoxPopuLII.]

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, despite the first ideas about the computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims such as Hafner’s that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 1980s, have not yet been left behind.

Similar claims may be found nowadays. On the one hand, the amount of unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. On the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information and knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.), has renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

For example, in the search and retrieval area, we still perform most legal searches nowadays in online or application databases using keywords (that we believe to be contained in the document that we are searching for), perhaps together with a combination of Boolean operators, or supported by a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EuroVoc), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

From Web 2.0 to Web 3.0

Thus, the Semantic Web is envisaged as an extension of the current Web, which now comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

These efforts also include the Web of Data (or Linked Data), which relies on the existence of standard formats (URIs, HTTP and RDF) to allow the access and query of interrelated datasets, which may be granted through a SPARQL endpoint (e.g., Govtrack.us, US census data, etc.). Sharing and connecting data on the Web in compliance with the Linked Data principles enables the exploitation of content from different Web data sources with the development of search, browse, and other mashup applications. (See the Linking Open Data cloud diagram by Cyganiak and Jentzsch below.) [Editor’s Note: Legislation.gov.uk also applies Linked Data principles to legal information, as John Sheridan explains in his recent post.]

Linking Open Data cloud diagram (Cyganiak and Jentzsch)
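As a minimal illustration of querying such interrelated data with SPARQL, the sketch below runs a query over a tiny in-memory RDF graph using rdflib; against a live endpoint the same query would simply be sent over HTTP. The dataset and namespace are invented for illustration.

```python
# A minimal sketch of querying RDF data with SPARQL using rdflib.
# The data and namespace are placeholders, not a real published dataset.
from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/parliament#> .
ex:bill42 ex:title "Consumer Protection Amendment" ;
          ex:sponsoredBy ex:member7 .
ex:member7 ex:name "Jane Doe" .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

QUERY = """
PREFIX ex: <http://example.org/parliament#>
SELECT ?title ?sponsorName WHERE {
  ?bill ex:title ?title ;
        ex:sponsoredBy ?sponsor .
  ?sponsor ex:name ?sponsorName .
}
"""

for title, sponsor in g.query(QUERY):
    print(title, "-", sponsor)
```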

Thus, to allow semantics to be added to the current Web, new languages and tools (ontologies) were needed, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, where concepts are formalized as classes and defined with axioms, enriched with the description of attributes or constraints, and properties.

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake). In this stack, higher layers depend on lower layers (and the latter are inherited from the original Web). These languages include XML (eXtensible Markup Language), a markup language usually used to add structure to documents, and the so-called ontology languages: RDF/RDFS (Resource Description Framework/Schema), OWL, and OWL2 (Ontology Web Language). While the RDF language offers simple descriptive information about the resources on the Web, encoded in sets of triples of subject (a resource), predicate (a property or relation), and object (a resource or a value), RDFS allows the description of sets. OWL offers an even more expressive language to define structured ontologies (e.g., class disjointness, union, or equivalence).
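To make the layering a little more tangible, here is a small rdflib sketch that mixes the three levels just described: plain RDF triples, an RDFS subclass axiom, and OWL class declarations. The legal vocabulary is a placeholder, not one of the published legal ontologies.

```python
# A small sketch of the RDF / RDFS / OWL layering using rdflib; the legal
# vocabulary here is invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/legal#")
g = Graph()
g.bind("ex", EX)

# OWL layer: declare two classes.
g.add((EX.LegalAct, RDF.type, OWL.Class))
g.add((EX.Statute, RDF.type, OWL.Class))

# RDFS layer: every statute is a legal act.
g.add((EX.Statute, RDFS.subClassOf, EX.LegalAct))

# RDF layer: simple subject-predicate-object descriptions of a resource.
g.add((EX.ConsumerCode, RDF.type, EX.Statute))
g.add((EX.ConsumerCode, RDFS.label, Literal("Italian Consumer Code")))

print(g.serialize(format="turtle"))
```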

Moreover, a specification to support the conversion of existing thesauri, taxonomies or subject headings into RDF triples has recently been published: SKOS, the Simple Knowledge Organization System standard. These specifications may be exploited in Linked Data efforts, such as the New York Times vocabularies. Also, EuroVoc, the multilingual thesaurus covering the activities of the EU, is now available in this format, for example.
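A brief sketch of what a thesaurus entry expressed in SKOS looks like, again with rdflib; the concept and labels are illustrative and are not taken from EuroVoc or the New York Times vocabularies.

```python
# A brief sketch of a SKOS concept with preferred/alternative labels and a
# broader relation; the concepts and URIs are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/thesaurus#")
g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

g.add((EX.consumerProtection, RDF.type, SKOS.Concept))
g.add((EX.consumerProtection, SKOS.prefLabel, Literal("consumer protection", lang="en")))
g.add((EX.consumerProtection, SKOS.altLabel, Literal("protection des consommateurs", lang="fr")))
g.add((EX.consumerProtection, SKOS.broader, EX.consumerPolicy))

print(g.serialize(format="turtle"))
```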

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

  • OpenCyc: an open source version of the Cyc general ontology;
  • SUMO: the Suggested Upper Merged Ontology;
  • the upper ontologies PROTON (PROTo Ontology) and DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering);
  • the FRBRoo model (which represents bibliographic information);
  • the RDF representation of Dublin Core;
  • the Gene Ontology;
  • the FOAF (Friend of a Friend) ontology.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LKIF-Core Ontology, the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts. Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).
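Purely as a toy illustration of these core notions, the sketch below models a norm, a legal person and the three deontic operators as plain data structures. Real core ontologies such as LKIF-Core define these concepts far more richly in OWL; nothing here reproduces their actual structure.

```python
# A toy, non-authoritative sketch of core legal notions (Norm, Legal Person,
# deontic operators) as plain data structures; invented for illustration.
from dataclasses import dataclass
from enum import Enum

class Deontic(Enum):
    OBLIGATION = "obligation"
    PERMISSION = "permission"
    PROHIBITION = "prohibition"

@dataclass
class LegalPerson:
    name: str

@dataclass
class Norm:
    operator: Deontic
    addressee: LegalPerson
    action: str

norm = Norm(Deontic.OBLIGATION, LegalPerson("seller"),
            "inform the consumer of the withdrawal right")
print(f"{norm.addressee.name} has an {norm.operator.value} to {norm.action}")
```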

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, privacy compliance, patents, cases (e.g., Legal Case OWL Ontology), judicial proceedings, legal systems, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of text mining techniques towards ontology learning from legal texts; while others concentrate on the analysis of legal theories and related materials to extract and formalize legal concepts. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology development and validation.

In this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judiciary to enhance the search and retrieval capabilities of a Web-based frequently-asked-question system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Current research is focused on overcoming these problems through the establishment of gold standards for concept extraction and ontology learning from texts, and through the collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, the evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently important topics of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, more particularized discussions are taking place in the legal domain: for example, discussions of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and of the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, and the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, legal information and Linked Data possibilities, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor’s Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use everyday, so it’s in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Dr. Núria Casellas is a visiting researcher at the Legal Information Institute at Cornell University. She is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School (on leave). She has participated in several national and European-funded research projects regarding legal ontologies and legal knowledge management: these concern the acquisition of knowledge in judicial settings (IURISERVICE), modeling privacy compliance regulations (NEURONA), drafting legislation (DALOS), and the Legal Case Study of the Semantically Enabled Knowledge Technologies (SEKT VI Framework project), among others. Co-editor of the IDT Series, she holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD (“Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”).

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Robert Richards.

There has been much discussion on this blog about law-related information retrieval systems, ontologies, and metadata. Today, I’d like to take you into another corner of legal informatics: rule-based legal information systems. I’ll tell you what they are, what their strengths and limitations are, and how they’re made. I’ll also explain why I’m optimistic about their potential to expand public access to law and to improve the way legal expertise is deployed and consumed.

First, what are they?

A rule-based expert system represents knowledge of a particular domain — such as medicine, finance, or law — in the form of “if-then” rules. Here’s an example of a rule:

the employee is entitled to standard FMLA leave IF
the employee is an eligible employee AND
the reason for the leave is enumerated in 29 U.S.C. § 2612

A rule consists of a bunch of variables (here, three Boolean statements) together with some logical operators (if, then, and, or, not, mathematical operators, etc.). Rules are chained together to form a rulebase, which is basically a database of rules. “Chained together” means that the rules connect to each other: a condition in one rule is the consequent or conclusion in another rule. For example, here’s a rule that links to our first rule:

the reason for the leave is enumerated in 29 U.S.C. § 2612 IF
the employee needs to care for a newborn child OR
the employee is becoming an adoptive or foster parent OR
the employee’s relative has a serious health condition OR
the employee cannot perform their job due to a serious health condition

Each of the conditions in this new rule can be defined by yet more rules. And other rules can sprout off of the main rule tree to form a complex web of inference. If we were to visualize such a network of rules, it might begin to look something like this:

rulebase_visualization4.jpg

The rulebase inputs are shown in blue and the outputs – or “goals” – are highlighted in orange. The core function of the inference engine (or rule engine) is to figure out what conclusions can be drawn from the input facts. Also, given incomplete information, an inference engine will figure out what additional facts are needed in order to reach one of the goals.
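
To make the chaining and inference concrete, here is a minimal sketch, in Python, of how the two FMLA rules above might be represented and run. This is not OPM, Drools, or any of the products mentioned below; the dictionary layout, the attribute names, and the tiny forward-chaining loop are illustrative assumptions only.

RULEBASE = {
    "entitled to standard FMLA leave": {
        "all": ["is an eligible employee",
                "leave reason is enumerated in 29 U.S.C. § 2612"],
    },
    "leave reason is enumerated in 29 U.S.C. § 2612": {
        "any": ["needs to care for a newborn child",
                "is becoming an adoptive or foster parent",
                "relative has a serious health condition",
                "cannot perform job due to a serious health condition"],
    },
}

def infer(facts):
    """Forward-chain over RULEBASE: derive whatever conclusions follow from
    the known facts, and list the input-level facts still unanswered."""
    known = dict(facts)                      # attribute -> True/False
    changed = True
    while changed:
        changed = False
        for goal, body in RULEBASE.items():
            if goal in known:
                continue
            conds = body.get("all", body.get("any", []))
            values = [known.get(c) for c in conds]
            if "all" in body and all(v is True for v in values):
                known[goal], changed = True, True
            elif "any" in body and any(v is True for v in values):
                known[goal], changed = True, True
    inputs = {c for body in RULEBASE.values()
              for c in body.get("all", []) + body.get("any", [])
              if c not in RULEBASE}
    missing = sorted(inputs - set(known))    # crude stand-in for follow-up questions
    return known, missing

conclusions, still_needed = infer({
    "is an eligible employee": True,
    "needs to care for a newborn child": True,
})
print(conclusions["entitled to standard FMLA leave"])   # True
print(still_needed)                                      # the unanswered input facts

A real engine also handles negation, uncertainty, goal-driven questioning, and much more; the point here is simply that a rulebase is data, and the inference engine is a small, generic procedure that walks over it.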

Rule-based systems in context

From this extremely simple example we can start to get a sense of the strengths and limitations of rule-based representations of legal knowledge. Let’s start with the strengths. First, the law, to a significant degree, seems to consist of rules, and representing them in a constrained, logical language is fairly straightforward and natural. As a result, rule-based systems are transparent: the system code looks a lot like the text that’s being represented. This “isomorphism” means that you can trace the system logic back to the original source material, easily spot errors, and quickly adapt to changes in the law. Furthermore, rule-based systems can justify their determinations by explaining how they arrived at a particular conclusion and by providing audit trails. It’s also fairly easy for people to interact with rule-based systems, as they integrate well with interviews. In short, it’s relatively easy to put legal knowledge into rule-based systems, easy to maintain it, and easy to get it out.

But all this simplicity comes at a price: the sophistication of the knowledge that can be represented. For one thing, common sense knowledge does not lend itself to simple rule-based representations, as the decades-long Cyc project illustrates. A significant portion of my own rule-authoring effort is spent representing mundane concepts, like figuring out whether a given date falls on a legal holiday or counting the number of weeks in which a given condition is true. Secondly, there’s the problem of how to model vague or “open-textured” concepts. For instance, if a liability determination turns upon whether a person’s conduct was “reasonable”, the uncertainty and fuzziness of that term can’t be modeled in a way analogous to human thinking. A third limitation facing rule-based systems is the “knowledge acquisition bottleneck”: the effort required to codify, test, and validate expert domain knowledge. Part of the challenge derives from the reasons I’ve already mentioned, and part results from the need to capture the knowledge of human subject matter experts who don’t always think in complete and precise “if-then” constructs. Another criticism often lodged against legal expert systems is that law is in essence not rule-based but is instead a fray of competing textual interpretations which cannot be accurately modeled.
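
To give a feel for that mundane layer, here is the sort of helper I mean, written as plain Python rather than in any rule language; the function and the example condition are hypothetical, not code from any of the systems discussed.

from datetime import date, timedelta

def weeks_where(start, end, condition):
    """Count Monday-based calendar weeks between start and end in which
    condition(day) is true on at least one day inside the range."""
    week_start = start - timedelta(days=start.weekday())   # Monday of the first week
    count = 0
    while week_start <= end:
        week = [week_start + timedelta(days=i) for i in range(7)]
        if any(start <= d <= end and condition(d) for d in week):
            count += 1
        week_start += timedelta(days=7)
    return count

# Hypothetical condition: the employee was on leave on Tuesdays during March.
on_leave = lambda d: d.month == 3 and d.weekday() == 1
print(weeks_where(date(2024, 3, 1), date(2024, 4, 15), on_leave))

Getting even this small calculation right, including its edge cases (week boundaries, partial weeks), is where a surprising share of rule-authoring time goes.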

My view is that, even given these limitations, there are still many problems that can be solved by rule-based systems. No one is asking them to solve all legal automation problems, or claiming that all legal knowledge can be represented in the form of rules. (Part of why little attention is paid to these systems today is that they were over-hyped during the artificial intelligence boom of the 1970s and 80s.) But there is a place for them, and that place is quite large even given the semantic confines that I just described. Rule-based systems are ideal for encoding legal principles found in statutes, regulations, and agency decisions — that is, law that’s explicit and knowable, but logically complicated. And there are millions of pages of such law, across thousands of jurisdictions around the world, just waiting to be embedded in rule-based systems.

Let me give you a few examples of what rule-based information systems can do, although chances are that you’ve already encountered one. Perhaps, like millions of American taxpayers, you used TurboTax tax preparation software to file your taxes this year. This and other tax preparation programs interview you about your income and finances, perform a multitude of behind-the-scenes calculations, and then fill out the relevant tax forms for you. I don’t actually know how this software was constructed, but if I were doing it I would absolutely take a rule-based approach. In fact, my team did use a rule engine when tasked to build a tax law advisory system for the IRS. That system, the Interactive Tax Assistant, answers seven common tax questions, is driven by about 1,300 rules, and contains around 200 question screens. Rule-based design can also produce systems like the Australian Visa Wizard, DirectLaw, and The Benefit Bank. Other rule-driven systems work behind the scenes at government agencies and corporations to process claims by making fast, consistent, and transparent decisions.

Available tools

In my view, the premier tool for engineering rule-based legal information systems is Oracle Policy Modeling (OPM, formerly known as Haley Office Rules, RuleBurst, and Softlaw). (Full disclosure: I used to work for Oracle.) OPM lets you write natural language rules that capture statutory text, calculations, date and time-based reasoning, and basic ontological relationships. It has decent debugging and rulebase visualization features (that’s how I created the rule network diagram above), and an excellent regression testing facility. OPM lets you deploy rulebases as Web interviews and integrate them into other computer systems. The major downside to OPM is its cost: I understand the list price to be in the ballpark of $100K per license.

You can also model legal rules using other business rule engines, such as ILOG, Blaze Advisor, JBoss Drools (free), and Jess (free). JBoss Drools has a promising feature that lets you create Domain Specific Languages by mapping natural language expressions to the underlying programming code. You could also use traditional logic programming / expert system languages like Prolog or CLIPS, which are extremely powerful but which do not allow for isomorphic representation of the law. OWL-centric ontology editors such as Protege are also beginning to support rule-based knowledge representation.

To address the lack of freely-available, practical legal modeling tools, I’ve been working on Jureeka.org, a project affiliated with Stanford’s CodeX Center for Computers and Law. Jureeka is an open, Web-based rule authoring platform that lets lawyers, law students, and other subject matter experts represent their knowledge as “if-then” rules. Jureeka then uses the rules to generate jurisdiction-specific interviews, which present the relevant topic in a digestible manner. Its strengths are that it’s completely Web-based, it makes navigation of the rules easy, and it lets rule authors work collaboratively to rapidly develop knowledge bases in a wiki-like fashion. The motivating vision is to provide a way for legal knowledge engineers to build topical rulebases, and then connect these modules together to form an information backbone that drives other IT systems and helps the general public get answers to their legal questions.

jureeka_screenshot1.jpg

Jureeka is very much a work in progress, and I’ll be the first to admit that its main weakness is the oversimplicity of its rule syntax. (For example, I’m currently working on an ontology layer and a way to reason across multiple instances of an object or variable.) But this is the type of knowledge-generating project that I’d like to see a developer community coalesce around.

Future potential

Rule-based programming is not the be-all and end-all of legal informatics, but it does have significant untapped potential. Government agencies are beginning to adopt rule-based legal information systems as a way to better serve the public. I think there are also lucrative opportunities available for law firms to seize the first mover advantage by automating slices of the law of interest to consumers. Rule-based systems can help nonprofit organizations advance their missions by guiding constituents through labyrinthine legal processes. And these systems are of obvious benefit to corporations, which need to comply with a variety of regulations across numerous jurisdictions.

Rule-based systems can also benefit the legislative drafting process. For example, an early incarnation of the OPM software helped the Australian Taxation Office simplify that country’s tax code. In addition to this kind of legislative refactoring (which entails clarifying and reorganizing Rube Goldberg-like legal texts), legislatures could also promulgate law in an “inference-ready” machine readable form. That is, portions of the law could be written in a syntax that both humans and machines can read, making the law not only accessible but executable. I’m not merely referring to high-level metadata; I’m talking about code that is intended to be run in an inference engine and that can be deployed as is into society’s computing infrastructure. [See, e.g., Professor Monica Palmirani’s example of legal rules coded in the Legal Knowledge Interchange Format (LKIF) (at slides 48 through 50); please note that this is a 4.5M download.]

Some people have raised the objection that rule-based systems and their creators engage in the unauthorized practice of law by dispensing “legal advice.” I think this concern is overblown and founded upon a lack of understanding of how these systems work. Legal advice entails applying the law to the facts of a particular case or, conversely, interpreting facts in light of the applicable law. Rule-based systems don’t do that.  Instead, they break up complicated legal provisions into atomic pieces and ask users to determine how each atom applies to them. Conceptually, it’s no different than reading a plain language description of legal rules and applying those rules to your own situation.

My goal in this post has been to introduce you to something that you may not have heard about and to convince you that it is a viable and worthwhile activity. Rule-based legal information systems have been around for a few decades, but we still have a long way to go until our rule-based legal modeling tools are as sophisticated as the Mathematica software is in the domain of mathematical computation. As we move in that direction, and as our legal knowledge engineering proficiency grows, we can advance toward the day when all people can take equal advantage of their legal rights. Knowing that they have them is the first step.

Michael Poulshock is a consultant specializing in legal knowledge engineering and a Fellow at Stanford University’s CodeX Center for Computers and Law. He is the creator of Jureeka.org and the Jureeka legal research browser add-on for Firefox and Chrome. He was previously a human rights lawyer.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these is a large, complex, and heterogeneous topic of research; in this short post, we can only hope to touch on a fragment, and one heavily biased toward our own interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appear to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags form a tree structure: pairs of tags must be properly nested within one another, with no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.
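
As a rough illustration of the querying step, the following sketch uses a plain XPath-style query (via Python’s standard library) over the tagged fragment above to answer Who is the attorney in the case? A full Semantic Web treatment would instead convert the markup to RDF and use a query language such as SPARQL, so treat this only as a stand-in for the idea.

import xml.etree.ElementTree as ET

# The tagged fragment from above, embedded here as a string.
doc = """
<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>
                <lastname>Hill</lastname></fullname>
    </plaintiff>
    <attorney>
      <fullname><firstname>Jane</firstname>
                <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship>
</legalcase>
"""

root = ET.fromstring(doc)

# "Who is the attorney in the case?" -- locate the attorney element and
# read off the parts of the full name.
attorney_name = root.find(".//attorney/fullname")
print(attorney_name.find("firstname").text,
      attorney_name.find("lastname").text)       # Jane Smith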

Though it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at some time in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com.

When we search for information, a range of linguistic structures or relationships may be relevant to our query. People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Meaning also tends to be sparse; that is, a few words and patterns occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

  • Generation – converting information in a database into natural language.
  • Understanding – converting natural language into a machine-readable form.
  • Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
  • Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
  • Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
  • Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

A range of techniques can be applied to analyse the linguistic data obtained from legal texts; each technique has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material in order to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are something of a black box: we cannot readily inspect the rules that are learned or reuse them elsewhere.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

  • Sentence segmentation – dividing text into sentences.
  • Tokenisation – words identified by spaces between them.
  • Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
  • Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
  • Named entity recognition – the entities in the text such as organisations, people, and places.
  • Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.
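
To make the cascade idea concrete, here is a minimal stand-in written in Python. It is emphatically not GATE (whose components are configured and combined rather than hand-coded like this); the component functions and the tiny gazetteer are illustrative assumptions, meant only to show how each stage consumes the text plus the annotations produced so far and adds its own.

import re

def sentence_segmenter(text, annotations):
    # Naive: split on sentence-final punctuation followed by whitespace.
    annotations["sentences"] = re.split(r"(?<=[.!?])\s+", text.strip())

def tokeniser(text, annotations):
    # Naive: words and punctuation marks of each sentence.
    annotations["tokens"] = [re.findall(r"\w+|\S", s)
                             for s in annotations["sentences"]]

def gazetteer_lookup(text, annotations):
    # Look tokens up in a small hand-made list (a "gazetteer") of role words.
    roles = {"plaintiff", "defendant", "attorney", "judge"}
    annotations["role_mentions"] = [t for sentence in annotations["tokens"]
                                    for t in sentence if t.lower() in roles]

PIPELINE = [sentence_segmenter, tokeniser, gazetteer_lookup]

def run_pipeline(text):
    annotations = {}
    for component in PIPELINE:      # each component feeds into the next
        component(text, annotations)
    return annotations

result = run_pipeline("Harris Hill, plaintiff. Jane Smith, attorney for the plaintiff.")
print(result["role_mentions"])      # ['plaintiff', 'attorney', 'plaintiff']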

The system can also be tailored to annotate more specific elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals, such as:

  • Case citation.
  • Names of parties.
  • Roles of parties, meaning plaintiff or defendant.
  • Type of court.
  • Names of judges.
  • Names of attorneys.
  • Roles of attorneys, meaning the side they represent.
  • Final decision.
  • Cases cited.
  • Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)
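
The list-and-rule flavour of that annotation step can be suggested with a few illustrative patterns, sketched below as plain Python regular expressions; these are deliberately simplistic stand-ins, not the actual rules used in the study.

import re

# Illustrative patterns for a few of the fields listed above.
PATTERNS = {
    "case_citation": re.compile(r"\b\d+\s+[A-Z][\w.]*\s+\d+\b"),      # e.g., 410 U.S. 113
    "judge":         re.compile(r"\b(?:Judge|Justice)\s+[A-Z][a-z]+\b"),
    "party_role":    re.compile(r"\b(?:plaintiff|defendant|appellant|respondent)\b",
                                re.IGNORECASE),
}

def annotate(text):
    """Return (label, matched_text, start, end) annotations, in text order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.group(0), match.start(), match.end()))
    return sorted(spans, key=lambda span: span[2])

sample = "Before Judge Brown. The plaintiff relies on 410 U.S. 113."
for annotation in annotate(sample):
    print(annotation)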

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above:

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.