Legal text mining » VoxPopuLII

Is it good enough for the law to be written for lawyers?

Access to justice, Citizen participation in lawmaking, Collaboration in legal informatics, Crowdsourcing and free access to law, Crowdsourcing and legal information systems, Disruptive legal technology, Interdisciplinary collaboration in legal informatics, Legal publishing, Legal text mining, Legal text processing, Linked Data and law Comments Off

Sep 182014

c.c. BY-SA 3.0. wikipedia.org

If you think that law isn’t written for lawyers, try reading some. It can even start looking normal after a while (say about the length of time it takes to get through law degree). But research on the main street impact of legal language suggests that for most people, the law is likely to be either incomprehensible or very hard to read.

This problem is a focus of a research project which a team of us at ANU and Cornell LII have been addressing over the past months (Eric McCreath (Australian National University, Research School of Computer Science), Wayne Weibel (Cornell University Law School, Legal Information Insitute), Nic Ceynowa (LII), Sara Frug (LII), Tom Bruce (LII) and myself (ANU)). With the generous help of thousands of LII users, as part of a citizen science project, we’ve been collecting data on the readability of law as well as demographic data about the users of law.

If you are concerned about access to law, and many are, the current situation is not really good enough. Whether you tend to ‘human rights’, ‘democratic values’, ‘economic efficiency’, ‘rule of law’ or are just wanting to make sure your hapless minions follow your every command, you’ll be able to think of a good reason why the law should be more accessible (readable) than it is.

Of course the problem has been around for a very long time, and plain language is a standing goal of many legislative drafting offices. Reform efforts have been underway since the middle ages. Certainly legal language has improved considerably, particularly as a result of 19th and 20th century reforms with that goal in mind. Still, the law can’t be said to be readily accessible to the general public, in the sense of its readability.

What has changed that makes the problem more urgent today is that the general public can now at least get to the law. That’s the revolution that’s been achieved by online publishers of the law, including the Free Access to Law Movement and official and commercial law publishers. As the UK’s First Parliamentary Counsel observed last year:

Legislation affects us all. And increasingly, legislation is being searched for, read and used by a broad range of people. It is no longer confined to professional libraries; websites like legislation.gov.uk have made it accessible to everyone. So the digital age has made it easier for people to find the law of the land; but once they have found it, they may be baffled. The law is regarded by its users as intricate and intimidating.

In 2010, the Plain Writing Act was adopted by the US Congress with the aim of improving government writing. Sad to say, the Act itself is no model of plain language. Section, sub-section and paragraph roll on, line after line, provision after convoluted provision. In substance they say not much more than: write clearly so that the public can understand and use what you write. Didn’t anyone see the irony? Then again, reality check, most legislation is never read by the people who vote to make it law. Just to make sure the drawbridge was well and truly up, if you read through to the fine print at the end there is an important rider. What happens if no one can understand what the law is supposed to mean? Well, nothing a judge can do about it. Great aspiration, but …

A sea change could be on the way, though. The Good Law initiative is one great example of efforts to address the complexity and readability of legislation. What is significant is that how we are thinking about legal rules is changing. Official publishers of the law are beginning to talk about the law as if it’s data. The UK National Archives Office has even published an API — Application Programmers Interface (basically a ‘how to’ for developers who want to use the “data”).So now we’re thinking of law as data. And we’re going to unleash computer scientists on it, to do whatever their imaginations can come up with. Bommarito and Katz‘ work on the legal code as a mathematical network is a great example of the virtually infinite possibilities.

Our own research uses the potential of computational technologies in another way. Online legal sites are not just ‘documents’. They are places where people are actively interacting with the law. We used crowd-sourcing to engage with this audience, asking them to rate law on readability characteristics as well as exploring the demographics of who uses the law. Our aim was to develop a labelled dataset that could be used as input to machine learning. “Labelled data” is machine learning gold — hard to get, but if you can you get it, you can use it to make predictions about what human judges would say. In our case we are trying to predict whether a legal sentence will be readable or not.

In the process we learned quite a bit about the audience using the law, and about which law they use. Scouring the Google Analytics data, it became obvious that the law is not equally read. We may all be equal before the law, but the law is not equal before us. Just 37 sections of the US Code account for almost 10% of the page visits to US code pages (there are about 65,000). So a tiny fraction of the Code is being read all the time. On the other hand there are huge swathes of the Code that hardly ever see the light of a back illuminated screen. This is not trivial news. Computer scientists love lists. Prioritised lists get their own special lectures for first year CS students — and here we have a prioritized list. You want to know what law is at the top of your priority list — the users will tell you. If you’re concerned with cleaning up the law code or making it easier to understand, there’s useful stuff here.

Ranking of sections by frequency of readership (on a logarithmic scale)

It will be no surprise that we found that law is harder for just about every part of the community than legal professionals. What was surprising was that legal professionals (including law students), turn out to be a minority of those interested enough to respond, on the LII site at least.

These were just a few of the demographic insights we were able to draw.

On the machine learning front, we were able to show that machine learning can improve on traditional readability metrics in predicting language difficulty (they’ve long been regarded as suspect in application to legal texts anyway). That said, it’s early days and we would like to extend the research we have done so far. There is a lot of potential for future research applying computational techniques to the readability of law. A co-authored publication further describing the research introduced in this article will be presented at this year’s Law Via the Internet Conference being held at the end of September.

But while we’re thinking about it, there are other ways to think about `access’ to law. What if instead of writing the law, it was visualized? You know — like in pictures. Before you storm off in contempt, note this: research is validating that pictures can improve user experience — for example in the contract space, where what your clients think of your contract can impact on your bottom line.

It’s radical enough unleashing computer scientists on legal rules. What might the law look like if we try thinking like designers? ‘User experience’ of legal rules? That one didn’t come up in law school. We’re in some surreally different world at this point. Designers create artefacts for people to use which are optimised for functionality, beauty and other characteristics –- not things that are meant to tell people what to do. ‘User experience’ is their kind of thinking.

As readers of Vox Pop will know, the idea of legal design is starting to get traction. Helena Haapio and Stefania Passera’s great article on legal design covers some of the field. An article they jointly published last year points out some of the benefits of visualization. Earlier this year, we worked on a joint paper exploring the feasibility of automating legal visualization. We were able to demonstrate the automation of visualization of clauses, such as a contract term clause, a liquidated damages clause or a payment clause. Visit our proof of concept site, where you can play with visualizing different options.

OK. So perhaps some of the above reads like we’re on the up-slope of the hype curve. But that of course is the fun. For those of us who’ve spent many years in the law, looking at the law from a different professional paradigm can help us see things that didn’t stand out before. It certainly enjoyable and brings a breath of fresh air to the law.

Michael Curtotti is undertaking a PhD in the Research School of Computer Science at the Australian National University. His co-authored publications on legal informatics include: A Right to Access Implies a Right to Know: An Open Online Platform for Readability Research; Enhancing the Visualization of Law and A corpus of Australian contract language: description, profiling and analysis. He holds a Bachelor of Laws and a Bachelor of Commerce from the University of New South Wales, and a Masters of International Law from the Australian National University. He works part-time as a legal adviser to the ANU Students Association and the ANU Post-graduate & research students Association, providing free legal services to ANU students.

—————————

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Legal Prosumers: How Can Government Leverage User-Generated Content?

Crowdsourcing and legal information systems, Legal knowledge representation, Legal ontologies, Legal text mining, Legal text processing, natural language processing, Semantic Web and law, User-generated content and legal information 1 Response »

Nov 172011

Prosumption: shifting the barriers between information producers and consumers

One of the major revolutions of the Internet era has been the shifting of the frontiers between producers and consumers [1]. Prosumption refers to the emergence of a new category of actors who not only consume but also contribute to content creation and sharing. Under the umbrella of Web 2.0, many sites indeed enable users to share multimedia content, data, experiences [2], views and opinions on different issues, and even to act cooperatively to solve global problems [3]. Web 2.0 has become a fertile terrain for the proliferation of valuable user data enabling user profiling, opinion mining, trend and crisis detection, and collective problem solving [4].

The private sector has long understood the potentialities of user data and has used them for analysing customer preferences and satisfaction, for finding sales opportunities, for developing marketing strategies, and as a driver for innovation. Recently, corporations have relied on Web platforms for gathering new ideas from clients on the improvement or the development of new products and services (see for instance Dell’s Ideastorm; salesforce’s IdeaExchange; and My Starbucks Idea). Similarly, Lego’s Mindstorms encourages users to share online their projects on the creation of robots, by which the design becomes public knowledge and can be freely reused by Lego (and anyone else), as indicated by the Terms of Service. Furthermore, companies have been recently mining social network data to foresee future action of the Occupy Wall Street movement.

Even scientists have caught up and adopted collaborative methods that enable the participation of laymen in scientific projects [5].

Now, how far has government gone in taking up this opportunity?

Some recent initiatives indicate that the public sector is aware of the potential of the “wisdom of crowds.” In the domain of public health, MedWatcher is a mobile application that allows the general public to submit information about any experienced drug side effects directly to the US Food and Drug Administration. In other cases, governments have asked for general input and ideas from citizens, such as the brainstorming session organized by Obama government, the wiki launched by the New Zealand Police to get suggestions from citizens for the drafting of a new policing act to be presented to the parliament, or the Website of the Department of Transport and Main Roads of the State of Queensland, which encourages citizens to share their stories related to road tragedies.

Even in so crucial a task as the drafting of a constitution, government has relied on citizens’ input through crowdsourcing [6]. And more recently several other initiatives have fostered crowdsourcing for constitutional reform in Morocco and in Egypt .

It is thus undeniable that we are witnessing an accelerated redefinition of the frontiers between experts and non-experts, scientists and non-scientists, doctors and patients, public officers and citizens, professional journalists and street reporters. The ‘Net has provided the infrastructure and the platforms for enabling collaborative work. Network connection is hardly a problem anymore. The problem is data analysis.

In other words: how to make sense of the flood of data produced and distributed by heterogeneous users? And more importantly, how to make sense of user-generated data in the light of more institutional sets of data (e.g., scientific, medical, legal)? The efficient use of crowdsourced data in public decision making requires building an informational flow between user experiences and institutional datasets.

Similarly, enhancing user access to public data has to do with matching user case descriptions with institutional data repositories (“What are my rights and obligations in this case?”; “Which public office can help me”?; “What is the delay in the resolution of my case?”; “How many cases like mine have there been in this area in the last month?”).

From the point of view of data processing, we are clearly facing a problem of semantic mapping and data structuring. The challenge is thus to overcome the flood of isolated information while avoiding excessive management costs. There is still a long way to go before tools for content aggregation and semantic mapping are generally available. This is why private firms and governments still mostly rely on the manual processing of user input.

The new producers of legally relevant content: a taxonomy

Before digging deeper into the challenges of efficiently managing crowdsourced data, let us take a closer look at the types of user-generated data flowing through the Internet that have some kind of legal or institutional flavour.

One type of user data emerges spontaneously from citizens’ online activity, and can take the form of:

citizens’ forums

platforms gathering open public data and comments over them (see for instance data-publica )

legal expert blogs (blawgs)

or the journalistic coverage of the legal system.

User data can as well be prompted by institutions as a result of participatory governance initiatives, such as:

crowdsourcing (targeting a specific issue or proposal by government as an open brainstorming session)

comments and questions addressed by citizens to institutions through institutional Websites or through e-mail contact.

This variety of media supports and knowledge producers gives rise to a plurality of textual genres, semantically rich but difficult to manage given their heterogeneity and quick evolution.

Managing crowdsourcing

The goal of crowdsourcing in an institutional context is to extract and aggregate content relevant for the management of public issues and for public decision making. Knowledge management strategies vary considerably depending on the ways in which user data have been generated. We can think of three possible strategies for managing the flood of user data:

Pre-structuring: prompting the citizen narrative in a strategic way

A possible solution is to elicit user input in a structured way; that is to say, to impose some constraints on user input. This is the solution adopted by IdeaScale, a software application that was used by the Open Government Dialogue initiative of the Obama Administration. In IdeaScale, users are asked to check whether their idea has already been covered by other users, and alternatively to add a new idea. They are also invited to vote for the best ideas, so that it is the community itself that rates and thus indirectly filters the users’ input.

The MIT Deliberatorium, a technology aimed at supporting large-scale online deliberation, follows a similar strategy. Users are expected to follow a series of rules to enable the correct creation of a knowledge map of the discussion. Each post should be limited to a single idea, it should not be redundant, and it should be linked to a suitable part of the knowledge map. Furthermore, posts are validated by moderators, who should ensure that new posts follow the rules of the system. Other systems that implement the same idea are featurelist and Debategraph [7].

While these systems enhance the creation and visualization of structured argument maps and promote community engagement through rating systems, they present a series of limitations. The most important of these is the fact that human intervention is needed to manually check the correct structure of the posts. Semantic technologies can play an important role in bridging this gap.

Semantic analysis through ontologies and terminologies

Ontology-driven analysis of user-generated text implies finding a way to bridge Semantic Web data structures, such as formal ontologies expressed in RDF or OWL, with unstructured implicit ontologies emerging from user-generated content. Sometimes these emergent lightweight ontologies take the form of unstructured lists of terms used for tagging online content by users. Accordingly, some works have dealt with this issue, especially in the field of social tagging of Web resources in online communities. More concretely, different works have proposed models for making compatible the so-called top-down metadata structures (ontologies) with bottom-up tagging mechanisms (folksonomies).

The possibilities range from transforming folksonomies into lightly formalized semantic resources (Lux and Dsinger, 2007; Mika, 2005) to mapping folksonomy tags to the concepts and the instances of available formal ontologies (Specia and Motta, 2007; Passant, 2007). As the basis of these works we find the notion of emergent semantics (Mika, 2005), which questions the autonomy of engineered ontologies and emphasizes the value of meaning emerging from distributed communities working collaboratively through the Web.

We have recently worked on several case studies in which we have proposed a mapping between legal and lay terminologies. We followed the approach proposed by Passant (2007) and enriched the available ontologies with the terminology appearing in lay corpora. For this purpose, OWL classes were complemented with a has_lexicalization property linking them to lay terms.

The first case study that we conducted belongs to the domain of consumer justice, and was framed in the ONTOMEDIA project. We proposed to reuse the available Mediation-Core Ontology (MCO) and Consumer Mediation Ontology (COM) as anchors to legal, institutional, and expert knowledge, and therefore as entry points for the queries posed by consumers in common-sense language.

The user corpus contained around 10,000 consumer questions and 20,000 complaints addressed from 2007 to 2010 to the Catalan Consumer Agency. We applied a traditional terminology extraction methodology to identify candidate terms, which were subsequently validated by legal experts. We then manually mapped the lay terms to the ontological classes. The relations used for mapping lay terms with ontological classes are mostly has_lexicalisation and has_instance.

A second case study in the domain of consumer law was carried out with Italian corpora. In this case domain terminology was extracted from a normative corpus (the Code of Italian Consumer law) and from a lay corpus (around 4000 consumers’ questions).

In order to further explore the particularities of each corpus respecting the semantic coverage of the domain, terms were gathered together into a common taxonomic structure [8]. This task was performed with the aid of domain experts. When confronted with the two lists of terms, both laypersons and technical experts would link most of the validated lay terms to the technical list of terms through one of the following relations:

Subclass: the lay term denotes a particular type of legal concept. This is the most frequent case. For instance, in the class objects, telefono cellulare (cell phone) and linea telefonica (phone line) are subclasses of the legal terms prodotto (product) and servizio (service), respectively. Similarly, in the class actors agente immobiliare (estate agent) can be seen as subclass of venditore (seller). In other cases, the linguistic structures extracted from the consumers’ corpus denote conflictual situations in which the obligations have not been fulfilled by the seller and therefore the consumer is entitled to certain rights, such as diritto alla sostituzione (entitlement to a replacement). These types of phrases are subclasses of more general legal concepts such as consumer right.

Instance: the lay term denotes a concrete instance of a legal concept. In some cases, terms extracted from the consumer corpus are named entities that denote particular individuals, such as Vodafone, an instance of a domain actor, a seller.

Equivalent: a legal term is used in lay discourse. For instance, contratto (contract) or diritto di recessione (withdrawal right).

Lexicalisation: the lay term is a lexical variant of the legal concept. This is the case for instance of negoziante, used instead of the legal term venditore (seller) or professionista (professional).

The distribution of normative and lay terms per taxonomic level shows that, whereas normative terms populate mostly the upper levels of the taxonomy [9], deeper levels in the hierarchy are almost exclusively represented by lay terms.

Term distribution per taxonomic level

The result of this type of approach is a set of terminological-ontological resources that provide some insights on the nature of laypersons’ cognition of the law, such as the fact that citizens’ domain knowledge is mainly factual and therefore populates deeper levels of the taxonomy. Moreover, such resources can be used for the further processing of user input. However, this strategy presents some limitations as well. First, it is mainly driven by domain conceptual systems and, in a way, they might limit the potentialities of user-generated corpora. Second, they are not necessarily scalable. In other words, these terminological-ontological resources have to be rebuilt for each legal subdomain (such as consumer law, private law, or criminal law), and it is thus difficult to foresee mechanisms for performing an automated mapping between lay terms and legal terms.

Beyond domain ontologies: information extraction approaches

One of the most important limitations of ontology-driven approaches is the lack of scalability. In order to overcome this problem, a possible strategy is to rely on informational structures that occur generally in user-generated content. These informational structures go beyond domain conceptual models and identify mostly discursive, emotional, or event structures.

Discursive structures formalise the way users typically describe a legal case. It is possible to identify stereotypical situations appearing in the description of legal cases by citizens (i.e., the nature of the problem; the conflict resolution strategies, etc.). The core of those situations is usually predicates, so it is possible to formalize them as frame structures containing different frame elements. We followed such an approach for the mapping of the Spanish corpus of consumers’ questions to the classes of the domain ontology (Fernández-Barrera and Casanovas, 2011). And the same technique was applied for mapping a set of citizens’ complaints in the domain of acoustic nuisances to a legal domain ontology (Bourcier and Fernández-Barrera, 2011). By describing general structures of citizen description of legal cases we ensure scalability.

Emotional structures are extracted by current algorithms for opinion- and sentiment mining. User data in the legal domain often contain an important number of subjective elements (especially in the case of complaints and feedback on public services) that could be effectively mined and used in public decision making.

Finally, event structures, which have been deeply explored so far, could be useful for information extraction from user complaints and feedback, or for automatic classification into specific types of queries according to the described situation.

Crowdsourcing in e-government: next steps (and precautions?)

Legal prosumers’ input currently outstrips the capacity of government for extracting meaningful content in a cost-efficient way. Some developments are under way, among which are argument-mapping technologies and semantic matching between legal and lay corpora. The scalability of these methodologies is the main obstacle to overcome, in order to enable the matching of user data with open public data in several domains.

However, as technologies for the extraction of meaningful content from user-generated data develop and are used in public-decision making, a series of issues will have to be dealt with. For instance, should the system developer bear responsibility for the erroneous or biased analysis of data? Ethical questions arise as well: May governments legitimately analyse any type of user-generated content? Content-analysis systems might be used for trend- and crisis detection; but what if they are also used for restricting freedoms?

The “wisdom of crowds” can certainly be valuable in public decision making, but the fact that citizens’ online behaviour can be observed and analysed by governments without citizens’ acknowledgement poses serious ethical issues.

Thus, technical development in this domain will have to be coupled with the definition of ethical guidelines and standards, maybe in the form of a system of quality labels for content-analysis systems.

[Editor’s Note: For earlier VoxPopuLII commentary on the creation of legal ontologies, see Núria Casellas, Semantic Enhancement of Legal Information… Are We Up for the Challenge? For earlier VoxPopuLII commentary on Natural Language Processing and legal Semantic Web technology, see Adam Wyner, Weaving the Legal Semantic Web with Natural Language Processing. For earlier VoxPopuLII posts on user-generated content, crowdsourcing, and legal information, see Matt Baca and Olin Parker, Collaborative, Open Democracy with LexPop; Olivier Charbonneau, Collaboration and Open Access to Law; Nick Holmes, Accessible Law; and Staffan Malmgren, Crowdsourcing Legal Commentary.]

[1] The idea of prosumption existed actually long before the Internet, as highlighted by Ritzer and Jurgenson (2010): the consumer of a fast food restaurant is to some extent as well the producer of the meal since he is expected to be his own waiter, and so is the driver who pumps his own gasoline at the filling station.

[2] The experience project enables registered users to share life experiences, and it contained around 7 million stories as of January 2011: http://www.experienceproject.com/index.php.

[3] For instance, the United Nations Volunteers Online platform (http://www.onlinevolunteering.org/en/vol/index.html) helps volunteers to cooperate virtually with non-governmental organizations and other volunteers around the world.

[4] See for instance the experiment run by mathematician Gowers on his blog: he posted a problem and asked a large number of mathematicians to work collaboratively to solve it. They eventually succeeded faster than if they had worked in isolation: http://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/.

[5] The Galaxy Zoo project asks volunteers to classify images of galaxies according to their shapes: http://www.galaxyzoo.org/how_to_take_part. See as well Cornell’s projects Nestwatch (http://watch.birds.cornell.edu/nest/home/index) and FeederWatch (http://www.birds.cornell.edu/pfw/Overview/whatispfw.htm), which invite people to introduce their observation data into a Website platform.

[6] http://www.participedia.net/wiki/Icelandic_Constitutional_Council_2011.

[7] See the description of Debategraph in Marta Poblet’s post, Argument mapping: visualizing large-scale deliberations (http://serendipolis.wordpress.com/2011/10/01/argument-mapping-visualizing-large-scale-deliberations-3/).

[8] Terms have been organised in the form of a tree having as root nodes nine semantic classes previously identified. Terms have been added as branches and sub-branches, depending on their degree of abstraction.

[9] It should be noted that legal terms are mostly situated at the second level of the hierarchy rather than the first one. This is natural if we take into account the nature of the normative corpus (the Italian consumer code), which contains mostly domain specific concepts (for instance, withdrawal right) instead of general legal abstract categories (such as right and obligation).

REFERENCES

Bourcier, D., and Fernández-Barrera, M. (2011). A frame-based representation of citizen’s queries for the Web 2.0. A case study on noise nuisances. E-challenges conference, Florence 2011.

Fernández-Barrera, M., and Casanovas, P. (2011). From user needs to expert knowledge: Mapping laymen queries with ontologies in the domain of consumer mediation. AICOL Workshop, Frankfurt 2011.

Lux, M., and Dsinger, G. (2007). From folksonomies to ontologies: Employing wisdom of the crowds to serve learning purposes. International Journal of Knowledge and Learning (IJKL), 3(4/5): 515-528.

Mika, P. (2005). Ontologies are us: A unified model of social networks and semantics. In Proc. of Int. Semantic Web Conf., volume 3729 of LNCS, pp. 522-536. Springer.

Passant, A. (2007). Using ontologies to strengthen folksonomies and enrich information retrieval in Weblogs. In Int. Conf. on Weblogs and Social Media, 2007.

Poblet, M., Casellas, N., Torralba, S., and Casanovas, P. (2009). Modeling expert knowledge in the mediation domain: A Mediation Core Ontology, in N. Casellas et al. (Eds.), LOAIT- 2009. 3rd Workshop on Legal Ontologies and Artificial Intelligence Techniques joint with 2^nd Workshop on Semantic Processing of Legal Texts. Barcelona, IDT Series n. 2.

Ritzer, G., and Jurgenson, N. (2010). Production, consumption, prosumption: The nature of capitalism in the age of the digital “prosumer.” In Journal of Consumer Culture 10: 13-36.

Specia, L., and Motta, E. (2007). Integrating folksonomies with the Semantic Web. Proc. Euro. Semantic Web Conf., 2007.

Meritxell Fernández-Barrera is a researcher at the Cersa (Centre d’Études et de Recherches de Sciences Administratives et Politiques) -CNRS, Université Paris 2-. She works on the application of natural language processing (NLP) to legal discourse and legal communication, and on the potentialities of Web 2.0 for participatory democracy.

VoxPopuLII is edited by Judith Pratt. Editor-in-Chief is Robert Richards, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer in the Cornell LII Lawyer Directory.

Weaving the Legal Semantic Web with Natural Language Processing

Annotation of legal texts, information retrieval, Legal knowledge representation, Legal semantic web, Legal text mining, Legal text processing, Legal XML, natural language processing, Semantic annotation of legal texts, Semantic Web and law 4 Responses »

May 172010

The World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these are large, complex, and heterogeneous topics of research; in this short post, we can only hope to touch on a fragment and that heavily biased to our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appear to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


<legalcase>...</legalcase>,
<firstname>...</firstname>,
<lastname>...</lastname>,
<fullname>...</fullname>,
<plaintiff>...</plaintiff>,
<attorney>...</attorney>, 
<legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:


<fullname><firstname>...</firstname>, 
<lastname>...</lastname></fullname>

is acceptable, but


<fullname><firstname>...<lastname>
</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.

With these points in mind, we could represent our fragment as:


<legalcase>
  <legalrelationship>
    <plaintiff>
      <fullname><firstname>Harris</firstname>,
           <lastname>Hill</lastname></fullname>
    </plaintiff>,
    <attorney>
      <fullname><firstname>Jane</firstname>,
           <lastname>Smith</lastname></fullname>
    </attorney>
  </legalrelationship
</legalcase>

We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.

Tower of Babel Though it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at sometime in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com .

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as:

grammatical constructions (passive or active sentence forms, quotation, reference to other individuals, and so on),
grammatical relations among terms (e.g., whether an individual is the agent or target of some action),
ontological relations (e.g., classes and subclasses of experts, or the relationships among courts in the judicial hierarchy),
relationships among elements (e.g., who works for what organization), or
high-level patterns such as legal arguments (e.g., expert testimony) and fact patterns (e.g., culpable intent).

People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Finally, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

Generation – converting information in a database into natural language.
Understanding – converting natural language into a machine-readable form.
Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily (here) interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There are a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are somewhat of a black box in that we cannot really know the rules that are learned and use them further.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

Sentence segmentation – dividing text into sentences.
Tokenisation – words identified by spaces between them.
Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
Named entity recognition – the entities in the text such as organisations, people, and places.
Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

Case citation.
Names of parties.
Roles of parties, meaning plaintiff or defendant.
Type of court.
Names of judges.
Names of attorneys.
Roles of attorneys, meaning the side they represent.
Final decision.
Cases cited.
Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above (to see a larger version of the image, right click on the image, then click on “View Image” or a similar phrase; when finished viewing the image, use the browser’s back button to return to the text):

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.

Conclusion

In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Dr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

Suffusion theme by Sayontan Sinha

VoxPopuLII

Is it good enough for the law to be written for lawyers?

Legal Prosumers: How Can Government Leverage User-Generated Content?

Weaving the Legal Semantic Web with Natural Language Processing

Recent Posts

VoxPop people and posts

Subscribe to VoxPopuLII

Blogroll

VoxPopuLII

Is it good enough for the law to be written for lawyers?

Legal Prosumers: How Can Government Leverage User-Generated Content?

Weaving the Legal Semantic Web with Natural Language Processing

Recent Posts

VoxPop people and posts

Subscribe to VoxPopuLII

Blogroll

Tags