skip navigation

In this post, I will describe how natural language processing can help in creating computer systems dealing with the law.

Law Books EmileA lot of computer systems are being designed to help users deal with legal texts — accessing, understanding, or applying them. [Editor’s Note: Michael Poulshock’s Jureeka is an example of a system that automates the application of legal texts.] Other systems — such as DALOS — are about creating legal texts, providing support for the writers, or simulating the effects of a text. Such systems are based on something more than “just” the legal text: there is XML mark-up, an OWL ontology, or a representation of the rules in SWRL or some programming language. This means that any piece of legislation that you want to use on your computer system needs to be translated into this computer representation.

We try to support this translation using natural language processing, so that (part of) the translation can be done by a computer. This automation should have a number of advantages. First of all, computers are cheaper than human experts, and automating the process should reduce the amount of resources needed for this task. Second, the models that are produced by automated processes are more consistent; human experts may treat two similar sentences differently, but a computer program will always behave the same. Finally, an approach employing structures ensures that there is a clear mapping between the elements of the computer model and the original text.

Natural Language Processing isn’t perfect yet: computers cannot understand human language. However, legal text is quite structured, and offers a lot more handholds for automated translation than, say, a novel.

Document Structure

The first step that we will have to undertake is to determine the structure of the document. Online services like and can make it easier to access legal documents because they can point you to the right part of the document (such as a chapter, paragraph, sentence, etc.). In most law texts, the structure has been made explicit using clear headings, like: Chapter 1 or Chapter 1. General Provisions. So, in order to detect structure, we need to detect these headings. This means we’ll need to search the document for lines starting with Chapter, followed by some designation (which we refer to as an index), and perhaps followed by some text – say, the title of the chapter. The index can Numbersbe a lot of things: Arabic numbers (1, 2, 3, …), Roman numbers (I, II, III, …) or letters (a, b, c, …). Sometimes the index is an ordinal appearing before the chapter label: First chapter. It may even be a combination of several numbers and letters (5.2a). This is not a great problem, as we can more or less assume that whatever follows the word Chapter is the index.

The main problem with this approach is that there are also regular sentences that start with the word Chapter, and we need to separate those out. To do so, we can use some heuristics: A title will not end with a full stop (.); a heading will always start on a new line; etc.

Queen BeatrixThis procedure to find the headings for chapters is repeated to find headings for sections, subsections, etc. Also, some sections (like numbered paragraphs or list items) will not have a full heading, but just a number, which we also need to recognise. Finally, some sections don’t have a heading but can be recognised because they start with a fixed language pattern. For example, a preamble in a (recent) Dutch Law — such as this — will start with: We, Beatrix, Queen of the Netherlands, Princess of Orange-Nassau, etc. etc. etc.

This procedure assumes that the input for the process is just text. Many documents will contain more information — such as textual markup — and headings may be more easily identified because they are marked as bold text, or even as headings. So, in situations where the input is made up of documents that are marked-up in a consistent way, it may be easier to recognise the patterns by taking layout into account in addition to text.

To actually find the patterns, we can use existing toolkits like GATE. After the patterns have been found, and the structure has been recognised, we can store it using a format such as MetaLex.


The second step is to detect the references from a portion of a law text to other portions of that text, or from a law text to other texts. References, like headings, follow a pattern. The simplest patterns are rather similar to headings; the text chapter 13 is probably a reference, unless it is part of a heading. ReferencesJust like headings, basic references consist of a label (section, chapter, article) and an index (13, 13.2.1, XIII,  m). And, just as with headings, we can find the references by looking for these patterns in the text.

However, this is only the simplest form of references. Besides references to a specific section, such as chapter 13, there are of course also references to a complete law. Some of these references follow a pattern as well, such as the law of October 1st, 2007. Most laws are cited by means of a citation title, though, such as the Railroad Act. Such titles can contain all kinds of words, and they don’t follow a strict pattern. Thus, such references cannot be detected using patterns. Instead, we use a list containing all (citation) titles to detect such references.

Other, more complex references contain multiple references in one statement, such as articles 13 and 14, or multiple levels: article 13, item e, of the Railroad Act, or even more complex combinations of the two: articles 13, item e, 14, item f, 15 and 16, items a and b, of the Railroad Act. Though more complex than the simple combination of label and index, these references still follow clear (sometimes recurring) patterns, and can be found in the text by searching for such patterns.

At the Leibniz Center for Law, we’ve created a parser based on these patterns, which had an accuracy of over 95%. For each reference found, we can construct some standardised name, and store it. With this technology, not only can we add hyperlinks to documents; we can also search for documents that refer to some specific document.


Now that we’ve got the structure and links in place, it’s time to start with the actual meaning of the text. Rather than tackling the entire text as a whole, we’ve selected sentences as the basic building blocks, and we attempt to create computer models for individual sentences first. Later, we can integrate those individual models to a complete model.

As a first step in creating the models, we start by assigning a broad meaning, or classification, to each sentence. Does the sentence give a definition for a concept, describe an obligation, or make a change in another law? In total, we distinguish fourteen different classes of sentences that appear in Dutch law texts. The next step in our automated approach is to assign a class to each sentence automatically.

To do so, we turn once again to language patterns. Legal language is rather strict, and legislative drafters don’t vary their language a lot — in a novel, variation may make for a more appealing text, but in a law, variation invites ambiguity. In fact, there are official Guidelines for Legislative Drafting that (among other things) reduce the variety of texts used. [Editor’s Note: For example, drafters of legislation in the U.S. House of Representatives Office of the Legislative Counsel have used Donald Hirsch’s Drafting Federal Law.] This means that for each of our classes, there’s a rather limited set of language patterns used. For example, definitions will look like one of these:

Under … is understood …

This law understands under … …

There are some variations in word order, but in the end, a small set of patterns is sufficient to describe all commonly used phrases. There is only one class of sentences where we cannot define a full set of patterns: obligations. In Dutch laws, obligations are often expressed without signal words like must or is obliged to. Instead, the obligations are presented as a fact:

No bodies are buried on a closed cemetery.

However, since the obligations are the only sentences lacking all-compassing patterns, we will assume that any sentence that does not mark a pattern is one of these obligations.

Based on the patterns found, we’ve created a classifier that attempts to sort sentences into these different classes. This classifier has an accuracy of 91%, and we expect that this can improved a bit further.

(As a side note: For classification tasks as these, a machine learning approach is often preferred; see, e.g., here. With such an approach, you provide the computer, not with patterns, but with a bunch of sample sentences. The computer will then extract its own patterns from those sentences, and use these to classify any new sentences. We’ve tried this approach as well (using the toolkit WEKA), and reached similarly accurate results.)


Having classified the sentences, we now want to create models of the sentences. In essence, this means breaking down each sentence into smaller components and defining relationships between them. In some cases, the patterns used to classify the sentence already give us sufficient information to break up the sentence. Suppose we have a sentence like:

In article 7.12, sub one, second sentence, «article 7.3b» is replaced by: article 7.3c.

We classify this sentence as a replacement because of the text is replaced by. We can then also conclude that the text between angle quotes is the text to be replaced, the text following the colon is the replacing text, and the reference preceding it (which we’ve already detected) is the location where the replacement should take place.

This works fine for sentences that are somehow “about” the law. But for sentences that deal with some other domain, such as taxes, traffic, or commerce, we cannot predict all the elements. These sentences could be about anything — and statutes are full of such sentences. For such sentences, we need to follow a generic method. The aim is to model rules as a situation or action that is allowed or not allowed, similar to the models created in the HARNESS system of the ESTRELLA project. For example, for an obligation, we assume that the sentence describes some action that must be done. warrantWe try to identify who should be doing the action, and what other elements are involved. Thus, for the sentence:

Our Minister issues a warrant to the negligent person.

we would like to extract the following information:

Action: Issue
Agent: Our Minister
Patient: Warrant
Recipient: Negligent person

(Such a table, or frame, is not the same as a computer model, but has all the elements needed to create one.)

Now, identifying these different elements of the sentence (agent, patient, recipient) is something that computer linguists have already worked on for a long time, which means we do not have to start from scratch. Instead, we can use existing parsers to do much of the work for us. For our Dutch laws, we use the Alpino parser. voxemiletree.jpgSuch a parser will create a parse tree of a sentence. In this parse tree, the sentence will be split up in parts. The parser can identify which part is the subject, the direct object, the indirect object, etc. Based on this information, we can determine the agent, patient, and recipient (so-called semantic roles). In a sentence with a verb in the active voice, the subject is the agent, the direct object is the patient, and the indirect object is the recipient. Furthermore, the parser will determine the relationship between words, such as an adjective that modifies a noun. This information, too, helps us to make more accurate models.

We start out with the output of these parsers, and then try to extract all terms that have some more significance. If we want an application to compute whether or not a situation is allowed, a word like car can be treated in a generic way, but terms like allowed and not some special attention.

To Be Continued…

We still need to refine the method for making these models, and evaluate the results. After that, the individual models will need to be merged. But even as things stand now, we think these tools will help with getting legal text from paper into your computer systems.

[Editor’s Note: For more information about this topic, please see Dr. Adam Wyner’s post, Weaving the Legal Semantic Web with Natural Language Processing.]

Emile_de_MaatEmile de Maat is a researcher at the Leibniz Center for Law (University of Amsterdam). His research focuses on the automatic extraction of metadata and meaning from legal sources.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.

CornucopiaThe World Wide Web is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated and structured in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP). Each of these are large, complex, and heterogeneous topics of research; in this short post, we can only hope to touch on a fragment and that heavily biased to our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google or indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

Binary MysteryThe underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appear to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet, there is no similar widespread system for converting the characters into higher levels of structure which correlate to our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people.Semantic Web Stack We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. (Click here for more on these points.) While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” — i.e., the layers of interrelated technologies that make up the Semantic Web — the XML level is nonetheless crucial to providing information to higher levels where ontologies (and click here for more on this) and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

Harris Hill, plaintiff.
Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised machine-readable and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where … indicates the data:


Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:


is acceptable, but

</firstname> ...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.

With these points in mind, we could represent our fragment as:


We have added structured information — the tags — to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT (click here for more on this point) so that we have an easier-to-read format.

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query Who is the attorney in the case? and the answer would be Jane Smith. Given a rule language — such as RuleML or Semantic Web Rule Language (SWRL) — which has a rule such as If someone is an attorney for a client then that client has a privileged relationship with the attorney, it might follow from this rule that the attorney could not divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.

Tower of BabelThough it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge intensive since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at sometime in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person);
(3) The individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to .

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as:

People grasp relationships between words and phrases, such that Bill exercises daily contrasts with the meaning of Bill is a couch potato, or that if it is true that Bill used a knife to kill Phil, then Bill killed Phil. Finally, meaning tends to be sparse; that is, there are a few words and patterns that occur very regularly, while most words or patterns occur relatively rarely in the corpus.

Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it gets to those which it can begin to address. Having found a solution to smaller problems, NLP can then address other problems or larger scope problems. Some of the subtopics in NLP are:

  • Generation – converting information in a database into natural language.
  • Understanding – converting natural language into a machine-readable form.
  • Information Retrieval – gathering documents which contain key words or phrases. This is essentially what is done by Google.
  • Text Summarization – summarizing (in a paragraph) the main meaning of a text or corpus.
  • Question Answering – making queries and giving answers to them, in natural language, with respect to some corpus of texts.
  • Information Extraction — identifying, annotating, and extracting information from documents for reuse, representation, or reasoning.

In this article, we are primarily (here) interested in information extraction.

NLP Approaches: Knowledge Light v. Knowledge Heavy

There are a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems. Statistical and machine-learning techniques are considered “knowledge light.” With statistical approaches, the processing presumes very little knowledge by the system (or analyst). Rather, algorithms are applied that compare and contrast large bodies of textual data, and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. (See Turney and Pantel (2010) for an overview of this area.) Machine learning approaches apply learning algorithms to annotated material to extend results to unannotated material, thus introducing more knowledge into the processing pipeline. However, the results are somewhat of a black box in that we cannot really know the rules that are learned and use them further.

With a “knowledge-heavy” approach, we know, in a sense, what we are looking for, and make this knowledge explicit in lists and rules for processing. Yet, this is labour- and knowledge-intensive. In the legal domain it is crucial to have humanly understandable explanations and justifications for the analysis of a text, which to our thinking warrants a knowledge-heavy approach.

One open source text-mining package, General Architecture for Text Engineering (GATE), consists of multiple components in a cascade or pipeline, each component automatically processing some aspect of the text, and then feeding into the next process. The underlying strategy in all the components is to find a pattern (from either a list or a previous process) which matches a rule, and then to apply the rule which annotates the text. Each component performs a particular process on the text, such as:

  • Sentence segmentation – dividing text into sentences.
  • Tokenisation – words identified by spaces between them.
  • Part-of-speech tagging – noun, verb, adjective, etc., determined by look-up and relationships among words.
  • Shallow syntactic parsing/chunking – dividing the text by noun phrase, verb phrase, subordinate clause, etc.
  • Named entity recognition – the entities in the text such as organisations, people, and places.
  • Dependency analysis – subordinate clauses, pronominal anaphora [i.e., identifying what a pronoun refers to], etc.

The system can also be used to annotate more specifically to elements of interest. In one study, we annotated legal cases from a case base (a corpus of cases) in order to identify a range of particular pieces of information that would be relevant to legal professionals such as:

  • Case citation.
  • Names of parties.
  • Roles of parties, meaning plaintiff or defendant.
  • Type of court.
  • Names of judges.
  • Names of attorneys.
  • Roles of attorneys, meaning the side they represent.
  • Final decision.
  • Cases cited.
  • Nature of the case, meaning using keywords to classify the case in terms of subject (e.g., criminal assault, intellectual property, etc.)

Applying our lists and rules to a corpus of legal cases, a sample output is as follows, where the coloured highlights are annotated as per the key on the right; the colours are a visualisation of the sorts of tags discussed above (to see a larger version of the image, right click on the image, then click on “View Image” or a similar phrase; when finished viewing the image, use the browser’s back button to return to the text):

Annotation of a Criminal Case

The approach is very flexible and appears in similar systems. (See, for example, de Maat and Winkels, Automatic Classification of Sentences in Dutch Laws (2008).) While it is labour intensive to develop and maintain such list and rule systems, with a collaborative, Web-based approach, it may be feasible to construct rich systems to annotate large domains.


In this post, we have given a very brief overview of how the Semantic Web and Natural Language Processing (NLP) apply to legal textual information to support annotation which then enables querying and inference. Of course, this is but one take on a much larger domain. In our view, it holds great promise in making legal information more transparent and available to more legal professionals. Aside from GATE, some other resources on text analytics and NLP are textbooks and lecture notes (see, e.g., Wilcock), as well as workshops (such as SPLeT and LOAIT). While applications of Natural Language Processing to legal materials are largely lab studies, the use of NLP in conjunction with Semantic Web technology to annotate legal texts is a fast-developing, results-oriented area which targets meaningful applications for legal professionals. It is well worth watching.

Adam WynerDr. Adam Zachary Wyner is a Research Fellow at the University of Leeds, Institute of Communication Studies, Centre for Digital Citizenship. He currently works on the EU-funded project IMPACT: Integrated Method for Policy Making Using Argument Modelling and Computer Assisted Text Analysis. Dr. Wyner has a Ph.D. in Linguistics (Cornell, 1994) and a Ph.D. in Computer Science (King’s College London, 2008). His computer science Ph.D. dissertation is entitled Violations and Fulfillments in the Formal Representation of Contracts. He has published in the syntax and semantics of adverbs, deontic logic, legal ontologies, and argumentation theory with special reference to law. He is workshop co-chair of SPLeT 2010: Workshop on Semantic Processing of Legal Texts, to be held 23 May 2010 in Malta. He writes about his research at his blog, Language, Logic, Law, Software.

VoxPopuLII is edited by Judith Pratt. Editor in chief is Robert Richards.