{"id":222,"date":"2010-05-17T14:27:02","date_gmt":"2010-05-17T19:27:02","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/2010\/05\/17\/weaving-the-legal-semantic-web-with-natural-language-processing\/"},"modified":"2010-07-17T10:36:25","modified_gmt":"2010-07-17T15:36:25","slug":"weaving-the-legal-semantic-web-with-natural-language-processing","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2010\/05\/17\/weaving-the-legal-semantic-web-with-natural-language-processing\/","title":{"rendered":"Weaving the Legal Semantic Web with Natural Language Processing"},"content":{"rendered":"

\"Cornucopia\"The World Wide Web<\/a> is a virtual cornucopia of legal information bearing on all manner of topics and in a spectrum of formats, much of it textual. However, to make use of this storehouse of textual information, it must be annotated<\/a><\/em> and structured<\/a><\/em> in such a way as to be meaningful to people and processable by computers. One of the visions of the Semantic Web<\/a> has been to enrich information on the Web with annotation and structure. Yet, given that text is in a natural language<\/a> (e.g., English, German, Japanese, etc.), which people can understand but machines cannot, some automated processing of the text itself is needed before further processing can be applied. In this article, we discuss one approach to legal information on the World Wide Web, the Semantic Web, and Natural Language Processing (NLP)<\/a>. Each of these are large, complex, and heterogeneous topics of research; in this short post, we can only hope to touch on a fragment and that heavily biased to our interests and knowledge. Other important approaches are mentioned at the end of the post. We give small working examples of legal textual input, the Semantic Web output, and how NLP can be used to process the input into the output.<\/p>\n

Legal Information on the Web

For clients, legal professionals, and public administrators, the Web provides an unprecedented opportunity to search for, find, and reason with legal information such as case law, legislation, legal opinions, journal articles, and material relevant to discovery in a court procedure. With a search tool such as Google, or the indexed searches made available by Lexis-Nexis, Westlaw, or the World Legal Information Institute, the legal researcher can input key words into a search and get in return a (usually long) list of documents which contain, or are indexed by, those key words.

As useful as such searches are, they are also highly limited to the particular words or indexation provided, for the legal researcher must still manually examine the documents to find the substantive information. Moreover, current legal search mechanisms do not support more meaningful searches, such as for properties or relationships, where, for example, a legal researcher searches for cases in which a company has the property of being in the role of plaintiff, or where a lawyer is in the relationship of representing a client. Nor, by the same token, can searches be made with respect to more general (or more specific) concepts, such as “all cases in which a company has any role,” some particular fact pattern, legislation bearing on related topics, or decisions on topics related to a legal subject.

The underlying problem is that legal textual information is expressed in natural language. What literate people read as meaningful words and sentences appears to a computer as just strings of ones and zeros. Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet there is no similarly widespread system for converting the characters into higher levels of structure which correlate with our understanding of meaning. While a search can be made for the string plaintiff, there are no (widely available) searches for a string that represents an individual who bears the role of plaintiff. To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play.
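
To make the first step concrete, here is a minimal Python sketch (our own illustration, not part of any Web standard) showing how a character encoding such as UTF-8 imposes just enough structure on raw bytes to yield characters, and no more:

    # Nine bytes, shown first as raw numbers, then decoded as UTF-8 text.
    raw = bytes([0x70, 0x6C, 0x61, 0x69, 0x6E, 0x74, 0x69, 0x66, 0x66])
    print(list(raw))            # [112, 108, 97, 105, 110, 116, 105, 102, 102]
    print(raw.decode("utf-8"))  # plaintiff

Nothing in this decoding tells the machine that the resulting string names a legal role; that is the gap the rest of this post addresses.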

Semantic Web

The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people. We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. While the XML level may be construed as a ‘lower’ level in the Semantic Web “stack” (i.e., the layers of interrelated technologies that make up the Semantic Web), the XML level is nonetheless crucial to providing information to the higher levels, where ontologies and logic play a role. So as to be clear about the relation between the Semantic Web and NLP, we briefly review aspects of XML by example, and furnish motivations as we go.

Suppose one looks up a case where Harris Hill is the plaintiff and Jane Smith is the attorney for Harris Hill. In a document related to this case, we would see text such as the following portions:

    Harris Hill, plaintiff.
    Jane Smith, attorney for the plaintiff.

While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: Harris and Jane are (very likely) first names, Hill and Smith are last names, Harris Hill and Jane Smith are full names of people, plaintiff and attorney are roles in a legal case, Harris Hill has the role of plaintiff, attorney for is a relationship between two entities, and Jane Smith is in the attorney for relationship to Harris Hill. It would be useful to encode this information into a standardised, machine-readable, and processable form.

XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. It is a highly expressive language, allowing one to define tags that suit one’s purposes so long as the specification requirements are met. One requirement is that each tag has a beginning and an ending; the material in between is the data that is being tagged. For example, suppose tags such as the following, where ... indicates the data:

    <legalcase>...</legalcase>,
    <firstname>...</firstname>,
    <lastname>...</lastname>,
    <fullname>...</fullname>,
    <plaintiff>...</plaintiff>,
    <attorney>...</attorney>,
    <legalrelationship>...</legalrelationship>

Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no crossing over:

    <fullname><firstname>...</firstname>,
    <lastname>...</lastname></fullname>

is acceptable, but

    <fullname><firstname>...<lastname>
    </firstname>...</lastname></fullname>

is unacceptable. Finally, XML tags can be organised into schemas to structure the tags.
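
The tree-structure requirement is exactly what any standard XML parser enforces, so it can be checked mechanically. A minimal sketch in Python, using only the standard library:

    import xml.etree.ElementTree as ET

    def is_well_formed(xml_text):
        """Return True if the tags nest as a proper tree (no crossing over)."""
        try:
            ET.fromstring(xml_text)
            return True
        except ET.ParseError:
            return False

    print(is_well_formed("<fullname><firstname>...</firstname></fullname>"))  # True
    print(is_well_formed("<fullname><firstname>...</fullname></firstname>"))  # False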

With these points in mind, we could represent our fragment as:

    <legalcase>
      <legalrelationship>
        <plaintiff>
          <fullname><firstname>Harris</firstname>,
               <lastname>Hill</lastname></fullname>
        </plaintiff>,
        <attorney>
          <fullname><firstname>Jane</firstname>,
               <lastname>Smith</lastname></fullname>
        </attorney>
      </legalrelationship>
    </legalcase>

We have added structured information, the tags, to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT, so that we have an easier-to-read format.
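
For instance, a short Python sketch using the third-party lxml library can apply a toy XSLT stylesheet (of our own devising, reusing the element names from the example above) to render one fact from the tagged text as HTML:

    from lxml import etree

    doc = etree.XML(
        "<legalcase><legalrelationship>"
        "<plaintiff><fullname><firstname>Harris</firstname>"
        "<lastname>Hill</lastname></fullname></plaintiff>"
        "<attorney><fullname><firstname>Jane</firstname>"
        "<lastname>Smith</lastname></fullname></attorney>"
        "</legalrelationship></legalcase>")

    # A toy stylesheet that renders the attorney's name as an HTML paragraph.
    stylesheet = etree.XML("""\
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/legalcase">
        <p>Attorney: <xsl:value-of select=".//attorney//firstname"/><xsl:text> </xsl:text><xsl:value-of select=".//attorney//lastname"/></p>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    print(str(transform(doc)))  # renders roughly: <p>Attorney: Jane Smith</p>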

Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query “Who is the attorney in the case?” and the answer would be “Jane Smith”. Given a rule language, such as RuleML or the Semantic Web Rule Language (SWRL), and a rule such as “if someone is an attorney for a client, then that client has a privileged relationship with the attorney”, it might follow that the attorney cannot divulge the client’s secrets. Applying such a rule to our sample, we could infer that Jane Smith cannot divulge Harris Hill’s secrets.
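
The following Python sketch imitates this query-and-inference pattern on our fragment. It is a hand-coded stand-in, not RuleML or SWRL: a path query finds the attorney and plaintiff, and a single hard-wired rule draws the privilege inference.

    import xml.etree.ElementTree as ET

    case = ET.fromstring(
        "<legalcase><legalrelationship>"
        "<plaintiff><fullname><firstname>Harris</firstname>"
        "<lastname>Hill</lastname></fullname></plaintiff>"
        "<attorney><fullname><firstname>Jane</firstname>"
        "<lastname>Smith</lastname></fullname></attorney>"
        "</legalrelationship></legalcase>")

    def full_name(element):
        """Read a person's name out of the tagged structure."""
        return f"{element.findtext('.//firstname')} {element.findtext('.//lastname')}"

    # Query: who is the attorney in the case?
    attorney = full_name(case.find(".//attorney"))
    plaintiff = full_name(case.find(".//plaintiff"))
    print(attorney)  # Jane Smith

    # Hard-wired rule: an attorney for a client has a privileged relationship
    # with that client, so the attorney may not divulge the client's secrets.
    print(f"{attorney} cannot divulge {plaintiff}'s secrets.")

A real system would keep such rules in a declarative rule base rather than in code; the point here is only the flow from tagged text to query to inference.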

\"TowerThough it may seem here like too much technology for such a small and obvious task, it is essential where we scale up our queries and inferences on large corpora of legal texts<\/a> — hundreds of thousands if not millions of documents — which comprise vast storehouses of unstructured, yet meaningful data. Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database<\/a> to which further processes can be applied over the Web.<\/p>\n

Natural Language Processing

As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags; the objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge-intensive, since whoever is doing the annotation must know what to look for, and all of the annotators must annotate the text in the same way (inter-annotator agreement) to support the processes. Thus, automation is central.
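
Inter-annotator agreement is commonly quantified with a chance-corrected statistic; Cohen's kappa is one standard choice (the label data below is made up for illustration). A minimal sketch of the computation:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators' label sequences."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[l] * counts_b[l]
                       for l in counts_a.keys() | counts_b.keys()) / n ** 2
        return (observed - expected) / (1 - expected)

    # Two annotators tag five tokens with entity roles (made-up data).
    a = ["PLAINTIFF", "ATTORNEY", "OTHER", "OTHER", "PLAINTIFF"]
    b = ["PLAINTIFF", "ATTORNEY", "OTHER", "ATTORNEY", "PLAINTIFF"]
    print(round(cohens_kappa(a, b), 3))  # 0.706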

Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different, contextually dependent meanings, and (4) dispersed meanings. (Similar points can be made for sentences or other linguistic elements.) Here are examples of these four issues:

(1) “When did you stop taking drugs?” (presupposes that the person being questioned took drugs at some time in the past);
(2) Jane Smith, Jane R. Smith, Smith, Attorney Smith… (different ways to refer to the same person; a toy normaliser is sketched after this list);
(3) the individual referred to by the name “Jane Smith” in one case decision may not be the individual referred to by the name “Jane Smith” in another case decision;
(4) Jane Smith represented Jones Inc. She works for Dewey, Cheetum, and Howe. To contact her, write to j.smith@dch.com.
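
As a toy illustration of issue (2), a system might normalise several surface forms of a name to one canonical identifier. The variant table and identifier below are our own invention; real systems rely on coreference resolution and entity linking rather than a hand-built table.

    import re

    # Hypothetical canonical identifier and hand-listed variants.
    CANONICAL = "person/jane-smith"
    VARIANTS = {"jane smith", "jane r. smith", "smith", "attorney smith"}

    def normalise(mention):
        """Map a surface form to the canonical entity, or None if unknown."""
        cleaned = re.sub(r"\s+", " ", mention.strip().lower())
        return CANONICAL if cleaned in VARIANTS else None

    for m in ["Jane Smith", "Jane R. Smith", "Attorney Smith", "John Doe"]:
        print(m, "->", normalise(m))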

When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as: