VoxPopuLII
To take the words of Walt Whitman, when it comes to improving legal information retrieval (IR), lawyers, legal librarians and informaticians are all, to some extent, “Both in and out of the game, and watching and wondering at it”. The reason is that each group holds only a piece of the puzzle, and as pointed out in an earlier post, they’re talking past each other.
In addition, there appears to be a conservative contingent in each group that actively hinders the kind of cooperation that could generate real progress: lawyers do not always take up technical advances when they are made available, which discourages further research; legal librarians cling to indexing when all modern search technologies use free-text search; and informaticians are frequently impatient with, and misunderstand, the needs of legal professionals.
What’s holding progress back?
At root, what has held legal IR back may be the lack of cross-training of professionals in law and informatics. That said, I’m impressed with the open-mindedness I observe at law and artificial intelligence conferences, and there are some who are breaking out of their comfort zones and neat disciplinary boundaries to address issues in legal informatics.
I recently came back from a visit to the National Institute of Informatics in Japan, where I met Ken Satoh, a logician who, late in his professional career, has just graduated from law school. This is not just hard work; I believe it takes a great deal of character for a seasoned academic to hold the respect of fellow students who are his juniors in years but his seniors in a second degree. But the end result is worth it: a lab with an exemplary balance of lawyers and computer scientists, experience and enthusiasm, pulling side by side.
Still, I occasionally get the feeling we’re all hoping for some sort of miracle to deliver us from the current predicament posed by the explosion of legal information. Legal professionals hope to be saved by technical wizardry, and informaticians like myself are waiting for data provision, methodical legal feedback on system requirements and performance, and in some cases research funding. In other words, we all want someone other than ourselves to get the ball rolling.
The need to evaluate
Take, for example, the lack of large corpora for study, which is one of the biggest stumbling blocks in informatics. Both IR and natural language processing (NLP) currently thrive on experimentation with vast amounts of data, which is used in statistical processing. More data means better statistical estimates and fewer ‘guesses’ at the relevant probabilities. Even commercial legal case retrieval systems, which give the appearance of being Boolean, use statistics and have done so for around 15 years. (They are based on inference networks that simulate Boolean retrieval with weighted indexing by relaxing the rigid conditional probability estimates for the Boolean operators ‘and’, ‘or’ and ‘not’. In this way, document ranking increasingly depends on the number of query constraints met.)
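To make the idea of relaxed Boolean operators concrete, here is a minimal sketch in Python of the sort of soft scoring an inference network permits. It is illustrative only, not any commercial implementation: the query, the documents and the per-term beliefs are all invented.

```python
# A minimal sketch of "soft" Boolean operators of the kind an inference
# network permits. Term beliefs stand in for the system's estimate that a
# document satisfies each query term; all values here are invented.

def soft_and(beliefs):
    # A strict Boolean AND needs every belief to be 1.0; the product
    # degrades gracefully as constraints are only partially met.
    score = 1.0
    for b in beliefs:
        score *= b
    return score

def soft_or(beliefs):
    # Probability that at least one of the constraints is met.
    miss = 1.0
    for b in beliefs:
        miss *= (1.0 - b)
    return 1.0 - miss

# Hypothetical beliefs for the query: negligence AND (duty OR care)
doc_beliefs = {
    "doc_a": {"negligence": 0.9, "duty": 0.8, "care": 0.1},
    "doc_b": {"negligence": 0.6, "duty": 0.0, "care": 0.7},
}

for doc, b in doc_beliefs.items():
    score = soft_and([b["negligence"], soft_or([b["duty"], b["care"]])])
    print(doc, round(score, 3))   # doc_a 0.738, doc_b 0.42
```

Under strict Boolean logic doc_b would simply fail the AND and disappear; under the relaxed version it is still ranked, just below doc_a, which is how ranking comes to depend on how many query constraints are met.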
The problem is that to evaluate new techniques in IR (and thus improve the field), you need not only a corpus of documents to search but also a sample of legal queries and a list of all the relevant documents in response to those queries that exist in your corpus, perhaps even with some indication of how relevant they are. This is not easy to come by. In theory a lot of case data is publicly available, but accumulating and cleaning legal text downloaded from the internet, making it amenable to search, is nothing short of tortuous. Then relevance judgments must be given by legal professionals, which is difficult given that we are talking about a community of people who charge their time by the hour.
Of course, the cooperation of commercial search providers, who own both the data and training sets with relevance judgments, would make everyone’s life much easier, but for obvious commercial reasons they keep their data to themselves.
To see the productive effects of a good data set, we need only look at the research boom now occurring in e-discovery (discovery of electronic evidence, or DESI). In 2006 the TREC Legal Track, including a large evaluation corpus, was established in response to the number of trials requiring e-discovery: 75% of Fortune 500 company trials, with more than 90% of company information now stored electronically. The track has generated so much interest that an annual DESI workshop has been held since 2007.
Qualitative evaluation of IR performance by legal professionals is an alternative to the quantitative evaluation usually applied in informatics. The development of new ways to visualize and browse results seems particularly well suited to this approach, where we want to know whether users perceive new interfaces to be genuine improvements. Considering the history of legal IR, qualitative evaluation may be as important as traditional IR evaluation metrics of precision and recall. (Precision is the number of relevant items retrieved out of the total number of items retrieved, and recall is the number of relevant items retrieved out of the total number of relevant items in a collection). However, it should not be the sole basis for evaluation.
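For readers who prefer numbers to definitions, here is a tiny worked example of precision and recall, using made-up case identifiers.

```python
# Worked example of the precision and recall definitions above,
# using made-up case identifiers.

retrieved = {"case_1", "case_2", "case_3", "case_4"}
relevant  = {"case_2", "case_4", "case_7", "case_9", "case_11"}

true_positives = retrieved & relevant             # {"case_2", "case_4"}

precision = len(true_positives) / len(retrieved)  # 2 / 4 = 0.50
recall    = len(true_positives) / len(relevant)   # 2 / 5 = 0.40

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision can be estimated from the result list alone; recall cannot, because it requires knowing every relevant document in the collection, which is partly why subjective impressions of search performance can mislead.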
A well-known study by Blair and Maron makes this point plain. The authors showed that expert legal researchers retrieve less than 20% of relevant documents when they believe they have found over 75%. In other words, even experts can be very poor judges of retrieval performance.
Context in legal retrieval
Setting this aside, where do we go from here? Dan Dabney argued at the American Association of Law Libraries (AALL) 2005 Annual Meeting that free-text search decontextualizes information, and he is right. One thing to notice about current methods in open-domain IR, including vector space models, probabilistic models and language models, is that the only context they take into account is proximate terms (phrases). At heart, they treat all terms as independent.
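Here is a small illustration of that independence assumption: a bag-of-words cosine score cannot tell apart two documents built from exactly the same words in different orders, even though only one of them expresses the concept in the query. The texts are invented for the example.

```python
# Bag-of-words scoring treats terms as independent: two documents built
# from exactly the same words in different orders get identical scores,
# even though only the first expresses the concept in the query.

import math
from collections import Counter

def cosine(query, doc):
    dot = sum(count * doc.get(term, 0) for term, count in query.items())
    norm = math.sqrt(sum(c * c for c in query.values())) * \
           math.sqrt(sum(c * c for c in doc.values()))
    return dot / norm if norm else 0.0

query = Counter("landlord breach of quiet enjoyment".split())
doc_1 = Counter("the landlord was in breach of the covenant of quiet enjoyment".split())
doc_2 = Counter("enjoyment of the quiet was in breach of the landlord covenant".split())

print(cosine(query, doc_1) == cosine(query, doc_2))   # True
```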
However, it’s risky to draw the conclusion reported from the same meeting: “Using indexes improves accuracy, eliminates false positive results, and leads to completion in ways that full-text searching simply cannot.” I would be interested to know whether this remains a general perception amongst legal librarians, despite a more recent emphasis on innovating with technologies that don’t encroach upon the sacred ground of indexing. Perhaps there’s a misconception that capitalizing on full-text search methods would necessarily replace the use of index terms. This isn’t the case; the inference networks used in commercial legal IR are not applied in the open domain, and one of their advantages is that they can incorporate any number of types of evidence.
Specifically, index numbers, terms, phrases, citations, topics and any other desired information are treated as representation nodes in a directed acyclic graph (the network). This graph is used to estimate the probability of a user’s information need being met given a document.
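As a rough picture of how that pooling of evidence works, here is a sketch in Python. It is illustrative only, not the actual commercial network; the evidence types, weights and beliefs are invented.

```python
# Illustrative only (not the commercial network): different kinds of
# evidence about one document -- free-text terms, an editorially assigned
# index number, and citations -- each contribute a belief, and a node
# pools them into a single estimate that the document meets the
# information need. Weights and beliefs are invented.

evidence_weights = {"terms": 0.5, "index_number": 0.3, "citations": 0.2}

def combined_belief(evidence):
    # A weighted sum is one simple way a network node can pool the
    # beliefs of its parent (representation) nodes.
    return sum(evidence_weights[kind] * belief
               for kind, belief in evidence.items())

doc_evidence = {
    "terms": 0.72,         # strength of the match on query terms and phrases
    "index_number": 1.0,   # document carries the relevant index number
    "citations": 0.4,      # cites one of the authorities in the query
}

print(round(combined_belief(doc_evidence), 3))   # 0.74
```

The point is simply that index terms and full-text evidence are not rivals inside such a model; they are parallel sources of belief.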
For the time being, lawyers, unaware of the technology under the hood, default to using inference networks in a way that is familiar, via a search interface that easily incorporates index terms and looks like a Boolean search. (Inference nets are not Boolean, but they can be made to behave in the same way.) While Boolean search does tend to be more precise than other methods, the more data there is to search, the less well it performs. Further, it’s not all about precision. Recall of relevant documents is also important, and this can be a weak point for Boolean retrieval. Eliminating false positives is no accolade when true positives are eliminated at the same time.
Since the current predicament is an explosion of data, arguing for indexing by contrasting it with full-text retrieval without considering how they might work together seems counterproductive.
Perhaps instead we should be looking at revamping the legal reliance on a Boolean-style interface so that we can make better use of full-text search. This will be difficult. Lawyers, who are charged, and charge, per search, must be able to demonstrate the value of each search to clients; they can’t afford the iterative querying that characterizes open-domain browsing. Further, if the intelligence is inside the retrieval system, rather than held by legal researchers in the form of knowledge about how to put complex queries together, how are search costs justified? Although Boolean queries are no longer well adapted, at least their value is easy to demonstrate. A push towards free-text search by either legal professionals or commercial search providers will demand a rethink of billing structures.
Given our current systems, are there incremental ways we can improve results from full-text search? Query expansion is a natural consideration and incidentally overlaps with much of the technology underlying graphical means of data exploration such as word clouds and wonderwheels; the difference is that query expansion goes on behind the scenes, whereas in graphical methods the user is allowed to control the process. Query expansion helps the user find terms they hadn’t thought of, but this doesn’t help with the decontextualization problem identified by Dabney; it simply adds more independent terms or phrases.
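A toy version of behind-the-scenes expansion makes the limitation obvious. The expansion table below is invented; a real system would derive related terms from co-occurrence statistics, a thesaurus, or relevance feedback.

```python
# Toy behind-the-scenes query expansion. The expansion table is invented;
# a real system would derive related terms from co-occurrence statistics,
# a thesaurus, or relevance feedback.

expansion_table = {
    "dismissal": ["termination", "redundancy"],
    "unfair": ["wrongful"],
}

def expand(query_terms):
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(expansion_table.get(term, []))
    return expanded

print(expand(["unfair", "dismissal"]))
# ['unfair', 'dismissal', 'wrongful', 'termination', 'redundancy']
# The result is still a bag of independent terms: nothing says how they
# must relate to one another in a relevant document.
```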
In order to contextualize information, we can marry search over text terms with index numbers, as is already done. Even better would be to do some linguistic analysis of a query to really narrow down the situations in which we want terms to appear. In this way we might get at questions such as “What happened in a case?” or “Why did it happen?”, rather than just “What is this document about?”.
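As a hedged sketch of what such linguistic analysis might look like, the snippet below uses a dependency parser to pull who-did-what-to-whom relations out of a query, so that matching could be restricted to those relations rather than to loose terms. It assumes spaCy and its small English model are installed; the example sentence and the extraction rules are my own simplifications, not a production method.

```python
# Hedged sketch: use a dependency parse of the query to extract
# who-did-what-to-whom relations, so matching can target those relations
# rather than loose, independent terms. Assumes spaCy and its small
# English model are installed (pip install spacy;
# python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")

def relations(text):
    triples = []
    for token in nlp(text):
        if token.pos_ == "VERB":
            subjects = [c.lemma_ for c in token.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.lemma_ for c in token.children
                       if c.dep_ in ("dobj", "attr")]
            triples.extend((s, token.lemma_, o)
                           for s in subjects for o in objects)
    return triples

print(relations("The tenant breached the lease after the landlord withheld repairs."))
# e.g. [('tenant', 'breach', 'lease'), ('landlord', 'withhold', 'repair')]
```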
Language processing and IR
Use of linguistic information in IR isn’t a novel idea. In the 1980s, IR researchers started to think about incorporating NLP as an intrinsic part of retrieval. Many of the early approaches attempted to use syntactic information for improving term indexing or weighting. For example, Fagan improved performance by applying syntactic rules to extract similar phrases from queries and documents and then using them for direct matching, but it was held that this was comparable to a less complex, and therefore preferable, statistical approach to language analysis. In fact, Fagan’s work demonstrated early on what is now generally accepted: statistical methods that do not assume any knowledge of word meaning or linguistic role are surprisingly (some would say depressingly) hard to beat for retrieval performance.
Since then there have been a number of attempts to incorporate NLP in IR, but depending on the task involved, there can be a lack of automatic linguistic analysis methods that are both highly accurate and robust enough to handle unexpected and complex language constructions. (There are exceptions; part-of-speech tagging, for example, is highly accurate.) The result is that gains in retrieval performance are often offset by new errors, leaving a minimal positive, or even a negative, impact on overall performance. This makes NLP techniques not worth the price of the additional computational overhead in time and data storage.
However, just because the task isn’t easy doesn’t mean we should give up. Researchers, including myself, are looking afresh at ways to incorporate NLP into IR. This is being encouraged by organizations such as the NII Test Collection for IR Systems Project (NTCIR), which from 2003 to 2006 amassed excellent test and training data for patent retrieval, with corpora in Japanese and English and queries in five languages. Its focus has recently shifted towards NLP tasks associated with retrieval, such as patent data mining and classification. Further, its corpora enable study of the cross-language retrieval issues that become important in e-discovery, since only a minority of a global corporation’s electronic information will be in English.
We stand on the threshold of what will be a period of rapid innovation in legal search driven by the integration of existing knowledge bases with emerging full-text processing technologies. Let’s explore the options.
K. Tamsin Maxwell is a PhD candidate in informatics at the University of Edinburgh, focusing on information retrieval in law. She has a MSc in cognitive science and natural language engineering, and has given guest talks at the University of Bologna, UMass Amherst, and NAIST in Japan. Her areas of interest include text processing, machine learning, information retrieval, data mining and information extraction. Her passion outside academia is Arabic dance.
VoxPopuLII is edited by Judith Pratt.
Been writing and thinking about this subject for years. This might be the most comprehensive treatment of the subject I’ve seen (search + legal + semantic elements). I’ll be reading this over later, but so far I’m impressed. Here’s the real question: why haven’t Westlaw and Lexis incorporated these points into their products? Is it too simple or too cynical to observe that more efficient searches = less time searching = less revenue? Cynical, but true.
I think the problems you identify with the uptake of new technologies within the legal community (and the very real lack of funding for development) stem from several different areas:
1. The e-disclosure / discovery market is US-led and so necessarily focuses on US issues. When I worked for a major US e-discovery service provider in the UK, simple and important European issues were routinely pushed to the back of the development list (Cyrillic and Arabic language searching capacities, to name but one).
2. Lawyers make a LOT of money reviewing documents; it is not in their interests to have a document set which only contains documents relevant to their case.
3. Lawyers are by their nature pragmatists and are distrustful of computers and technology. I have had meetings with partners in City law firms who have their secretaries print out their e-mails; they then hand-write the reply below the mail and have their secretaries type and send it (genuinely). Couple that with having to sign a disclosure statement which says:
“I certify that I understand the duty of disclosure and to the best of my knowledge I have carried out that duty. I certify that the list above is a complete list of all documents which are or have been in my control and which I am obliged under the said order to disclose.”
and you’ll find that many lawyers simply will not put their trust in new technology. They would far rather use older “tried and tested” technology, which means they have to do slightly more document review (coincidentally taking up a few more billable hours), and sign the statement in good faith, avoiding the potential for satellite litigation around the disclosure process itself.
4. The CPR in the UK encourages parties to co-operate in the disclosure process, so for new technologies to be adopted both parties need to understand and trust them enough to agree to their use. This means that parties tend to agree on more basic search technologies, stifling the uptake of new ones.
Over the past 10 years things have moved on, but no one in the legal community wants to be at the bleeding edge; they prefer a far more comfortable position just behind it, which naturally slows everything down.
My thoughts on a Friday night!
regards
Mike Taylor
Great points Mike, thanks.
Re: why haven’t Westlaw and Lexis incorporated these points into their products?
This is a good question. The answer is that commercial operators did initiate a move in this direction way back in 1993, with the availability of ‘natural language’ search. (The term ‘natural language’ is used here loosely; the method involved removing frequent words from free text, converting the remaining words into their canonical base forms, and recognising phrases through database statistics and syntactic means.) However, even though this was clearly shown to be superior to Boolean retrieval, it remained, and remains, less popular with legal professionals. With no commercial reason to invest in further related R&D, the focus shifted to other areas; for example, citation searching and bolt-on features that could be billed as extras.
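For the curious, here is a rough Python sketch of that kind of pipeline: drop frequent words, reduce the rest to a canonical base form, and keep word pairs that collection statistics mark as phrases. The stopword list and phrase table are invented, and NLTK’s Porter stemmer stands in for whatever the original systems actually used.

```python
# Rough sketch of a 1993-style "natural language" query pipeline: drop
# frequent words, stem the rest, and keep bigrams that collection
# statistics mark as phrases. Stopwords and phrase table are invented;
# assumes NLTK is installed.

from nltk.stem.porter import PorterStemmer

STOPWORDS = {"the", "a", "an", "is", "what", "for", "of", "in"}
KNOWN_PHRASES = {("proxim", "caus"), ("punit", "damag")}   # from collection statistics

stem = PorterStemmer().stem

def process(query):
    terms = [stem(w) for w in query.lower().split() if w not in STOPWORDS]
    phrases = [pair for pair in zip(terms, terms[1:]) if pair in KNOWN_PHRASES]
    return terms, phrases

print(process("What is the proximate cause for punitive damages"))
# (['proxim', 'caus', 'punit', 'damag'], [('proxim', 'caus'), ('punit', 'damag')])
```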
There is also plenty of discussion about the objective of a legal IR system, which has interesting implications. Dabney and Berring have argued that advocates require nothing less than recall of every relevant case in a database; as Mike pointed out, lawyers are liable for not being fully informed, so from that perspective the statement is plainly true. However, it has also been argued that the most pressing task in online legal research is quickly and easily finding a few on-point cases from which other cases can be traced by traditional means.
Uptake of new technologies may be more important if you hold the latter belief, since precision is particularly problematic for current database systems. A move towards this position may be observed with the Google generation, who expect a search to immediately find the top relevant documents, from which they can follow links to other documents that interest them.
Tamsin Maxwell