Artisanal Algorithms

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Artisanal Cheese. Source: Wikimedia Commons

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything (the document corpus, the index, the query, and the results) was mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines (huge algorithms, basically) would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

A search for delegated search

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

Empowering the Searcher by Failing Consistently

In the context of legal research, then, it makes sense to consider algorithms as secondary sources. Law librarians and legal research instructors can explain the advantages of controlled vocabularies like the Topic and Key Number System®, of annotated statutes, and of citators. In several legal research textbooks, full-text keyword searching is anathema because, I suppose, no one knows what happens directly after you type the words into the box and click search. It seems frightening. We are leaping without looking, trusting our searches to some kind of computer voodoo magic.

This makes sense: search algorithms are often highly guarded secrets, even if what they select for (timeliness, popularity, and dwell time, to name a few) is made known. They are opaque. They apparently do not behave reliably, at least in some cases. But can’t the same be said for non-algorithmic information tools, too? Do we really know which types of factors figure into the highly vaunted editorial judgment of professionals?

To take the examples listed above: yes, we know what the Topics and Key Numbers are, but do we really know them well enough to explain why they work the way they do, or what biases are baked in from over a century of growth and change? Without greater transparency, I can’t tell you.

How about annotated statutes: who knows how many of the cases cited on online platforms are holdovers from the soon-to-be print publications of yesteryear? In selecting those cases, surely the editors had to choose to omit some, or perhaps many, because of space constraints. How, then, did the editors determine which cases were most on-point in interpreting a given statutory section, that is, which were most relevant? What algorithms are being used today to rank the list of annotations? Again, without greater transparency, I can’t tell you.

And when it comes to citators, why is there so much discrepancy between a case’s classification and which later-citing cases are presented as evidence of this classification? There have been several recent studies, like this one and this one, looking into the issue, but more research is certainly needed.

Finally, research in many fields is telling us that human judgments of relevance are highly subjective in the first place. At least one court has said that algorithmic predictive coding is better at finding relevant documents during pretrial e-discovery than humans are.

Where are the relevant documents? Source: CC BY 2.0, flickr user gosheshe

I am not presenting these examples to discredit subjectivity in the creation of information tools. What I am saying is that the dichotomy between editorial and algorithmic, between human and machine, is largely a false one. Both are subjective. But why is this important?

Search algorithms, when they are made transparent to researchers, librarians, and software developers (i.e. they are “open source”), do have at least one distinct advantage over other forms of secondary sources—when they fail, they fail consistently. After the fact or even in close to real-time, it’s possible to re-program the algorithm when it is not behaving as expected.

Another advantage to thinking of algorithms as just another secondary source is that, demystified, they can become a less privileged (or, depending on your point of view, less demonized) part of the research process. The assumption that the magic box will do all of the work for you is just as dangerous as the assumption that the magic box will do nothing for you. Teaching about search algorithms allows for an understanding of them, especially if the search algorithms are clear about which editorial judgments have been prioritized.

Beyond Search, Or How I Learned to Stop Worrying and Love Automated Research Tools

As an employee at Fastcase, Inc. this past summer, I had the opportunity to work on several innovative uses of algorithms in legal research, most notably on the new automated citation-analysis tool Bad Law Bot. Bad Law Bot, at least in its current iteration, works by searching the case law corpus for significant signals—words, phrases, or citations to legal documents—and, based on criteria selected in advance, determines whether a case has been given negative treatment in subsequent cases. The tool is certainly automated, but the algorithm is artisanal—it was massaged and kneaded by caring craftsmen to deliver a premium product. The results it delivered were also tested meticulously to find out where the algorithm had failed. And then the process started over again.
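
To make the idea of "searching for significant signals" concrete, here is a minimal sketch in Python. It is not Bad Law Bot's actual method, which is not described here in any detail. The signal phrases, the `find_negative_signals` function, and the 200-character window are all assumptions chosen for illustration; a real tool would rely on criteria selected and tested by people who know the case law.

```python
import re

# Hypothetical signal phrases, chosen only for illustration.
NEGATIVE_SIGNALS = [
    r"\boverrul(?:ed|es|ing)\b",
    r"\babrogat(?:ed|es|ing)\b",
    r"\bsuperseded\b",
    r"\breversed\b",
]


def find_negative_signals(citing_text, cited_name, window=200):
    """Return signal phrases found near each mention of the cited case.

    `window` is the number of characters of context examined on either
    side of a mention (an assumed value, not taken from any real tool).
    """
    hits = []
    for mention in re.finditer(re.escape(cited_name), citing_text):
        start = max(0, mention.start() - window)
        end = mention.end() + window
        context = citing_text[start:end]
        for pattern in NEGATIVE_SIGNALS:
            for match in re.finditer(pattern, context, flags=re.IGNORECASE):
                hits.append(match.group(0).lower())
    return hits


opinion = ("In Smith v. Jones, the court held X. "
           "That decision was overruled by later authority.")
print(find_negative_signals(opinion, "Smith v. Jones"))
```

Every choice in this sketch (which phrases count, how wide the window is) is an editorial judgment, and each failure it produces can be traced back to one of those choices and corrected.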

This is just one example of what I think the future of much general legal research will look like—smart algorithms built and tested by people, taking advantage of near unlimited storage space and ever-increasing computing power to process huge datasets extremely fast. Secondary sources, at least the ones organizing, classifying, and grouping primary law, will no longer be static things. Rather, they will change quickly when new documents are available or new uses for those documents are dreamed up. It will take hard work and a realistic set of expectations to do it well.

Computer-assisted legal research cannot be about merely returning ranked lists of relevant results, even as today’s algorithms get better and better at producing these lists. Search must be only one component of a holistic research experience in which the searcher consults many tools which, used together, are greater than the sum of their parts. Many of those tools will be built by information professionals and software engineers using algorithms, and will be capable of being updated and changed as the corpus and user needs change.

It’s time that we stop thinking of algorithms as alien, or other, or too complicated, or scary. Instead, we should think of them as familiar and human, as sets of instructions hand-crafted to help us solve problems with research tools that we have not yet been able to solve, or that we did not know were problems in the first place.

Aaron Kirschenfeld is currently pursuing a dual J.D. / M.S.I.S. at the University of North Carolina at Chapel Hill. His main research interests are legal research instruction, the philosophy and aesthetics of legal citation analysis, and privacy law. You can reach him on Twitter @kirschsubjudice.

His views do not represent those of his part-time employer, Fastcase, Inc. Also, he has never hand-crafted an algorithm, let alone a wheel of cheese, but appreciates the work of those who do immensely.


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

[Editor’s Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog post “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective on the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews, everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting, refer to techniques that are now 10, 15, even 20 years old. She also refers to the release of the first natural language legal search engine by West back in 1993, WIN (Westlaw Is Natural) [2]. Adding to this on-going conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. The objective we hope to achieve in this posting is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that, unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext, separate the function of document retrieval from document ranking. Whereas the first function, retrieval, primarily addresses recall, ensuring that all potentially relevant documents are retrieved, the second function focuses on the ideal ranking of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two search functions as one and the same. So what is the difference? The document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993. What is dramatically different is the evidence considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.
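
The separation of retrieval from ranking can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not a description of how WestlawNext is implemented: the `retrieve` and `rank` functions, the term-overlap scoring, and the `extra_score` dictionary (standing in for evidence beyond the document text) are all hypothetical.

```python
def retrieve(query, corpus):
    """Stage 1 (recall-oriented): keep every document sharing a query term."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items()
            if terms & set(text.lower().split())]


def rank(query, candidates, corpus, extra_score):
    """Stage 2 (precision-oriented): order candidates by combined evidence.

    `extra_score` is a hypothetical stand-in for non-textual evidence
    such as annotations, citations, or aggregated usage.
    """
    terms = set(query.lower().split())

    def score(doc_id):
        words = corpus[doc_id].lower().split()
        overlap = sum(1 for word in words if word in terms)
        return overlap + extra_score.get(doc_id, 0.0)

    return sorted(candidates, key=score, reverse=True)


corpus = {
    "a": "church campaign speech",
    "b": "church property tax",
    "c": "maritime salvage",
}
candidates = retrieve("church campaign", corpus)
print(rank("church campaign", candidates, corpus, {}))
print(rank("church campaign", candidates, corpus, {"b": 5.0}))
```

The first stage decides only what is in the running; the second stage can then change the order as new kinds of evidence are added, without touching retrieval.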

Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question, and, frequently, the granularity of these citations is not merely at a document level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on citation frequency, but also the polarity and scope of the arguments they make.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]). (See Figure 1) This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of leveraging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.
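
As a rough illustration of how such aggregated evidence might be tabulated, the Python sketch below counts how often each document is opened and how often two documents are opened in the same research session. The session format and the `aggregate` function are assumptions made for this example; nothing here reflects how any commercial provider actually stores or weights its usage data.

```python
from collections import Counter
from itertools import combinations


def aggregate(sessions):
    """Tabulate document opens and same-session co-occurrence.

    `sessions` is a list of anonymous sessions, each a list of the
    document ids opened in that session (a hypothetical log format).
    """
    opens = Counter()
    together = Counter()
    for docs in sessions:
        unique = sorted(set(docs))  # count each document once per session
        opens.update(unique)
        together.update(combinations(unique, 2))
    return opens, together


sessions = [["A", "B"], ["A", "B", "C"], ["A"]]
opens, together = aggregate(sessions)
print(opens.most_common())
print(together.most_common())
```

Because only counts across sessions are kept, no individual user's behavior is needed to estimate which documents tend to travel together.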

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowing what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.
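
A toy version of such a training round can be written in Python. The sketch assumes one numeric feature per view and uses a simple perceptron-style pairwise update. That is a stand-in technique picked for brevity, not the learning method used in WestlawNext, which is not specified here. The names `VIEWS`, `score`, and `train_pairwise`, and the feature values, are hypothetical.

```python
VIEWS = ["document", "annotation", "citation", "user"]


def score(features, weights):
    """Weighted sum of the evidence from each view."""
    return sum(weights[view] * features[view] for view in VIEWS)


def train_pairwise(pairs, rounds=20, learning_rate=0.1):
    """Learn view weights from (better, worse) document pairs.

    Whenever the current weights fail to rank the better document
    first, nudge the weights toward the features that distinguish it.
    """
    weights = {view: 0.0 for view in VIEWS}
    for _ in range(rounds):
        for better, worse in pairs:
            if score(better, weights) <= score(worse, weights):
                for view in VIEWS:
                    weights[view] += learning_rate * (better[view] - worse[view])
    return weights


# Made-up training pair: experts judged the first document more relevant.
better = {"document": 0.5, "annotation": 0.9, "citation": 0.8, "user": 0.7}
worse = {"document": 0.9, "annotation": 0.1, "citation": 0.2, "user": 0.1}
weights = train_pairwise([(better, worse)])
print(weights)
print(score(better, weights) > score(worse, weights))
```

In this made-up pair, the judged-better document has the weaker text match, so training shifts weight away from the document view and toward the annotation, citation, and user views.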

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drown in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option that subsequently provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.


Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.


Figure 3: A view of a case and complementary materials from WestlawNext

By selecting these sets of relevant topics, a set of recommended cases will be rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.


Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.


[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009.
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West’s Key Number System:
[4] West’s KeyCite Citator Service:
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Built-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development  (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West’s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience applying NLP technologies to a variety of data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of the WestlawNext research team. He has a Ph.D. in computer science and engineering from the State University of New York at Buffalo. He is now a managing associate at Kore Federal in the Washington, D.C. area.


To borrow the words of Walt Whitman, when it comes to improving legal information retrieval (IR), lawyers, legal librarians and informaticians are all, to some extent, “Both in and out of the game, and watching and wondering at it.” The reason is that each group holds only a piece of the solution to the puzzle, and as pointed out in an earlier post, they’re talking past each other.

In addition, there appears to be a conservative contingent in each group that actively hinders the kind of cooperation that could generate real progress: lawyers do not always take up technical advances when they are made available, thus discouraging further research; legal librarians cling to indexing when all modern search technologies use free-text search; and informaticians are frequently impatient with, and misunderstand, the needs of legal professionals.

What’s holding progress back?

At root, what has held legal IR back may be the lack of cross-training of professionals in law and informatics, although I’m impressed with the open-mindedness I observe at law and artificial intelligence conferences, and there are some who are breaking out of their comfort zone and neat disciplinary boundaries to address issues in legal informatics.

I recently came back from a visit to the National Institute of Informatics in Japan where I met Ken Satoh, a logician who, late in his professional career, has just graduated from law school. This is not just hard work. I believe it takes a great deal of character for a seasoned academic to maintain students’ respect when they are his seniors in a secondary degree. But the end result is worth it: a lab with an exemplary balance of lawyers and computer scientists, experience and enthusiasm, pulling side-by-side.

Still, I occasionally get the feeling we’re all hoping for some sort of miracle to deliver us from the current predicament posed by the explosion of legal information. Legal professionals hope to be saved by technical wizardry, and informaticians like myself are waiting for data provision, methodical legal feedback on system requirements and performance, and in some cases research funding. In other words, we all want someone other than ourselves to get the ball rolling.

Miracle Occurs

The need to evaluate

Take, for example, the lack of large corpora for study, which is one of the biggest stumbling blocks in informatics. Both IR and natural language processing (NLP) currently thrive on experimentation with vast amounts of data, which is used in statistical processing. More data means better statistical estimates and fewer ‘guesses’ at relevant probabilities. Even commercial legal case retrieval systems, which give the appearance of being Boolean, use statistics and have done so for around 15 years. (They are based on inference networks that simulate Boolean retrieval with weighted indexing by relaxing the rigid conditional probability estimates for the Boolean operators ‘and’, ‘or’ and ‘not’. In this way, document ranking increasingly depends on the number of query constraints met.)
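
One common textbook way to soften Boolean operators is to combine per-term belief values between 0 and 1 instead of hard true/false matches. The Python sketch below uses the familiar product-based forms for ‘and’, ‘or’ and ‘not’. The exact formulas used in commercial systems are not given here, so these operator definitions, the function names, and the belief values are assumptions for illustration only.

```python
from math import prod


def soft_and(beliefs):
    """All constraints: the product of the individual beliefs."""
    return prod(beliefs)


def soft_or(beliefs):
    """Any constraint: one minus the chance that every constraint fails."""
    return 1.0 - prod(1.0 - belief for belief in beliefs)


def soft_not(belief):
    """Negation: the complement of the belief."""
    return 1.0 - belief


# Hypothetical per-term beliefs for two documents and the query
# "negligence AND pool": doc_1 matches both terms well, doc_2 only one.
doc_1 = soft_and([0.9, 0.8])
doc_2 = soft_and([0.9, 0.2])
print(doc_1, doc_2)
```

With hard Boolean logic, a document that only weakly matches one term would either pass or fail outright. Here it simply receives a lower score, so ranking reflects how many query constraints are met and how strongly.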

The problem is that to evaluate new techniques in IR (and thus improve the field), you need not only a corpus of documents to search but also a sample of legal queries and a list of all the relevant documents in response to those queries that exist in your corpus, perhaps even with some indication of how relevant they are. This is not easy to come by. In theory a lot of case data is publicly available, but accumulating and cleaning legal text downloaded from the internet, making it amenable to search, is nothing short of torturous. Then relevance judgments must be given by legal professionals, which is difficult given that we are talking about a community of people who charge for their time by the hour.

Of course, the cooperation of commercial search providers, who own both the data and training sets with relevance judgments, would make everyone’s life much easier, but for obvious commercial reasons they keep their data to themselves.

To see the productive effects of a good data set we need only look at the research boom now occurring in e-discovery (discovery of electronic evidence, or DESI). In 2006 the TREC Legal Track, including a large evaluation corpus, was established in response to the number of trials requiring e-discovery: 75% of Fortune 500 company trials, with more than 90% of company information now stored electronically. This has generated so much interest that an annual DESI workshop has been held since 2007.

Qualitative evaluation of IR performance by legal professionals is an alternative to the quantitative evaluation usually applied in informatics. The development of new ways to visualize and browse results seems particularly well suited to this approach, where we want to know whether users perceive new interfaces to be genuine improvements. Considering the history of legal IR, qualitative evaluation may be as important as traditional IR evaluation metrics of precision and recall. (Precision is the number of relevant items retrieved out of the total number of items retrieved, and recall is the number of relevant items retrieved out of the total number of relevant items in a collection). However, it should not be the sole basis for evaluation.
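
The two definitions translate directly into a few lines of Python. The function name and the document ids below are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = relevant items retrieved / total items retrieved
    recall    = relevant items retrieved / total relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of them relevant; five relevant overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```

Note that recall cannot be computed without knowing every relevant document in the collection, which is exactly why complete relevance judgments are so hard to come by.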

A well-known study by Blair and Maron makes this point plain. The authors showed that expert legal researchers retrieve less than 20% of relevant documents when they believe they have found over 75%. In other words, even experts can be very poor judges of retrieval performance.

Context in legal retrieval


Setting this aside, where do we go from here? Dan Dabney argued at the American Association of Law Libraries (AALL) 2005 Annual Meeting that free text search decontextualizes information, and he is right. One thing to notice about current methods in open domain IR, including vector space models, probabilistic models and language models, is that the only context they take into account is proximate terms (phrases). At heart, they treat all terms as independent.

However, it’s risky to conclude what was reported from the same meeting: “Using indexes improves accuracy, eliminates false positive results, and leads to completion in ways that full-text searching simply cannot.” I would be interested to know if this continues to be a general perception amongst legal librarians despite a more recent emphasis on innovating with technologies that don’t encroach upon the sacred ground of indexing. Perhaps there’s a misconception that capitalizing on full-text search methods would necessarily replace the use of index terms. This isn’t the case; inference networks used in commercial legal IR are not applied in the open domain, and one of their advantages is that they can incorporate any number of types of evidence.

Specifically, index numbers, terms, phrases, citations, topics and any other desired information are treated as representation nodes in a directed acyclic graph (the network). This graph is used to estimate the probability of a user’s information need being met given a document.
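To make the idea of combining evidence concrete, here is a toy Python sketch. It is a drastic simplification written in the spirit of an inference network, not a description of how any commercial system works: the belief values, the node names and the weights are all invented, and a real network would estimate them from the collection. Each document supplies a belief for each representation node, and those beliefs are combined up the graph into a single belief that the information need is met.

```python
from math import prod

def belief_or(beliefs):
    """OR combination: true unless every parent node is false."""
    return 1 - prod(1 - b for b in beliefs)

def belief_wsum(beliefs, weights):
    """Weighted average of parent beliefs."""
    return sum(w * b for b, w in zip(beliefs, weights)) / sum(weights)

# Hypothetical belief that each representation node holds, given the document.
evidence = {
    "case_A": {"term:retaliation": 0.8, "term:title vii": 0.6, "index:civil rights": 0.9},
    "case_B": {"term:retaliation": 0.9, "term:title vii": 0.7, "index:civil rights": 0.1},
}

def need_belief(nodes):
    """Combine full-text evidence and index evidence into one belief."""
    text = belief_or([nodes["term:retaliation"], nodes["term:title vii"]])
    index = nodes["index:civil rights"]
    # Assumed weighting: index evidence counts twice as much as text evidence.
    return belief_wsum([text, index], weights=[1, 2])

ranking = sorted(evidence, key=lambda doc: need_belief(evidence[doc]), reverse=True)
print(ranking)
```

The second document matches the query terms slightly better, but the first ranks higher because the index evidence supports it. Text terms and index terms contribute to the same score rather than competing.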

For the time being lawyers, unaware of technology under the hood, default to using inference networks in a way that is familiar, via a search interface that easily incorporates index terms and looks like a Boolean search. (Inference nets are not Boolean but they can be made to behave in the same way.) While Boolean search does tend to be more precise than other methods, the more data there is to search the less well the system performs. Further, it’s not all about precision. Recall of relevant documents is also important and this can be a weak point for Boolean retrieval. Eliminating false positives is no accolade when true positives are eliminated at the same time.

Since the current predicament is an explosion of data, arguing for indexing by contrasting it with full-text retrieval without considering how they might work together seems counterproductive.

Perhaps instead we should be looking at revamping legal reliance on a Boolean-style interface so that we can make better use of full-text search. This will be difficult. Lawyers who are charged, and charge, per search, must be able to demonstrate the value of each search to clients; they can’t afford the iterative approach that characterizes open domain browsing. Further, if the intelligence is inside the retrieval system, rather than held by legal researchers in the form of knowledge about how to put complex queries together, how are search costs justified? Although Boolean queries are no longer well-adapted, at least their value is easy to demonstrate. A push towards free-text search by either legal professionals or commercial search providers will demand a rethink of billing structures.

Given our current systems, are there incremental ways we can improve results from full-text search? Query expansion is a natural consideration and incidentally overlaps with much of the technology underlying graphical means of data exploration such as word clouds and wonderwheels; the difference is that query expansion goes on behind the scenes, whereas in graphical methods the user is allowed to control the process. Query expansion helps the user find terms they hadn’t thought of, but this doesn’t help with the decontextualization problem identified by Dabney; it simply adds more independent terms or phrases.
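A minimal Python sketch of behind-the-scenes query expansion shows why it leaves the decontextualization problem untouched. The thesaurus here is a tiny hand-made dictionary invented for illustration; real systems typically derive related terms from the collection or from a curated vocabulary.

```python
# Hypothetical related-term table; not taken from any real legal thesaurus.
THESAURUS = {
    "retaliation": ["reprisal", "retaliatory"],
    "sex": ["gender"],
}

def expand_query(query, thesaurus):
    """Append related terms to the user's terms, preserving order, no duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

expanded = expand_query("sex retaliation", THESAURUS)
print(expanded)
```

The result is a longer bag of words. Nothing in it records how the terms relate to one another, so the expanded query is no more contextual than the original.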

In order to contextualize information we can marry search using text terms with search using index numbers, as is already done. Even better would be to do some linguistic analysis of a query to really narrow down the situations in which we want terms to appear. In this way we might get at questions such as “What happened in a case?” or “Why did it happen?” rather than just “What is this document about?”

Language processing and IR

Use of linguistic information in IR isn’t a novel idea. In the 1980s, IR researchers started to think about incorporating NLP as an intrinsic part of retrieval. Many of the early approaches attempted to use syntactic information for improving term indexing or weighting. For example, Fagan improved performance by applying syntactic rules to extract similar phrases from queries and documents and then using them for direct matching, but it was held that this was comparable to a less complex, and therefore preferable, statistical approach to language analysis. In fact, Fagan’s work demonstrated early on what is now generally accepted: statistical methods that do not assume any knowledge of word meaning or linguistic role are surprisingly (some would say depressingly) hard to beat for retrieval performance.

Since then there have been a number of attempts to incorporate NLP in IR, but depending on the task involved, there can be a lack of highly accurate methods for automatic linguistic analysis of text that are also robust enough to handle unexpected and complex language constructions. (There are exceptions, for example, part-of-speech tagging is highly accurate.) The result is that improved retrieval performance is often offset by negative effects, resulting in a minimal positive, or even a negative impact on overall performance. This makes NLP techniques not worth the price of additional computational overheads in time and data storage.

However, just because the task isn’t easy doesn’t mean we should give up. Researchers, including myself, are looking afresh at ways to incorporate NLP into IR. This is being encouraged by organizations such as the NII Test Collection for IR Systems Project (NTCIR), who from 2003 to 2006 amassed excellent test and training data for patent retrieval with corpora in Japanese and English and queries in five languages. Their focus has recently shifted towards NLP tasks associated with retrieval, such as patent data mining and classification. Further, their corpora enable study of cross-language retrieval issues that become important in e-discovery, since only a fraction of a global corporation’s electronic information will be in English.

We stand on the threshold of what will be a period of rapid innovation in legal search driven by the integration of existing knowledge bases with emerging full-text processing technologies. Let’s explore the options.

K. Tamsin Maxwell is a PhD candidate in informatics at the University of Edinburgh, focusing on information retrieval in law. She has an MSc in cognitive science and natural language engineering, and has given guest talks at the University of Bologna, UMass Amherst, and NAIST in Japan. Her areas of interest include text processing, machine learning, information retrieval, data mining and information extraction. Her passion outside academia is Arabic dance.

VoxPopuLII is edited by Judith Pratt.

Most legal publishers, both free and fee, are primarily concerned with content. Regardless of whether they are academic or corporate entities providing electronic access to monographs, the free providers of the world giving primary source access, Westlaw or Lexis (hereinafter Wexis) providing access to both primary and secondary sources, or any other legal information deliverer, content has ruled the day. The focus has remained on the information in the database. The content. The words themselves.

If trends remain stable, primary source content, at least among politically stable jurisdictions, will be a given. Everyone will have equal access to the laws, regulations, and court decisions of their country online. In the U.S., new free open source access points are emerging every day. Here, the public currently has their choice of LII, Justia, Public Library of Law, AltLaw, FindLaw, PreCYdent, and most recently, OpenJurist, to discover the law. And hopefully, that content will be official and authentic.

The issue then refocuses to secondary sources and user interfaces. These will be where the battle lines will be drawn among legal publishers. Both assist in making meaning out of primary sources, though in fundamentally different ways. Secondary sources explain, analyze, and provide commentary on the law. They can be highly academic and esoteric, or provide nuts and bolts instructions and guidance. They also include finding aids to primary sources, like annotations to statutes, indexes, headnotes, citator services, and the like. While access to government-produced primary sources is a right, access to secondary sources is not, although for lay persons and lawyers alike, primary sources alone are typically insufficient to fully understand the law. I leave the not insignificant issue of secondary sources for another day, and focus here on content access and the user interface.

“The eye is the best way for the brain to understand the world around us.”

— Quote reported identically by multiple users on Twitter from a recent talk by Dr. Ben Shneiderman at the #nycupa.

Despite the advances made in adding legal online content, equal attention has not been given to how users may optimally access that content to fulfill their information-seeking needs. We continue to use the same basic Boolean search parameters that we have used for nearly fifty years. We continue to presume that sorting through search result lists is the best way to find information. We continue to presume that research is simply a matter of finding the right keywords and plugging them into a search box. We presume wrong. Even though keyword searching is beloved by many because it provides the illusion of working, it consistently fails.

There is, in fact, another method of finding information that is inherently contextual, and that educates the user contemporaneously with the discovery process. This method is called browsing. Wexis, through their user interfaces, encourage searching over browsing because they are profit centers whose essential product is the search. It is commonly assumed that their product is the database, i.e. the content, because they negotiate access to specific databases with their customers. And while some databases are worth more than others, they charge by the number of searches, not by the number of documents retrieved, not by the amount of content extracted. (This describes the transactional costs, which are probably most frequently employed. Of course, the per search charge varies by database. Users may alternatively choose to be charged by time instead.)

Therefore, their profits are maximized by creating a search product that is not too good and not too bad. They are, in fact, rewarded for their search mediocrity. If it is too good, users will find what they need too quickly, decreasing the number of searches and amount of time spent researching, and profits will decline. If it is too bad, users will get frustrated, complain, and, perhaps eventually, try a different vendor. Though with our current two-party system, there is little real choice for legal professionals who have sophisticated legal research needs not satisfied by the open access options available. (And then there is the distasteful possibility that law firms themselves want to keep legal research costs inflated to serve as their own profit centers.)

As such, Wexis will not be optimally motivated to improve their user interfaces and enhance the information-seeking process to increase efficiency for their customers. This leaves the door wide open for others in the online legal information ecology to innovate and force needed change, create a better product themselves, and apply pressure on the Ferraris and Lamborghinis of the legal world to do the same.

“A picture is worth a thousand words. An interface is worth a thousand pictures.”

— Quote reported identically by multiple users on Twitter from a recent talk by Dr. Ben Shneiderman at the #nycupa.

The time is ripe to create a new information discovery paradigm for legal materials based on semantics. Outside the legal world, advances are being made in more contextual information discovery platforms. Instead of a user issuing keywords and a computer server spitting back results, adjusting input via trial and error ad infinitum, graphic interfaces allow the user to comprehend and adjust their conception and results visually with related parameters. These interfaces encourage an environment where research is more like a conversation between the researcher and the data, rather than dueling monologues.

Lee Rainie, Director of the Pew Internet & American Life Project, recently discussed the emerging information literacies relevant to the evolving online ecology. These literacies should inform how search engines adapt themselves to human needs. Their application in the legal world is a natural fit. Four literacies most applicable to legal research include:

Graphic Literacy. People think visually and process data better with visual representations of information. Translation: make database interfaces and search results graphic.

Navigation Literacy. People have to maneuver online information in a disorganized nonlinear text screen. This creates comprehension and memory problems. We want our lawyers and legal researchers to have good comprehension and memory when serving clients.

Skepticism Literacy. Normally referring to basic critical thinking skills, this should apply to critically assessing user interfaces, particularly in a profit-seeking environment like Wexis where the interface can affect how and what you search, as well as your wallet.

Context Literacy. People need to see connections both between and within information in a hyperlinked environment. Simply providing hyperlinks is good, but graphically visualizing the connections is better.

Some subscription databases and internet search features serve these literacies well. Many of these are in early stages and not necessarily fit for legal research, but can give an idea of possibilities. I’ll discuss a few, and consider how these might apply in the legal context.

Google has recently re-released their wonder wheel, which helps users figure out what they are looking for. This is a frequent stumbling block for novices to legal research, and even for seasoned attorneys faced with a new subject. The researcher simply doesn’t know enough to know what exactly to look for. A tool like this helps the researcher find terms and concepts that they might not have otherwise considered (of course, secondary sources are excellent for this as well). Pictured here, the small faded hub at the bottom was for my original search of “legal research.” I then clicked on the “legal research methodology” spoke which expanded above the first wheel with different spokes and further ideas.

A common problem with keyword searching is finding the right words in the correct combination that exemplify a concept and are not over- or under-inclusive. Wexis offers thesauri which can be helpful, though they require actual searching to test. Some free sites, like PreCYdent, have this feature as well. They work to greater and lesser degrees. A recent search for “Title VII sex retaliation” resulted in a suggestion to also search for Title III, which is clearly not my intended subject. And while helpful, thesauri and other word and concept suggestors are still tied to the search paradigm which we want to move away from.

Factiva is a subscription database provider supplying news and business information. It provides a graphical “discovery pane” with “intelligent indexing” that clusters results by subjects related to search terms. This allows the user to select the most relevant results to their purpose. It also features word clouds (not pictured here) with text size indicating prominence of these terms in search results. Date graphs indicate when search results were published, so the user can visually assess when a topic is most frequently covered in the news.

Subject-based indexing is an excellent contextual tool to guide the user to relevant content without searching. Legal context literacy is supported by indexes to subject-based compilations, such as statutes and regulations. It’s great to have the full text of statutes available for free online, but some kind of subject-based entry port to that collection is needed to render it maximally useful. For databases like these, given the non-natural language used by legislators and lobbyists alike in constructing laws, keyword searching is frequently an inefficient and frustrating discovery method. Currently, Westlaw is the only legal information provider that provides online subject indexing to state and federal codes (though they like to hide that fact in their interface because their product is the search, not the content).

Weighting words, graphically represented by the size of the term, is another method users can employ to improve their results with keyword searching. Factiva uses weighted word clouds to indicate the frequency of terms in search results. SearchCloud allows users to manually weight search terms to indicate their importance within the search and adjusts results accordingly. For example, a researcher may need to find documents with five different words in them, but three are essential in symbolizing the idea sought, and the other two are needed, but not as important. As pictured here, I searched for copyright legal research guides, giving most importance to the words copyright and guide, and less to the words legal and research to ensure that I retrieved guides on copyright and not just any list of research guides that might mention copyright, and that it was in fact a legal research guide and not some other document that just mentions the word guide. Results were significantly more relevant here than the same un-weighted search on Google.
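For readers curious how user-assigned weights might change a ranking, here is a rough Python sketch. It is not SearchCloud's actual method, which I have not seen; it simply multiplies each term's count in a document by the weight the user gave that term. The documents and the weights are invented.

```python
from collections import Counter

def weighted_score(text, weights):
    """Sum of (user weight x number of occurrences) over the query terms."""
    counts = Counter(text.lower().split())
    return sum(weight * counts[term] for term, weight in weights.items())

# Hypothetical weights: "copyright" and "guide" matter most.
weights = {"copyright": 3, "guide": 3, "legal": 1, "research": 1}

# Hypothetical documents.
docs = {
    "a": "copyright legal research guide for copyright questions",
    "b": "legal research guide list with legal research links",
    "c": "copyright news",
}

ranked = sorted(docs, key=lambda d: weighted_score(docs[d], weights), reverse=True)
print(ranked)
```

With equal weights, documents "a" and "b" would tie at five matching words each. The heavier weights on copyright and guide pull the copyright research guide ahead of the generic list of research guides.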

Weighted words can easily be employed in legal research. For example, with case law search results and citator reports, instead of a list of cases and other documents arranged either by date, jurisdiction, or algorithmic relevancy, citator information can be graphically indicated. Cases that are cited the most would appear near the top of the list in the largest fonts. Cases cited the least would appear in a smaller font at the bottom of the list. It adds immediate meaning-making visual cues to an otherwise non-contextual list, letting the researcher know at a glance which are the most important cases.
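The display described above takes very little code. The Python sketch below scales a font size linearly between a minimum and a maximum according to how often each case is cited. The case names, citation counts and point sizes are all made up, and a real interface might well prefer a logarithmic scale.

```python
def font_sizes(cite_counts, min_pt=10, max_pt=28):
    """Map each case's citation count to a font size between min_pt and max_pt."""
    lo, hi = min(cite_counts.values()), max(cite_counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        case: round(min_pt + (count - lo) / span * (max_pt - min_pt))
        for case, count in cite_counts.items()
    }

# Hypothetical citator data: number of times each case has been cited.
cite_counts = {"Case A": 120, "Case B": 40, "Case C": 0}

sizes = font_sizes(cite_counts)
# List the most-cited cases first, as in the display described above.
for case in sorted(cite_counts, key=cite_counts.get, reverse=True):
    print(f"{case}: {sizes[case]}pt")
```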

It would be a boon to researchers if the connection between results was made apparent graphically. KartOO attempts this with their search engine which links various web pages in results with associated terms and similar pages. Mousing over links allows the user a preliminary peek at the search result to further determine its relevancy. The benefits to lawyers for this type of graphic display of search results for cases could be enormous. To be able to tell at a glance how a body of law is interconnected would give immeasurable context and meaning to what would otherwise be a simple list, each result visually disconnected from the other.

Some type of contextual map like the wonder wheel or a concept chart like KartOO, potentially combined with weighted words, could be employed that would illustrate the interconnectedness between all the cites to the case at issue, or to search results of cases. The biggest, most precedential, most frequently inter-cited cases would live near the center of the web with large hubs, less important cases would live at the peripheries. Most cases are never cited and are jurisprudentially less significant. This should be made clear through visual cues. Westlaw just launched something similar for patents.
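The raw material for such a map is just a list of who cites whom. As a rough Python sketch, with invented case names and citations, counting how often each case is cited by the others gives an ordering from hub to periphery and picks out the cases that are never cited at all. A production system would likely use a richer measure of importance than a simple count.

```python
from collections import Counter

# Hypothetical (citing case, cited case) pairs.
citations = [
    ("D", "A"), ("C", "A"), ("B", "A"),
    ("D", "B"), ("C", "B"),
    ("E", "C"),
]

# Number of times each case is cited by another case in the set.
times_cited = Counter(cited for _, cited in citations)

cases = sorted({case for pair in citations for case in pair})

# Most-cited cases first (the hub); ties broken alphabetically.
center_to_edge = sorted(cases, key=lambda c: (-times_cited[c], c))
never_cited = [c for c in cases if times_cited[c] == 0]

print(center_to_edge)
print(never_cited)
```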

These are just a few examples, based on developing technology, of how the legal search paradigm might develop. The beauty of our legal corpus is its fundamental interconnectedness. The web of cites within and between documents gives semantic developers a preconstructed map of relevancy and importance so that they need only create a way to symbolize that pattern graphically.

“Semantics rule, keywords drool.”

— Quote at See also

The future of legal information discovery interfaces combines searching and browsing, text and context, graphics and metadata. Because content without meaning thwarts understanding. Laws without context do not serve democracy. We need “interactive discovery.” Which is why search result lists are dead to me.

Julie Jones, formerly a librarian at the Cornell Law School, is the “rising” Associate Director for Library Services at the University of Connecticut Law School, beginning later this month. She received her J.D. from Northwestern University School of Law, M.L.I.S. from Dominican University, and B.A. from U.C. Santa Barbara.

VoxPopuLII is edited by Judith Pratt