[Editor’s Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog on “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective of the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews − everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting − refer to techniques that are now 10, 15, even 20 years old. She also refers to the release of the first natural language legal search engine by West back in 1993−WIN (Westlaw Is Natural) [2]. Adding to this on-going conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. The objective we hope to achieve in this posting is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that, unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext, separate the function of document retrieval from document ranking. Whereas the first function, retrieval, primarily addresses recall, ensuring that all potentially relevant documents are retrieved, the second function focuses on the ideal ranking of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two search functions as one and the same. So what is the difference? The document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993. What is dramatically different is the evidence considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.
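The split between retrieval and ranking can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than a description of WestlawNext internals: the function names, the in-memory corpus, and the toy term-overlap scorer are all hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real text analysis."""
    return text.lower().split()


def retrieve(query, corpus):
    """Recall-oriented stage: keep every document that shares a term with the query."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in corpus.items() if terms & set(tokenize(text))]


def term_overlap(query, text):
    """Toy ranking score: fraction of the document's words that are query terms."""
    terms = set(tokenize(query))
    words = tokenize(text)
    return sum(1 for w in words if w in terms) / len(words) if words else 0.0


def search(query, corpus, score=term_overlap):
    """Retrieve candidates first, then order them with a separate scoring function."""
    candidates = retrieve(query, corpus)
    return sorted(candidates, key=lambda d: score(query, corpus[d]), reverse=True)
```

Because the scoring function is passed in separately, the ranking stage can be swapped for a richer one without touching retrieval, which is the point of keeping the two functions apart.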

Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question. Frequently, the granularity of these citations is not merely at the document level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on the citation frequency, but also the polarity and scope of the arguments those documents make.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]). (See Figure 1.) This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of gauging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.
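As a rough illustration of the user view, the tabulation described above could look like the following sketch. The action names and their weights are hypothetical, and the events deliberately carry no user identifier, so only aggregate behavior is visible.

```python
from collections import Counter, defaultdict

# Hypothetical weights: how strongly each kind of interaction signals relevance.
ACTION_WEIGHTS = {"open": 1.0, "print": 2.0, "keycite": 3.0}


def aggregate_popularity(events):
    """Sum weighted interactions per document.

    `events` is an iterable of (doc_id, action) pairs. Actions without a
    configured weight contribute nothing to the total.
    """
    counts = Counter(events)
    popularity = defaultdict(float)
    for (doc_id, action), n in counts.items():
        popularity[doc_id] += ACTION_WEIGHTS.get(action, 0.0) * n
    return dict(popularity)
```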

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowing what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.
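To make the idea of “teaching” the ranker concrete, here is a minimal sketch that learns one weight per feature from pairs of documents in which the first was judged more relevant than the second. The pairwise perceptron used here is a deliberately simple stand-in chosen for illustration; the text above does not say which learning method WestlawNext uses, and the feature names are hypothetical.

```python
# Hypothetical feature order: one value per "view" of the evidence.
FEATURES = ["text_match", "annotation_match", "citation_score", "user_popularity"]


def score(weights, features):
    """Weighted sum of a document's feature values."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn weights so that the preferred document in each pair scores higher.

    `pairs` is a list of (better_features, worse_features) tuples.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            # Update only when the current weights order the pair wrongly (or tie).
            if score(weights, better) <= score(weights, worse):
                weights = [w + lr * (b - x) for w, b, x in zip(weights, better, worse)]
    return weights
```

After training, the same `score` function orders the retrieved candidates, so the learned weights express which forms of evidence mattered most in the training judgments.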

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drown in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option, which provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.
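The issue-based grouping behind such a recommendation can be sketched in a few lines. This assumes the hard work has already been done, namely that issues have been detected for each document and that clusters of documents exist per issue [6,7]; the function and parameter names are hypothetical.

```python
def related_materials(source_doc, doc_issues, clusters, limit=3):
    """Return recommended documents grouped under each issue of `source_doc`.

    `doc_issues` maps a document ID to its detected issues, most important first.
    `clusters` maps an issue label to the documents associated with that issue.
    """
    grouped = {}
    for issue in doc_issues.get(source_doc, []):
        # Do not recommend the document the researcher is already reading.
        members = [d for d in clusters.get(issue, []) if d != source_doc]
        grouped[issue] = members[:limit]
    return grouped
```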

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.

Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.

Figure 3: A view of a case and complementary materials from WestlawNext

When the user selects one of these relevant topics, a set of recommended cases is rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.

Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.

References

[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009. http://blog.law.cornell.edu/voxpop/2009/09/17/pushing-the-envelope-innovation-in-legal-search/
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West’s Key Number System: http://info.legalsolutions.thomsonreuters.com/pdf/wln2/L-374484.pdf
[4] West’s KeyCite Citator Service: http://info.legalsolutions.thomsonreuters.com/pdf/wln2/L-356347.pdf
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Built-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West’s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience applying NLP technologies to a variety of data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of the WestlawNext research team. He has a Ph.D. in computer science and engineering from the State University of New York at Buffalo. He is now a managing associate at Kore Federal in the Washington, D.C. area.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

At CourtListener, we are making a free database of court opinions with the ultimate goal of providing the entire U.S. case-law corpus to the world for free and combining it with cutting-edge search and research tools. We, like most readers of this blog, believe that for justice to truly prevail, the law must be open and equally accessible to everybody.

It is astonishing to think that the entire U.S. case-law corpus is not currently available to the world at no cost. Many have started down this path and stopped, so we know we’ve set a high goal for a humble open source project. From time to time it’s worth taking a moment to reflect on where we are and where we’d like to go in the coming years.

The current state of affairs

We’ve created a good search engine that can provide results based on a number of characteristics of legal cases. Our users can search for opinions by the case name, date, or any text that’s in the opinion, and can refine by court, by precedential status or by citation. The results are pretty good, but are limited based on the data we have and the “relevance signals” that we have in place.

A good legal search engine will use a number of factors (a.k.a. “relevance signals”) to promote documents to the top of their listings. Things like:

  • How recent is the opinion?
  • How many other opinions have cited it?
  • How many journals have cited it?
  • How long is it?
  • How important is the court that heard the case?
  • Is the case in the jurisdiction of the user?
  • Is the opinion one that the user has looked at before?
  • What was the subsequent treatment of the opinion?

And so forth. All of the above help to make search results better, and we’ve seen paid legal search tools make great strides in their products by integrating these and other factors. At CourtListener, we’re using a number of the above, but we need to go further. We need to use as many factors as possible, and we need to learn how the factors interact with each other, which ones are the most important, and which lead to the best results.
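To show how a handful of these signals might be combined, here is a minimal sketch with hand-picked weights. It is not CourtListener’s actual scoring code: the `Opinion` fields, the weights, and the `relevance_boost` function are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Opinion:
    name: str
    year: int
    cited_by: int        # number of opinions citing this one
    court_rank: float    # hypothetical 0..1 importance of the issuing court
    jurisdiction: str


def relevance_boost(op, user_jurisdiction, current_year=2013):
    """Combine several relevance signals into one number (weights are made up)."""
    recency = 1.0 / (1 + max(0, current_year - op.year))
    citations = math.log1p(op.cited_by)  # damp very large citation counts
    local = 1.0 if op.jurisdiction == user_jurisdiction else 0.0
    return 0.2 * recency + 0.5 * citations + 0.2 * op.court_rank + 0.1 * local


def rank_opinions(opinions, user_jurisdiction):
    """Order opinions from highest to lowest boost."""
    return sorted(opinions, key=lambda o: relevance_boost(o, user_jurisdiction), reverse=True)
```

Hand-set weights like these are exactly what one would want to replace once there is enough data to learn how the factors interact.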

A different problem we’re working to solve at CourtListener is getting primary legal materials freely onto the Web. What good is a search engine if the opinion you need isn’t there in the first place? We currently have about 800,000 federal opinions, including West’s second and third Federal Reporters, F.2d and F.3d, and the entire Supreme Court corpus. This is good and we’re very proud of the quality of our database–we think it’s the best free resource there is. Every day we add the opinions from the Circuit Courts in the federal system and the U.S. Supreme Court, nearly in real-time. But we need to go further: we need to add state opinions, and we need to add not just the latest opinions but all the historical ones as well.

This sounds daunting, but it’s a problem that we hope will be solved in the next few years. Although it’s taking longer than we would like, we are confident that in time all of the important historical legal data will make its way to the open Internet. Primary legal sources are already in the public domain, so now it’s just a matter of getting them into good electronic formats so that anyone can access and re-use them. If an opinion exists only as an unsearchable scan, in bound books, or behind a pricey pay wall, then it’s closed to many people who should have access to it. As part of our citation identification project, which I’ll talk about next, we’re working to get the most important documents properly digitized.

Our citation identification project was developed last year by U.C. Berkeley School of Information students Rowyn McDonald and Karen Rustad to identify and cross-link any citations found in our database. This is a great feature that makes all the citations in our corpus link to the correct opinions, if we have them. For example, if you’re reading an opinion that has a reference to Roe v. Wade, you can click on the citation, and you’ll be off and reading Roe v. Wade. By the way, if you’re wondering how many Federal Appeals opinions cite Roe v. Wade, the number in our system is 801 opinions (and counting). If you’re wondering what the most-cited opinion in our system is, you may be bemused: With about 10,000 citations, it’s an opinion about ineffective assistance of legal counsel in death penalty cases, Strickland v. Washington, 466 U.S. 668 (1984).
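The basic idea of spotting a citation and resolving it to an opinion can be sketched with a regular expression. This is a simplified illustration, not the project’s actual citator: the pattern only covers the three reporters named in this post (U.S., F.2d and F.3d), and the `index` lookup table and its opinion IDs are hypothetical.

```python
import re

# Matches "volume reporter page", e.g. "466 U.S. 668" or "123 F.3d 456".
CITATION_RE = re.compile(r"\b(\d{1,3})\s+(U\.S\.|F\.2d|F\.3d)\s+(\d{1,4})\b")


def find_citations(text):
    """Return normalized citation strings in the order they appear."""
    return [" ".join(m.groups()) for m in CITATION_RE.finditer(text)]


def resolve_citations(text, index):
    """Map each citation found in `text` to an opinion ID, or None if it is missing."""
    return {cite: index.get(cite) for cite in find_citations(text)}
```

Citations that resolve to `None` are the gaps in the corpus: opinions that are cited but not yet in the database.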

A feature we’ll be working on soon will tie into our citation system to help us close any gaps in our corpus. Once the feature is done, whenever an opinion is cited that we don’t yet have, our users will be able to pay a small amount–one or two dollars–to sponsor the digitization of that opinion. We’ll do the work of digitizing it, and after that point the opinion will be available to the public for free.

This brings us to the next big feature we added last year: bulk data. Because we want to assist academic researchers and others who might have a use for a large database of court opinions, we provide free bulk downloads of everything we have. Like Carl Malamud’s Resource.org (to whom we owe a great debt for his efforts to collect opinions and provide them to others for free, and for his direct support of our efforts), we have giant files you can download that provide thousands of opinions in computer-readable format. These downloads are available by court and date, and include thousands of fixes to the Resource.org corpus. They also include something you can’t find anywhere else: the citation network. As part of the metadata associated with each opinion in our bulk download files, you can see which opinions it cites as well as which opinions cite it. This provides a valuable new source of data that we are very eager for others to work with. Of course, as new opinions are added to our system, we update our downloads with the new citations and the new information.
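For anyone curious what working with a citation network involves, here is a small sketch that turns “which opinions does this one cite” into “which opinions cite this one” and then finds the most-cited opinion. The input format is an assumption made for illustration and is not the layout of the bulk files.

```python
from collections import defaultdict


def build_cited_by(cites):
    """Invert an opinion -> cited-opinions mapping into cited-opinion -> citing-opinions."""
    cited_by = defaultdict(list)
    for citing, cited_list in cites.items():
        for cited in cited_list:
            cited_by[cited].append(citing)
    return dict(cited_by)


def most_cited(cites):
    """Return (opinion, count) for the opinion with the most in-bound citations."""
    cited_by = build_cited_by(cites)
    if not cited_by:
        return None
    top = max(cited_by, key=lambda op: len(cited_by[op]))
    return top, len(cited_by[top])
```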

Finally, we would be remiss if we didn’t mention our hallmark feature: daily, weekly and monthly email alerts. For any query you put into CourtListener, you can request that we email you whenever there are new results. This feature was the first one we created, and one that we continue to be excited about. This year we haven’t made any big innovations to our email alerts system, but its popularity has continued to grow, with more than 500 alerts run each day. Next year, we hope to add a couple small enhancements to this feature so it’s smoother and easier to use.
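At its core, an alert is a saved query that is re-run on a schedule and reports only results it has not reported before. The sketch below illustrates that idea under simplifying assumptions: the `Alert` class and `run_alert` function are hypothetical, matching is plain all-terms word matching, and actually sending the email is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    email: str
    query: str
    seen: set = field(default_factory=set)  # opinion IDs already reported


def run_alert(alert, opinions):
    """Return IDs of opinions matching the saved query that were not reported before.

    `opinions` maps an opinion ID to its text.
    """
    terms = alert.query.lower().split()
    hits = [
        oid for oid, text in opinions.items()
        if all(t in set(text.lower().split()) for t in terms)
    ]
    new = [oid for oid in hits if oid not in alert.seen]
    alert.seen.update(new)
    return new
```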

The future

I’ve hinted at a lot of our upcoming work in the sections above, but what are the big-picture features that we think we need to achieve our goals?

We do all of our planning in the open, but we have a few things cooking in the background that we hope to eventually build. Among them are ideas for adding oral argument audio, case briefs, and data from PACER. Adding these new types of information to CourtListener is a must if we want to be more useful for research purposes, but doing so is a long-term goal, given the complexity of doing them well.

We also plan to build an opinion classifier that could automatically, and without human intervention, determine the subsequent treatment of opinions. Done right, this would allow our users to know at a glance if the opinion they’re reading was subsequently followed, criticized, or overruled, making our system even more valuable to our users.
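We have not built this classifier, so the following is only a naive baseline to show the shape of the problem: it looks for cue phrases in the passage where a later opinion discusses an earlier one. The cue lists and labels are hypothetical, and a usable classifier would need far more than keyword matching (a phrase like “for the following reasons” would fool this one).

```python
# Hypothetical cue phrases, checked from most to least severe treatment.
TREATMENT_CUES = [
    ("overruled", ("overrule", "overruling")),
    ("criticized", ("decline to follow", "disagree with", "criticiz")),
    ("followed", ("we follow", "following", "in accordance with")),
]


def classify_treatment(citing_passage):
    """Guess how a citing passage treats the cited opinion, or return 'unknown'."""
    text = citing_passage.lower()
    for label, cues in TREATMENT_CUES:
        if any(cue in text for cue in cues):
            return label
    return "unknown"
```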

In the next few years, we’ll continue building out these features, but as an open-source and open-data project, everything we do is in the open. You can see our plans on our feature tracker, our bugs in our bug tracker, and can get in touch in our forum. The next few years look to be very exciting as we continue building our collection and our platform for legal research. Let’s see what the new year brings!

 

Michael Lissner is the co-founder and lead developer of CourtListener, a project that works to make the law more accessible to all. He graduated from U.C. Berkeley’s School of Information, and when he’s not working on CourtListener he develops search and eDiscovery solutions for law firms. Michael is passionate about bringing greater access to our primary legal materials, about how technology can replace old legal models, and about open source, community-driven approaches to legal research.

 

Brian W. Carver is Assistant Professor at the U.C. Berkeley School of Information, where he does research on and teaches about intellectual property law and cyberlaw. He is also passionate about the public’s access to the law. In 2009 and 2010 he advised an I School Masters student, Michael Lissner, on the creation of CourtListener.com, an alert service covering the U.S. federal appellate courts. After Michael’s graduation, he and Brian continued working on the site and have grown the database of opinions to include over 750,000 documents. In 2011 and 2012, Brian advised I School Masters students Rowyn McDonald and Karen Rustad on the creation of a legal citator built on the CourtListener database.

 

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.