
[Editor’s Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog post “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective on the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews, everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting, refer to techniques that are now 10, 15, even 20 years old. She also refers to West’s 1993 release of the first natural language legal search engine, WIN (Westlaw Is Natural) [2]. Adding to this ongoing conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. Our objective in this post is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that, unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext, separate the function of document retrieval from document ranking. The first function, retrieval, primarily addresses recall, ensuring that all potentially relevant documents are retrieved; the second, ranking, focuses on the ideal ordering of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two functions as one and the same. So what is the difference? The document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993. What is dramatically different is the evidence considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.
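To make the retrieval/ranking split concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how WestlawNext is implemented: the `Doc` class, the feature names (`term_overlap`, `citations`, `user_views`), the hand-picked weights, and the toy corpus are all hypothetical. Stage one keeps every document that shares any term with the query (favoring recall); stage two orders those candidates by a weighted sum of several features (favoring precision at the top).

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    # Hypothetical per-document evidence scores in [0, 1]; names are illustrative.
    features: dict = field(default_factory=dict)


def retrieve(query, corpus):
    """Stage 1 (recall-oriented): keep every document sharing any query term."""
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.text.lower().split())]


def rank(query, candidates, weights):
    """Stage 2 (precision-oriented): order candidates by a weighted feature sum."""
    terms = set(query.lower().split())

    def score(doc):
        words = set(doc.text.lower().split())
        feats = dict(doc.features)
        # Text match is only one feature among several.
        feats["term_overlap"] = len(terms & words) / max(len(terms), 1)
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())

    return sorted(candidates, key=score, reverse=True)


# Toy corpus with made-up evidence scores.
corpus = [
    Doc("A", "church endorsement of political candidates", {"citations": 0.9, "user_views": 0.7}),
    Doc("B", "church property tax exemption", {"citations": 0.2, "user_views": 0.1}),
    Doc("C", "support for candidates in state elections", {"citations": 0.4, "user_views": 0.9}),
    Doc("D", "maritime salvage rights", {"citations": 0.8, "user_views": 0.8}),
]
# Assumed weights; a real system would learn these rather than set them by hand.
weights = {"term_overlap": 0.5, "citations": 0.3, "user_views": 0.2}

query = "church support for candidates"
candidates = retrieve(query, corpus)   # D shares no query term and is dropped here
ranked = rank(query, candidates, weights)
print([d.doc_id for d in ranked])      # ['C', 'A', 'B']
```

Note that document D, despite strong citation and usage scores, never reaches the ranking stage because it fails retrieval. That is the sense in which the two functions are separate.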


Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question. Frequently, the granularity of these citations is not merely at the document level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on its citation frequency, but also the polarity and scope of the arguments those documents make.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]) (see Figure 1). This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of leveraging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.
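As a rough illustration of how such a user view might be tabulated, consider the following Python sketch. Everything specific in it is an assumption made for the example: the log format, the action names, the action weights, and the minimum-evidence threshold. The log deliberately carries no user identifier, mirroring the point that only aggregated behavior is considered.

```python
from collections import Counter, defaultdict

# Hypothetical anonymized interaction log: (query_type, doc_id, action).
events = [
    ("negligence", "doc1", "open"),
    ("negligence", "doc1", "open"),
    ("negligence", "doc1", "print"),
    ("negligence", "doc1", "keycite"),
    ("negligence", "doc2", "open"),
]

# Assumed relative weights: printing or validity-checking a document is
# treated as a stronger relevance signal than merely opening it.
ACTION_WEIGHTS = {"open": 1.0, "print": 2.0, "keycite": 3.0}

# Assumed threshold standing in for "sufficient accumulated evidence".
MIN_EVENTS = 3


def popularity(events):
    """Aggregate interactions into a popularity score per (query_type, doc_id)."""
    counts = defaultdict(Counter)
    for query_type, doc_id, action in events:
        counts[(query_type, doc_id)][action] += 1

    scores = {}
    for key, actions in counts.items():
        if sum(actions.values()) < MIN_EVENTS:
            continue  # too little evidence; do not emit a score
        scores[key] = sum(ACTION_WEIGHTS[a] * n for a, n in actions.items())
    return scores


scores = popularity(events)
print(scores)  # {('negligence', 'doc1'): 7.0}
```

Here doc2 receives no score at all, because a single interaction falls below the evidence threshold. A score like this could then serve as one more feature in the ranking stage.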

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowing what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.
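The idea of “teaching” a ranker how to weigh its features can be sketched with a very small pairwise learner. This is a generic textbook technique (a perceptron trained on preference pairs), chosen here only for brevity; the source does not say which learning algorithm WestlawNext uses. The three features and the training pairs below are hypothetical.

```python
def dot(w, x):
    """Weighted sum of feature values: the ranking score of one document."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn feature weights from (better_doc, worse_doc) feature-vector pairs.

    Whenever the current weights fail to score the better document strictly
    higher, the weights are nudged toward the difference of the two vectors.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if dot(w, diff) <= 0:  # pair is mis-ordered or tied
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


# Hypothetical features per document: [text_match, citation_score, user_views].
# Each pair says: for some query, the first document should outrank the second.
pairs = [
    ([0.9, 0.2, 0.1], [0.3, 0.8, 0.2]),
    ([0.7, 0.5, 0.9], [0.6, 0.5, 0.1]),
    ([0.8, 0.1, 0.6], [0.2, 0.9, 0.5]),
]

w = train_pairwise(pairs, n_features=3)
for better, worse in pairs:
    print(dot(w, better) > dot(w, worse))  # True for every training pair
```

In this toy data the learned weights favor text match and user views over raw citation score, simply because that is what the made-up preference pairs imply. With different training judgments, the weights would shift accordingly.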

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.
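The grouping step can be pictured with a short Python sketch. It assumes that some upstream process has already associated each document with scored issue clusters; how that is done is the subject of [6,7] and is not reproduced here. The document IDs, issue labels, scores, and the `related_materials` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical association scores between documents and legal-issue clusters.
doc_issues = {
    "case_1": {"attractive nuisance": 0.9, "premises liability": 0.6, "damages": 0.2},
    "case_2": {"attractive nuisance": 0.8, "damages": 0.5},
    "case_3": {"premises liability": 0.7},
    "brief_1": {"attractive nuisance": 0.4, "premises liability": 0.3},
}


def related_materials(source_id, doc_issues, top_issues=2, per_issue=3):
    """Group recommendations under the source document's top-scoring issues."""
    # Invert the mapping: issue -> [(score, doc_id)], excluding the source itself.
    by_issue = defaultdict(list)
    for doc_id, issues in doc_issues.items():
        if doc_id == source_id:
            continue
        for issue, score in issues.items():
            by_issue[issue].append((score, doc_id))

    # Pick the source document's strongest issues (primary, secondary, ...).
    source = doc_issues[source_id]
    main_issues = sorted(source, key=source.get, reverse=True)[:top_issues]

    # Under each issue, list the most strongly associated other documents.
    return {
        issue: [doc_id for _, doc_id in sorted(by_issue[issue], reverse=True)[:per_issue]]
        for issue in main_issues
    }


recs = related_materials("case_1", doc_issues)
print(recs)
# {'attractive nuisance': ['case_2', 'brief_1'], 'premises liability': ['case_3', 'brief_1']}
```

A document such as brief_1 can appear under more than one issue, which reflects the multi-topical nature of legal documents noted above.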

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drown in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option that subsequently provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.


Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.


Figure 3: A view of a case and complementary materials from WestlawNext

When the user selects one of these related topics, a set of recommended cases is rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.


Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.


[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009.
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West’s Key Number System:
[4] West’s KeyCite Citator Service:
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Built-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West’s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience applying NLP technologies to a variety of data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of the WestlawNext research team. He has a Ph.D. in computer science and engineering from the State University of New York at Buffalo. He is now a managing associate at Kore Federal in the Washington, D.C. area.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

This blog entry focuses on the need for more and better software to reap the benefits of the legal information treasures available. As you’ll see, this turns out to be more complex than one may think.

For commercial software developers, it is surprisingly hard to stay radically innovative, especially when they are successful. To start with, software development itself is a risky undertaking. Despite five decades of research in managing this development process, projects frequently are late, over budget, and much less impressive than originally envisioned. IBM once famously bet the company on a new computer platform, but the development of the associated operating system was so far behind schedule that it threatened IBM’s existence. Management was tempted to throw ever more human resources at the development problem, only to discover that this in itself causes further delays, leaving us with the useful term “mythical man-month”.

But the difficulty in envisioning hurdles in a complex software engineering project is not the only source of risk for innovative software developers. Successful developers may pride themselves on a large and increasing user base. Such success, however, creates its own unintended constraints.

Customers will dislike rapid change in the software they use, as they will have to relearn how to operate it, may have to expend efforts on converting data to new formats, and/or may need to adjust the preferences and customization options they utilized. This gets worse if the successful software is the platform for a thriving ecosystem of other developers and service providers. Any severe change in the underlying platform means that those living in it have to adapt their code. Each time a customer has to invest time in relearning a software product, it offers competing software providers a chance to nab a customer. This prompts software developers, especially very successful ones, to be relatively conservative in their plans for updates and upgrades. They don’t want to undermine their market success, and thus will be tempted to opt for gradual rather than radical innovation when designing the next version of their successful wares.

We have seen it over and over again: Microsoft’s Word, Powerpoint and Excel have gone through numerous iterations over the past decades, but the basic elements of the user experience have changed relatively little. Similarly, concerns for legacy code by third party developers have been a key holdback for Microsoft’s Windows product team. Don’t break something  –  even if it is utterly ancient and inefficient, buggy and broken  –  as long as it works for the customers.  That’s the understandable, but frustrating, mantra.

Or think of Google: the search engine’s user interface hasn’t seen any major changes since its inception more than a decade ago. Only Apple, it seems, has been getting away with radical innovation that breaks things and forces users to relearn, to convert data, and to expend time. That is the advantage of a small but fervently loyal user base. But even Apple has recently seen the need to take a breather in radical change with Snow Leopard.

And in the legal information context, think of Westlaw and Lexis/Nexis.  Despite direct competition with one another,  when was the last time we saw a truly radical innovation coming from either of these two companies?

Radical innovation requires the will to risk alienating users. As companies grow and pay attention to shareholder expectations, the will-to-risk often wanes. With radical innovation in the marketplace, the challenge lies in the time axis. If one is very successful with a radically new product at time T, it is hard to throw that product away, and try to risk radically reinventing it, for T+1.

On a macro level, we combat this conservative tendency against radical change by providing incentives for innovative entrepreneurs to develop and market competing offerings. If enough customers are unhappy with Excel, perhaps entrepreneurs with radically new and improved concepts of how to crunch and manage numbers in a structured way will seize the opportunity and develop a new killer app that they’ll pit against Excel. That’s enormously risky, but also offers the potential of very steep rewards. Angel investors and venture capitalists thrive on providing the lubricant (in the form of financial resources) for such high risk, high reward propositions. They flourish on the improbable. What they don’t like are “small ideas.”  (It happened to me, too, when I pitched innovative ideas to VCs; they thought my ideas had a very high likelihood of success, but not enough of a lever to reap massive returns. Obviously I was dismayed, but they were right: it is what we need if we want to incentivize radical innovation.)

This also implies, however, that for venture capital to work, markets need to be large enough to offer high rewards for risky ventures. If the market is not large enough, venture capital may not be available for a sufficient number of radical innovators to keep pushing the limit. Therefore, existing providers may survive for a long time with incremental innovations. Perhaps that is why Westlaw and Lexis are still around, even though they could fight the tendency toward piecemeal development if they wanted to.

Other large corporations, realizing the bias towards incremental innovation, have repeatedly resorted to radical steps to remedy the problem. They have established skunk works, departments that are largely disconnected from the rest of the company, freeing the members to try revolutionary rather than evolutionary solutions. Sometimes companies acquire a group of radically innovative engineers from the outside, to inject some fresh thinking into internal development processes that may have become too stale.

Peer production models, almost always based on an open source foundation, are not dependent on market success. (On the drivers of peer production, see Yochai Benkler’s “The Wealth of Networks.”) They are not profit driven, and thus may put less pressure on the developers to abstain from radical change. Because Firefox does not have to win in the marketplace, its developers can, at least in theory, be bolder than their commercial counterparts.

Unfortunately, open-source peer produced software may also lose its appetite for radical innovation over time  –  not because of monetary incentives, but because of the collaborative structures utilized in the design process. If a large number of volunteering bug reporters, testers, and coders with vastly differing values and preferences work on a joint project, it is likely that development will revert towards a common denominator of what needs to be done, and thus be inherently gradual and evolutionary, rather than radical. Of course, a majority of participants may at rare moments get together and agree on a revolution – much like those in what then was a British colony in 1776.  But that is the brilliant exception to a rather boring rule.

Indecisiveness that stems from too small a common ground, however, is not the only danger. On the other end of the spectrum, communities and groups with too many ties among each other cause a mental alignment, or “group think,” that equally stifles radical innovation. Northwestern University professor Brian Uzzi has written eloquently about this problem. Finding the right sweet spot between the two extremes is what’s necessary, but in the absence of an outside mechanism, that balance is difficult to achieve for open source peer-producing communities.

If we would like to remedy this situation, how could we offer incentives to peer producing communities to more often give radical rather than incremental innovation a try? What could be the mechanism that takes on the role of venture capitalists and skunk works in the peer production context?

It surely isn’t telling dissenters with a radically new idea to “fork out” of a project. That’s like asking a commercial design group to leave the company and try on their own, but without providing them with enough resources or incentives. Not a good idea if we want to make radical innovation – the experimentation with revolutionary rather than incremental ideas – easier, not harder.

But what is the venture capital/skunk works equivalent in the peer-producing world?

A few thoughts come to mind, but I invite you to add your ideas, because I may not be thinking radically enough.

(1) User: Users, from large to small, could volunteer,  perhaps through a website, to dedicate some modicum of their time to advancing an open source project not by contributing to its design, but by committing to being first adopters of more radical design solutions. One may imagine a website that helps link users (including law firms) willing to dedicate some “risk” to such riskier open source peer produced projects, perhaps on a sectoral basis (Could this be yet another mission for the LII?).

(2) Designers: Quite a number of corporations and organizations explicitly support open source peer producing projects, mostly by dedicating some of their human resources to improving the code base. These organizations could, if they wanted to improve the capability of such projects to push for more radical innovation, set up incentives for employees to select riskier projects.

(3) Tools: The very tools used to organize peer production of software code already offer many techniques for managing a diverse array of contributors. These tools could be altered to evaluate a group’s level of diversity and willingness to take risks, based on the findings of social network theory. Such an approach would at least provide the community with a sense of its potential and propensity for radical innovation, and could help group organizers in influencing group composition and group dynamics. (Yes, this is “” and the government IT dashboards applied to this context.)

These are nothing more than a few ideas.  Many more are necessary to identify the best ones to implement. But given the rise and importance of peer production, and the constraints inherent in how it is organizing itself, the conversation about how to best provide incentives for radical innovation in the legal information context – and beyond – is one we must have.

[NB:  What do you all think?  How does this apply to the world of legal information, and to specialized software applications that support it — things like point-in-time legislative systems, specialized processing tools, and so on?  Comments please…. (the ed.)]

Viktor Mayer-Schönberger is Associate Professor of Public Policy and Director of the Information + Innovation Policy Research Centre at the LKY School of Public Policy / National University of Singapore. He is also a faculty affiliate of the Belfer Center of Science and International Affairs at Harvard University. He has published many books, most recently “Delete – The Virtue of Forgetting in the Digital Age.” He is a frequent public speaker, and a sought-after expert for print and broadcast media worldwide. He is also on the boards of numerous foundations, think tanks and organizations focused on studying the foundations of the new economy, and advises governments, businesses and NGOs on new economy and information society issues. In his spare time, he likes to travel, go to the movies, and learn about architecture.
