
[Editor's Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog post “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective on the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews, everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting, refer to techniques that are now 10, 15, even 20 years old. She also refers to the release of the first natural language legal search engine by West back in 1993, WIN (Westlaw Is Natural) [2]. Adding to this on-going conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. The objective we hope to achieve in this posting is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext, separate the function of document retrieval from document ranking. Whereas the first retrieval function primarily addresses recall, ensuring that all potentially relevant documents are retrieved, the second and ensuing function focuses on the ideal ranking of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two search functions as one and the same. So what is the difference? Whereas the document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993, what is dramatically different lies in the evidence that is considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.
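To make the separation concrete, here is a minimal Python sketch of a two-stage pipeline. It is not how WestlawNext is implemented; the `Doc` class, the feature names (`text_match`, `citations`), and the weights are all hypothetical and chosen only for illustration. The first stage casts a wide net for recall, and the second stage orders the candidates by a weighted sum of evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    # Hypothetical ranking evidence; names and values are illustrative only.
    features: dict = field(default_factory=dict)


def retrieve(query, corpus):
    """Stage 1 (recall): keep every document sharing at least one query term."""
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.text.lower().split())]


def rank(candidates, weights):
    """Stage 2 (precision): order candidates by a weighted sum of features."""
    def score(d):
        return sum(w * d.features.get(name, 0.0) for name, w in weights.items())
    return sorted(candidates, key=score, reverse=True)


corpus = [
    Doc("A", "church support for candidates", {"text_match": 0.9, "citations": 0.2}),
    Doc("B", "campaign speech by a church", {"text_match": 0.6, "citations": 0.9}),
    Doc("C", "swimming pool liability", {"text_match": 0.0, "citations": 0.5}),
]
weights = {"text_match": 1.0, "citations": 1.0}  # assumed, not learned

results = rank(retrieve("church candidates", corpus), weights)
print([d.doc_id for d in results])
```

In this toy corpus, document C never reaches the ranking stage, while B outranks A because the citation evidence outweighs A's stronger text match. A single-stage engine that scored on text alone would have put A first.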

Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question, and, frequently, the granularity of these citations is not merely at a document-level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on the citation frequency, but also the polarity and scope of the arguments they make.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]). (See Figure 1) This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of leveraging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.
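As a rough illustration of how such a user view might be tabulated, the Python sketch below aggregates an anonymized event log into per-document popularity scores. The event names, the action weights, and the `MIN_EVENTS` threshold are assumptions made for this example; they are not drawn from any actual Westlaw data or scoring scheme.

```python
from collections import Counter

# Hypothetical anonymized event log: (document id, action). No user identity is kept.
events = [
    ("case-1", "open"), ("case-1", "print"), ("case-1", "keycite"),
    ("case-2", "open"), ("case-1", "open"), ("case-3", "open"),
]

# Assumed weights: printing or validating a document signals more interest than opening it.
ACTION_WEIGHTS = {"open": 1.0, "print": 2.0, "keycite": 2.0}
MIN_EVENTS = 2  # assumed threshold for "sufficient accumulated evidence"


def popularity(events):
    """Sum weighted actions per document, dropping documents with too few events."""
    counts = Counter(doc for doc, _ in events)
    scores = Counter()
    for doc, action in events:
        scores[doc] += ACTION_WEIGHTS.get(action, 0.0)
    return {doc: s for doc, s in scores.items() if counts[doc] >= MIN_EVENTS}


user_view = popularity(events)
print(user_view)
```

Only `case-1` has enough accumulated events to receive a score here; the other two documents are left out rather than ranked on a single click.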

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowing what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.
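One simple way to picture such training is a pairwise approach: show the machine pairs of documents where an expert judged one more relevant than the other, and adjust the feature weights whenever the ranking disagrees. The Python sketch below uses a perceptron-style update as a stand-in; it is an illustrative technique, not the method used in WestlawNext, and the feature order and training pairs are invented for the example.

```python
def score(weights, features):
    """Weighted sum of a document's feature values."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Perceptron-style updates: nudge weights until each preferred
    document outscores the document it was judged better than."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                weights = [w + lr * (b - c)
                           for w, b, c in zip(weights, better, worse)]
    return weights


# Hypothetical feature order: [text match, citation count, user views]
# Each pair is (features of the preferred document, features of the other).
pairs = [
    ([0.5, 0.9, 0.8], [0.9, 0.1, 0.1]),
    ([0.4, 0.8, 0.9], [0.8, 0.2, 0.0]),
]
weights = train_pairwise(pairs, n_features=3)
print(weights)
```

In these invented judgments the preferred documents have weaker text matches but stronger citation and usage evidence, so training ends with the citation and user-view weights above the text-match weight. Production systems use far more features and more robust learning-to-rank methods, but the principle of learning weights from judged examples is the same.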

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drown in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option that subsequently provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.

Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.

Figure 3: A view of a case and complementary materials from WestlawNext

By selecting these sets of relevant topics, a set of recommended cases will be rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.

Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.

References

[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009. http://blog.law.cornell.edu/voxpop/2009/09/17/pushing-the-envelope-innovation-in-legal-search/
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West's Key Number System: http://info.legalsolutions.thomsonreuters.com/pdf/wln2/L-374484.pdf
[4] West's KeyCite Citator Service: http://info.legalsolutions.thomsonreuters.com/pdf/wln2/L-356347.pdf
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Built-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West’s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience applying NLP technologies to a variety of data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of the WestlawNext research team. He has a Ph.D. in computer science and engineering from the State University of New York at Buffalo. He is now a managing associate at Kore Federal in the Washington, D.C. area.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

It's been a rocky year for West's relationship with law librarians.

First, the company declined to participate in this year's American Association of Law Libraries Price Index for Legal Publications. This led AALL to return West's sponsorship check for the 2009 AALL Annual Meeting. For attendees, this decision was somewhat academic, as West still occupied a large space in the Exhibitor Hall and once again hosted a well-attended Customer Appreciation Party.

Shortly after the conference, West issued an email promotion to customers that asked:

Are you on a first name basis with the librarian? If so, chances are, you're spending too much time at the library. What you need is fast, reliable research you can access right in your office.

Many law librarians felt publicly insulted by West, expressing their outrage on listservs, blogs, Twitter, Facebook and anywhere legal information professionals could be found that week.

Most recently, West released a video of University of California, Berkeley professor and law librarian Bob Berring explaining the advantages of "free market" premium legal databases over free legal information websites run by "volunteers:"

It's not like legal information is going to the Safeway or to buy food. You're not buying a packaged thing. If you say I need to find statutes about this, or what's the administrative regulations on that, or have the courts spoken about this, you have to go find it. And just saying it's all out there -- I mean, the ocean is all out there, but you need a map, and you need a compass, and... you need a GPS system now. You need someone to tell you how to get there. That's why librarians are even more important now, because they've got the GPS system. But you have to be working with organized information. The value added by folks like West, where the information is edited as it goes in, and it's classified, and the hooks are put in -- easy hooks for the people who I think are sloppy researchers just playing around on the tops, really sophisticated hooks for the people who take the time to learn how to really use the system and understand it. You just can't say enough about those kind of things, because to say to the average person, "Well, it's all out there, the law is all out there," well, it's a big bunch of goo.

Adding value to the goo

Unfortunately, the West/Lexis duopoly doesn't provide consumers with the expected advantages of a free market economy. Neither vendor uses price as a marketing strategy, and both negotiate electronic database contracts with customers rather than charge a flat rate. Considering that West has increased its own annual profit margin to 30% or higher in recent years, while raising the cost of supplements at a rate far exceeding inflation, prices are hardly being driven by free market trends, making a price war seem unlikely. (This doesn't mean consumers aren't hopping mad about the price of legal information. They are.)

Instead, at least in the database market, both companies rely on content and features to market their products. Each July at the AALL Annual Meeting, both Lexis and Westlaw use their exhibitor space to educate attendees about whatever new databases and customer conveniences will be rolled out in the coming months.

I often compare these annual feature introductions to the evolution of automobile engines, thanks to a childhood spent watching my father work on the family cars. At first Dad knew every nook and cranny of our vehicles, and there was little he couldn't repair himself over the course of a few nights. As we traded in cars for newer models, his job became more difficult as engines became more complex. None of the automakers seemed to consider ease of access when adding new parts to an automobile engine. They were simply slapped on top of the existing ones, making it harder to perform simple tasks, like replacing belts or spark plugs.

Lexis and Westlaw also add new components on top of the old ones. To generalize, Lexis tends to add new features in the form of tabs (think "Total Litigator") while Westlaw adds them in sidebars (think "Results Plus"), to the point where once clean interfaces are now littered with disparate elements sharing adjacent screen real estate.

Finding fault with filters

In a talk at last year's Web 2.0 Expo in New York, author Clay Shirky stated that the fundamental information problem is not "information overload," but "filter failure." Shirky summarized this position in a recent interview with Yale Law School's Jason Eiseman:

As I've often said, there's no such thing as information overload. It's filter failure, right? From the minute we had more books to read than the average literate person could read in a lifetime, which depending on the region you're talking about happened someplace between the 16th and 19th century, from that moment on we've always had information overload. That's the modern condition. What's happening, I think, to our sense that we're suffering acutely from information overload now is that the old professional filters have broken. They're simply not adequate to contain a world in which anyone can put material out in the public.

Whether or not you agree with Shirky's assessment, it provides an interesting framework with which to view the Lexis/Westlaw information problem. If the primary legal information within these systems is "a big bunch of goo," then secondary resources, headnotes, subject-specific organization, and other finding aids are the filters necessary to cope with information overload.

For West's "Are you on a first name basis with the librarian?" promotion to work, Westlaw has to provide the "fast, reliable research you can access right in your office" that it advertises. Assuming for purposes of this essay that the presence of relevant content isn't an issue (an assumption with which many will quibble), this means the system's filters need to provide reliable information quickly.

There's no question that both West and Lexis provide an abundance of subject-specific organization, particularly for case law. Headnotes, topics, digests, tables of authority, citators and cross-references to secondary resources all go above and beyond what researchers find in most freely available resources. But these add-ons, or filters, are only effective if presented in a usable manner.

For an assignment in one of my legal research classes this semester, I provided a fact pattern and asked students to perform a Natural Language search in Westlaw of American Law Reports to find a relevant annotation. In a class of only 19 students, six of them answered with citations to resources other than ALR, including articles from American Jurisprudence, Am.Jur. Proof of Facts, and Shepard's Causes of Action. The problem, it turned out, wasn't that they had searched the wrong database. Every one of them searched ALR correctly, but those six students mistook Westlaw's Results Plus, placed at the top of a sidebar on the results page, for their actual search results. Filter failure, indeed.

On another assignment, students were expected to find a particular statutory code section using a secondary resource, view the code section, then navigate to the code's table of contents to browse related sections codified nearby. This proved nearly impossible for most of them, as the code section they accessed loaded in a pop-up window with no sidebar, thus providing no visible link to the table of contents. The problems didn't stop there. Even once I told them to click the "Maximize" button at the bottom of the pop-up window, which reloads the code section into the main window with a sidebar, upon clicking the TOC link, anyone using Firefox for Windows loaded a blank page. (To resolve this error, you have to right-click on the frame where the TOC should've loaded and select "This Frame -> Reload This Frame.")

While completing another portion of the statutory code assignment in Lexis, nearly half the students in the class became confused because numerous clickable links throughout the system display as plain black text that appears as a link only when the user hovers over it. Also, within statutory code sections, the navigation links provided within the case annotation index routinely loaded an error page rather than navigating to the proper section further down the page.

This doesn't even address basic usability issues such as broken back button functionality, heavy usage of frames, lack of permanent document URLs (Lexis and Westlaw each have external workarounds for this), and reliance on pop-up windows (something blocked by default on most browsers). In addition, Lexis still doesn't support users accessing the system with Firefox for Mac.

The wide availability of secondary resources, annotated codes, and numerous other value-added content provides a clear advantage for Lexis and Westlaw over free and mid-level legal information services, and that's why everyone continues to pay their steep prices. But so long as the systems themselves don't provide usable access, each still suffers from filter failure.

Is there an incentive to improve?

There is evidence that the companies have the expertise to provide a better user experience. West has two electronic versions (one for desktop computers and one for the iPhone) of Black's Law Dictionary available that offer more intuitive functionality than what's provided for the same text in Westlaw. Don't expect a price break, however. The desktop version of Black's has a list price of $99, while the iPhone version costs $49.99. By comparison, the print version of Black's Standard Ninth Edition, which likely has substantially higher production costs than the electronic equivalents, carries a list price of $75, meaning iPhone users receive a slightly lower price while desktop users pay even more. Worse still, both electronic versions as well as the content in Westlaw contain the text of the outdated 8th Edition.

Lexis also has an iPhone app, and it's a free download that requires an existing Lexis password to function. Substantially simplified from its traditional web interface, the user experience is clean and easy to understand. Yet while one can retrieve both primary and secondary documents, as well as Shepardize documents, none of the documents in this interface contain links, only plain citations that must be copied and pasted into the search form to be retrieved.

Of course, the bigger problem with these progressive moves is that they don't address any of the existing problems in the web interfaces for either product. No one is redesigning the engine, so to speak. These are simply variations of the now traditional roll-out of new features and functionality on top of existing ones that still have the same significant issues.

This is the problem with a duopoly. There aren't enough producers in the economy to exert significant pressure on either to improve usability. Consumer power is also limited because multi-year contracts prevent easy product substitution, and there's only one true product substitute available. The producers dictate the competition, and thus far they have dictated a content competition ("The Tabs and Sidebars War"), rather than a usability one -- or even a price one.

There are events on the horizon that could impact this stalemate. Bloomberg continues to develop its own legal research product, allegedly designed to be a Westlaw/Lexis competitor. Perhaps this third producer will see value in using price or usability to gain market share. Lewis & Clark law student (and VoxPopuLII author) Robb Shecter recently introduced OregonLaws.org, a free repository of Oregon law that currently features the entire Oregon Revised Statutes and a legal glossary. The site's simple, logical navigation reflects current web usability norms more accurately than either Lexis or Westlaw, and for a "micro-fee" users can bookmark code sections for quick access and save unlimited "human readable" research trails. And, of course, Google Scholar just added "Legal opinions and journals." It's far too early to know if it will become a true player in legal information, but Google always has the potential to be a game changer with anything it does.

What can legal research instructors DO?

Despite the presence of these interesting new projects, consumers can't expect a quick usability turnaround from Lexis and Westlaw, nor the sudden presence of a competitor with the same depth and breadth of content. History doesn't support such an expectation, leaving legal research instructors in a precarious position.

Many schools leave Lexis/Westlaw training solely in the hands of the companies' representatives. While a company rep will be knowledgeable about the system, he will also paint the product in the best possible light for the company, glossing over usability issues and emphasizing new features. After all, law students are future customers, so this instruction is part of a long-term sales pitch.

In order to provide a balanced picture of these systems, legal research instructors need to provide their own Lexis and Westlaw training. This can either be in place of or in addition to what's provided by company reps, but students need to hear the voice of an experienced researcher who doesn't rely on either company for a paycheck. Some may see this as an implied institutional endorsement of the high-priced systems, but the reality is most students will end up working with one or both of these systems on a daily basis after graduation. Ignoring this would be an educational disservice. Any sense of endorsement can be addressed through thorough coverage of the usability limitations and a short education on the price realities. Instructors can also discuss the availability of lower priced databases for lawyers who simply want access to primary legal materials.

If the market is going to change, it won't be because Lexis and Westlaw spontaneously decide to improve products that generate significant profits already. Until then, legal researchers need to be better educated on the limitations of these systems so that their work product isn't compromised by over-reliance on a duopoly disguised as a free market.

Tom Boone is a reference librarian and adjunct professor at Loyola Law School in Los Angeles. He's also webmaster and a contributing editor for Henderson Valley Eggs, a "themed information collective" website covering law library issues.

VoxPopuLII is edited by Judith Pratt

A new style of legal research

An attorney/author in Baltimore is writing an article about state bans of teachers' religious clothing. She finds one of the tersely written statutes online. The website then does a query of its own and tells her about a useful statute she wasn't aware of---one setting out the permitted disciplinary actions. When she views it, the site makes the connection clear by showing her where the second statute references the original. This new information makes her article's thesis stronger.

Meanwhile, 2800 miles away in Oregon, a law student is researching the relationship between the civil and criminal state codes. Browsing a research site, he notices a pattern of civil laws making use of the criminal code, often to enact civil punishments or enable adverse actions. He then engages the website in an interactive text-based dialog, modifying his queries as he considers the previous results. He finally arrives at an interesting discovery: the offenses with the least additional civil burdens are white collar crimes.

A new kind of research system

A new field of computer-assisted legal research is emerging: one that encompasses research in both the academic and the practical “legal research” senses. The two scenarios above both took place earlier this year, enabled by the OregonLaws.org research system that I created and which typifies these new developments.

Interestingly, this kind of work is very recent; it's distinct from previous uses of computers for researching the law and assisting with legal work. In the past, techniques drawn from computer science have been most often applied to areas such as document management, court administration, and inter-jurisdiction communication. Working to improve administrative systems’ efficiency, people have approached these problem domains through the development of common document formats and methods of data interchange.

The new trend, in contrast, looks in the opposite direction: divergently tackling new problems as opposed to convergently working towards a few focused goals. This organic type of development is occurring because programming and computer science research is vastly cheaper—and much more fun—than it has ever been in the past. Here are a couple of examples of this new trend:

“Computer Programming and the Law”

Law professor Paul Ohm recently wrote a proposal for a new “interdisciplinary research agenda” which he calls “Computer Programming and the Law.” (The law review article is itself also a functioning computer program, written in the literate programming style.) He envisions “researcher-programmers,” enabled by the steadily declining cost of computer-aided research, using computers in revolutionary ways for empirical legal scholarship. He illustrates four new methods for this kind of research: developing computer programs to “gather, create, visualize, and mine data” that can be found in diverse and far-flung sources.

“Computational Legal Studies”

Grad students Daniel Katz and Michael Bommarito (researcher-programmers, as Paul Ohm would call them) created the Computational Legal Studies Blog in March 2009. The web site is a growing collection of visualizations applied to diverse legal and policy issues. The site is part showcase for the authors’ own work and part catalog of the current work of others.

OregonLaws.org

I started the OregonLaws.org project because I wanted faster and easier access to the 2007 Oregon Revised Statutes (ORS) and other primary and secondary sources. I had a couple of very statute-heavy courses (Wills & Trusts, and Criminal Law) and I frequently needed to quickly find an ORS section. But as I got further into the development, I realized that it could become a platform for experimenting with computational analysis of legal information, similar to the work being done on the Computational Legal Studies Blog.

I developed the system using pretty much the steps that Paul Ohm discussed:

  1. Gathering data: I downloaded and cleaned up the ORS source documents, converting them from MS Word/HTML to plain text;
  2. Creating: I parsed the texts, creating a database model reflecting the taxonomy of the ORS: Volumes, Titles, Chapters, etc.;
  3. Creating: I created higher-level database entities based on insights into the documents, for example by modeling textual references between sections explicitly as reference objects that capture a relationship between a referrer and a referent; and
  4. Mining and Visualizing: Finally, I've begun making web-based views of these newly found objects and relationships.
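
As a rough illustration of steps 2 and 3, here is a minimal Python sketch; it is not the actual OregonLaws.org code. The class names, the citation pattern, and the sample sections (chapter 999 and its wording) are all assumptions made up for the example. It shows how explicit reference objects make it easy to ask "which sections cite this one?", the kind of lookup behind the first scenario above.

```python
import re
from dataclasses import dataclass

# Assumed citation pattern, e.g. "ORS 999.010". Real statute text cites
# sections in more ways than this sketch handles.
ORS_CITATION = re.compile(r"\bORS (\d+[A-Z]?\.\d+)")


@dataclass(frozen=True)
class Section:
    number: str   # section number, e.g. "999.010" (invented for this sketch)
    chapter: str
    title: str
    text: str


@dataclass(frozen=True)
class Reference:
    referrer: str  # number of the section containing the citation
    referent: str  # number of the section being cited


def extract_references(sections):
    """Return a Reference for each citation that points at a known section."""
    known = {s.number for s in sections}
    refs = []
    for s in sections:
        for cited in ORS_CITATION.findall(s.text):
            if cited in known and cited != s.number:
                refs.append(Reference(referrer=s.number, referent=cited))
    return refs


def referrers_of(number, refs):
    """Sorted numbers of the sections that cite the given section."""
    return sorted({r.referrer for r in refs if r.referent == number})


# Made-up sample data; chapter 999 and its wording are placeholders.
sections = [
    Section("999.010", "999", "Sample Title", "A prohibition is stated here."),
    Section("999.020", "999", "Sample Title",
            "Discipline for violating ORS 999.010 is limited as follows."),
]
refs = extract_references(sections)
print(referrers_of("999.010", refs))
```

With the sample data, the lookup reports that section 999.020 cites section 999.010. Storing references as their own objects, rather than re-scanning text on every page view, is one plausible design; the real system may do this differently.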

The object database is the key to intelligent research

Taking the time to go through the steps listed above makes powerful new features possible. Below are some additions to the features described in the introductory scenarios:

We can search smarter. In a previous VoxPopuLII post, Julie Jones advocates dropping our usual search methods and applying techniques like subject-based indexing (a la Factiva's) to legal content.

This is straightforward to implement with an object model. The Oregon Legislature created the ORS with a conceptual structure similar to that of most states: the actual content is found in Sections, which are grouped into Chapters, which are in turn grouped into Titles. I was impressed by the organization and the architecture that I was discovering, insights that are obscured by the ways statutes are traditionally presented.

search-filter.png

And so I sought out ways to make use of the legislature's efforts whenever it made sense. In the case of search results, the Title organization and naming were extremely useful. Each Section returned by the search engine "knows" what Chapter and Title it belongs to. A small piece of code can then calculate what Titles are represented in the results, and how frequently. The resulting bar graph doubles as an easy way for users to specify filtering by "subject area". The screenshot above shows a search for "forest".
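
The "small piece of code" might look something like the following Python sketch. It is only an illustration under assumptions: the function names are hypothetical, and the section numbers and Title names are invented placeholders rather than real ORS data.

```python
from collections import Counter


def title_counts(results):
    """Count how many returned Sections fall under each Title.

    `results` is a list of (section_number, title_name) pairs; the result is
    a list of (title_name, count) pairs, most frequent first.
    """
    return Counter(title for _, title in results).most_common()


def filter_by_title(results, title_name):
    """Keep only the results that belong to the chosen Title."""
    return [r for r in results if r[1] == title_name]


def text_bar_graph(counts):
    """Render the counts as a plain-text bar graph, one Title per line."""
    return "\n".join(f"{title:<12} {'#' * n} ({n})" for title, n in counts)


# Invented search results; section numbers and Title names are placeholders.
results = [
    ("901.010", "Title A"),
    ("901.020", "Title A"),
    ("902.005", "Title B"),
    ("901.300", "Title A"),
]
counts = title_counts(results)
print(text_bar_graph(counts))
```

Clicking a bar in the graph would then correspond to calling `filter_by_title` with that Title's name, which narrows the result list to one "subject area".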

The ORS's framework of Volumes, Titles, and Chapters was essentially a tag cloud waiting to be discovered.

We can get better authentication. In another VoxPopuLII post, John Joergensen discussed the need for authentication of digital resources. One aspect of this is showing the user the chain of custody from the original source to the current presentation. His ideas about using digital signatures are excellent: they describe a scenario in which a user can verify an electronic document's legitimacy with complete assurance.

glossary-citations.png

We can get a good start towards this goal by explicitly modeling content sources. A source is given attributes for everything we'd want to know to create a citation: the date last accessed, the URL it is available at, etc. Every content object in the database is linked to one of these source objects. Now, every time we display a document, we can create properly formatted citations to the original sources.
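
A minimal Python sketch of this idea follows. The class names, field names, citation wording, and the example.org URL are all assumptions for illustration; the citation format shown is a simple placeholder, not a formal legal citation style.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    """Where a piece of content came from, with what a citation needs."""
    name: str
    url: str
    last_accessed: date

    def citation(self):
        # Placeholder wording; a real system would follow a citation manual.
        return (f"{self.name}, available at {self.url} "
                f"(last accessed {self.last_accessed.isoformat()}).")


@dataclass
class ContentObject:
    """Any displayed document; each one is linked to exactly one Source."""
    heading: str
    body: str
    source: Source


def render(content):
    """Return the document text followed by a citation to its source."""
    return f"{content.heading}\n{content.body}\n\nSource: {content.source.citation()}"


# Invented example values.
src = Source("Sample Statutes", "https://example.org/statutes", date(2009, 3, 5))
doc = ContentObject("999.010", "A prohibition is stated here.", src)
print(render(doc))
```

Because every content object carries a link to its source, the citation can be generated at display time rather than typed by hand for each page.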

The gather/create/mine/visualize and object-based approaches open up so many new possibilities that they can't all be discussed in one short article. It sometimes seems that each new step taken enables previously unforeseen features. A few of these are new documents created by re-sorting and aggregating content, web service APIs, and extra annotations that enhance clarity. I believe that in the end, the biggest accomplishment of projects like this will be to raise our expectations for electronic legal research services, increase their quality, and lower their cost.

Robb Shecter is a software engineer and third-year law student at Lewis & Clark Law School in Portland, Oregon. He is Managing Editor for the Animal Law Review, plays jazz bass, and has published articles in Linux Journal, Dr. Dobb's Journal, and Java Report.

VoxPopuLII is edited by Judith Pratt.