skip navigation

[Editor's Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog on “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective of the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews − everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting − refer to techniques that are now 10, 15, even 20 years old. She also refers to the release of the first natural language legal search engine by West back in 1993−WIN (Westlaw Is Natural) [2]. Adding to this on-going conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. The objective we hope to achieve in this posting is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext , separate the function of document retrieval from document ranking. Whereas the first retrieval function primarily addresses recall, ensuring that all potentially relevant documents are retrieved, the second and ensuing function focuses on the ideal ranking of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two search functions as one and the same. So what is the difference? Whereas the document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993, what is dramatically different lies in the evidence that is considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.

Figure 1: Views

Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question, and, frequently, the granularity of these citations is not merely at a document-level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on the citation frequency, but also the polarity and scope of the arguments they wager as well.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]). (See Figure 1) This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of leveraging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowingVOX.LegalResearch what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drowned in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option that subsequently provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.


Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.


Figure 3: A view of a case and complementary materials from WestlawNext

By selecting these sets of relevant topics, a set of recommended cases will be rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.


Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.


[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009.
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West’s Key Number System:
[4] West’s KeyCite Citator Service:
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Build-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011)(Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development  (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West‘s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience of applying various NLP technologies in various data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of WestlawNext research team. He has a Ph.D. in computer science and engineering from State University of New York at Buffalo. He is now a managing associate at Kore Federal in Washington D.C. area.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

VOX.latin. ignorance_of_the_law_excuses_no_one_card-p137531564186928160envwi_400At CourtListener, we are making a free database of court opinions with the ultimate goal of providing the entire U.S. case-law corpus to the world for free and combining it with cutting-edge search and research tools. We–like most readers of this blog–believe that for justice to truly prevail, the law must be open and equally accessible to everybody.

It is astonishing to think that the entire U.S. case-law corpus is not currently available to the world at no cost. Many have started down this path and stopped, so we know we’ve set a high goal for a humble open source project. From time to time it’s worth taking a moment to reflect on where we are and where we’d like to go in the coming years.

The current state of affairs

We’ve created a good search engine that can provide results based on a number of characteristics of legal cases. Our users can search for opinions by the case name, date, or any text that’s in the opinion, and can refine by court, by precedential status or by citation. The results are pretty good, but are limited based on the data we have and the “relevance signals” that we have in place.

A good legal search engine will use a number of factors (a.k.a. “relevance signals”) to promote documents to the top of their listings. Things like:

  • How recent is the opinion?
  • How many other opinions have cited it?
  • How many journals have cited it?
  • How long is it?
  • How important is the court that heard the case?
  • Is the case in the jurisdiction of the user?
  • Is the opinion one that the user has looked at before?
  • What was the subsequent treatment of the opinion?

And so forth. All of the above help to make search results better, and we’ve seen paid legal search tools make great strides in their products by integrating these and other factors. At CourtListener, we’re using a number of the above, but we need to go further. We need to use as many factors as possible, we need to learn how the factors interact with each other, which ones are the most important, and which lead to the best results.

A different problem we’re working to solve at CourtListener is getting primary legal materials freely onto the Web. What good is a search engine if the opinion you need isn’t there in the first place? We currently have about 800,000 federal opinions, including West’s second and third Federal Reporters, F.2d and F.3d, and the entire Supreme Court corpus. This is good and we’re very proud of the quality of our database–we think it’s the best free resource there is. Every day we add the opinions from the Circuit Courts in the federal system and the U.S. Supreme Court, nearly in real-time. But we need to go further: we need to add state opinions, and we need to add not just the latest opinions but all the historical ones as well.

This sounds daunting, but it’s a problem that we hope will be solved in the next few years. Although it’s taking longer than we would like, in time we are confident that all of the important historical legal data will make its way to the open Internet. Primary legal sources are already in the public domain, so now it’s just a matter of getting it into good electronic formats so that anyone can access it and anyone can re-use it. If an opinion only exists as unsearchable scanned versions, in bound books, or behind a pricey pay wall, then it’s closed to many people that should have access to it. As part of our citation identification project, which I’ll talk about next, we’re working to get the most important documents properly digitized.

Our citation identification project was developed last year by U.C. Berkeley School of Information students Rowyn McDonald and Karen Rustad to identify and cross-link any citations found in our database. This is a great feature that makes all the citations in our corpus link to the correct opinions, if we have them. For example, if you’re reading an opinion that has a reference to Roe v. Wade, you can click on the citation, and you’ll be off and reading Roe v. Wade. By the way, if you’re wondering how many Federal Appeals opinions cite Roe v. Wade, the number in our system is 801 opinions (and counting). If you’re wondering what the most-cited opinion in our system is, you may be bemused: With about 10,000 citations, it’s an opinion about ineffective assistance of legal counsel in death penalty cases, Strickland v. Washington, 466 U.S. 668 (1984).

A feature we’ll be working on soon will tie into our citation system to help us close any gaps in our corpus. Once the feature is done, whenever an opinion is cited that we don’t yet have, our users will be able to pay a small amount–one or two dollars–to sponsor the digitization of that opinion. We’ll do the work of digitizing it, and after that point the opinion will be available to the public for free.

This brings us to the next big feature we added last year: bulk data. Because we want to assist academic VOX.pile.of.paperresearchers and others who might have a use for a large database of court opinions, we provide free bulk downloads of everything we have. Like Carl Malamud’s, (to whom we owe a great debt for his efforts to collect opinions and provide them to others for free and for his direct support of our efforts) we have giant files you can download that provide thousands of opinions in computer-readable format. These downloads are available by court and date, and include thousands of fixes to the corpus. They also include something you can’t find anywhere else: the citation network. As part of the metadata associated with each opinion in our bulk download files, you can look and see which opinions it cites as well as which opinions cite it. This provides a valuable new source of data that we are very eager for others to work with. Of course, as new opinions are added to our system, we update our downloads with the new citations and the new information.

Finally, we would be remiss if we didn’t mention our hallmark feature: daily, weekly and monthly email alerts. For any query you put into CourtListener, you can request that we email you whenever there are new results. This feature was the first one we created, and one that we continue to be excited about. This year we haven’t made any big innovations to our email alerts system, but its popularity has continued to grow, with more than 500 alerts run each day. Next year, we hope to add a couple small enhancements to this feature so it’s smoother and easier to use.

The future

I’ve hinted at a lot of our upcoming work in the sections above, but what are the big-picture features that we think we need to achieve our goals?

We do all of our planning in the open, but we have a few things cooking in the background that we hope to eventually build. Among them are ideas for adding oral argument audio, case briefs, and data from PACER. Adding these new types of information to CourtListener is a must if we want to be more useful for research purposes, but doing so is a long-term goal, given the complexity of doing them well.

We also plan to build an opinion classifier that could automatically, and without human intervention, determine the subsequent treatment of opinions. Done right, this would allow our users to know at a glance if the opinion they’re reading was subsequently followed, criticized, or overruled, making our system even more valuable to our users.

In the next few years, we’ll continue building out these features, but as an open-source and open-data project, everything we do is in the open. You can see our plans on our feature tracker, our bugs in our bug tracker, and can get in touch in our forum. The next few years look to be very exciting as we continue building our collection and our platform for legal research. Let’s see what the new year brings!


lissnerMichael Lissner is the co-founder and lead developer of CourtListener, a project that works to make the law more accessible to all. He graduated from U.C. Berkeley’s School of Information, and when he’s not working on CourtListener he develops search and eDiscovery solutions for law firms. Michael is passionate about bringing greater access to our primary legal materials, about how technology can replace old legal models, and about open source, community-driven approaches to legal research.


carverBrian W. Carver is Assistant Professor at the U.C. Berkeley School of Information where he does ressearch on and teaches about intellectual property law and cyberlaw. He is also passionate about the public’s access to the law. In 2009 and 2010 he advised an I School Masters student, Michael Lissner, on the creation of, an alert service covering the U.S. federal appellate courts. After Michael’s graduation, he and Brian continued working on the site and have grown the database of opinions to include over 750,000 documents. In 2011 and 2012, Brian advised I School Masters students Rowyn McDonald and Karen Rustad on the creation of a legal citator built on the CourtListener database.


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.


Van Winkle wakes

In this post, we return to a topic we first visited in a book chapter in 2004.  At that time, one of us (Bruce) was an electronic publisher of Federal court cases and statutes, and the other (Hillmann, herself a former law cataloger) was working with large, aggregated repositories of scientific papers as part of the National Sciences Digital Library project.  Then, as now, we were concerned that little attention was being paid to the practical tradeoffs involved in publishing high quality metadata at low cost.  There was a tendency to design metadata schemas that said absolutely everything that could be said about an object, often at the expense of obscuring what needed to be said about it while running up unacceptable costs.  Though we did not have a name for it at the time, we were already deeply interested in least-cost, use-case-driven approaches to the design of metadata models, and that naturally led us to wonder what “good” metadata might be.  The result was “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”, published as a chapter in an ALA publication, Metadata in Practice.

In that chapter, we attempted to create a framework for talking about (and evaluating) metadata quality.  We were concerned primarily with metadata as we were then encountering it: in aggregations of repositories containing scientific preprints, educational resources, and in caselaw and other primary legal materials published on the Web.   We hoped we could create something that would be both domain-independent and useful to those who manage and evaluate metadata projects.  Whether or not we succeeded is for others to judge.

The Original Framework

At that time, we identified seven major components of metadata quality. Here, we reproduce a part of a summary table that we used to characterize the seven measures. We suggested questions that might be used to draw a bead on the various measures we proposed:

Quality Measure Quality Criteria
Completeness Does the element set completely describe the objects?
Are all relevant elements used for each object?
Provenance Who is responsible for creating, extracting, or transforming the metadata?
How was the metadata created or extracted?
What transformations have been done on the data since its creation?
Accuracy Have accepted methods been used for creation or extraction?
What has been done to ensure valid values and structure?
Are default values appropriate, and have they been appropriately used?
Conformance to expectations Does metadata describe what it claims to?
Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
Are compromises documented and in line with community expectations?
Logical consistency and coherence Is data in elements consistent throughout?
How does it compare with other data within the community?
Timeliness Is metadata regularly updated as the resources change?
Are controlled vocabularies updated when relevant?
Accessibility Is an appropriate element set for audience and community being used?
Is it affordable to use and maintain?
Does it permit further value-adds?


There are, of course, many possible elaborations of these criteria, and many other questions that help get at them.  Almost nine years later, we believe that the framework remains both relevant and highly useful, although (as we will discuss in a later section) we need to think carefully about whether and how it relates to the quality standards that the Linked Open Data (LOD) community is discovering for itself, and how it and other standards should affect library and publisher practices and policies.

… and the environment in which it was created

Our work was necessarily shaped by the environment we were in.  Though we never really said so explicitly, we were looking for quality not only in the data itself, but in the methods used to organize, transform and aggregate it across federated collections.  We did not, however, anticipate the speed or scale at which standards-based methods of data organization would be applied.  Commonly-used standards like FOAF, models such as those contained in, and lightweight modelling apparatus like SKOS are all things that have emerged into common use since, and of course the use of Dublin Core — our main focus eight years ago — has continued even as the standard itself has been refined.  These days, an expanded toolset makes it even more important that we have a way to talk about how well the tools fit the job at hand, and how well they have been applied. An expanded set of design choices accentuates the need to talk about how well choices have been made in particular cases.

Although our work took its inspiration from quality standards developed by a government statistical service, we had not really thought through the sheer multiplicity of information services that were available even then.  We were concerned primarily with work that had been done with descriptive metadata in digital libraries, but of course there were, and are, many more people publishing and consuming data in both the governmental and private sectors (to name just two).  Indeed, there was already a substantial literature on data quality that arose from within the management information systems (MIS) community, driven by concerns about the reliability and quality of  mission-critical data used and traded by businesses.  In today’s wider world, where work with library metadata will be strongly informed by the Linked Open Data techniques developed for a diverse array of data publishers, we need to take a broader view.  

Finally, we were driven then, as we are now, by managerial and operational concerns. As practitioners, we were well aware that metadata carries costs, and that human judgment is expensive.  We were looking for a set of indicators that would spark and sustain discussion about costs and tradeoffs.  At that time, we were mostly worried that libraries were not giving costs enough attention, and were designing metadata projects that were unrealistic given the level of detail or human intervention they required.  That is still true.  The world of Linked Data requires well-understood metadata policies and operational practices simply so publishers can know what is expected of them and consumers can know what they are getting. Those policies and practices in turn rely on quality measures that producers and consumers of metadata can understand and agree on.  In today’s world — one in which institutional resources are shrinking rather than expanding —  human intervention in the metadata quality assessment process at any level more granular than that of the entire data collection being offered will become the exception rather than the rule.   

While the methods we suggested at the time were self-consciously domain-independent, they did rest on background assumptions about the nature of the services involved and the means by which they were delivered. Our experience had been with data aggregated by communities where the data producers and consumers were to some extent known to one another, using a fairly simple technology that was easy to run and maintain.  In 2013, that is not the case; producers and consumers are increasingly remote from each other, and the technologies used are both more complex and less mature, though that is changing rapidly.

The remainder of this blog post is an attempt to reconsider our framework in that context.

The New World

The Linked Open Data (LOD) community has begun to consider quality issues; there are some noteworthy online discussions, as well as workshops resulting in a number of published papers and online resources.  It is interesting to see where the work that has come from within the LOD community contrasts with the thinking of the library community on such matters, and where it does not.  

In general, the material we have seen leans toward the traditional data-quality concerns of the MIS community.  LOD practitioners seem to have started out by putting far more emphasis than we might on criteria that are essentially audience-dependent, and on operational concerns having to do with the reliability of publishing and consumption apparatus.   As it has evolved, the discussion features an intellectual move away from those audience-dependent criteria, which are usually expressed as “fitness for use”, “relevance”, or something of the sort (we ourselves used the phrase “community expectations”). Instead, most realize that both audience and usage  are likely to be (at best) partially unknown to the publisher, at least at system design time.  In other words, the larger community has begun to grapple with something librarians have known for a while: future uses and the extent of dissemination are impossible to predict.  There is a creative tension here that is not likely to go away.  On the one hand, data developed for a particular community is likely to be much more useful to that community; thus our initial recognition of the role of “community expectations”.  On the other, dissemination of the data may reach far past the boundaries of the community that develops and publishes it.  The hope is that this tension can be resolved by integrating large data pools from diverse sources, or by taking other approaches that result in data models sufficiently large and diverse that “community expectations” can be implemented, essentially, by filtering.

For the LOD community, the path that began with  “fitness-for-use” criteria led quickly to the idea of maintaining a “neutral perspective”. Christian Fürber describes that perspective as the idea that “Data quality is the degree to which data meets quality requirements no matter who is making the requirements”.  To librarians, who have long since given up on the idea of cataloger objectivity, a phrase like “neutral perspective” may seem naive.  But it is a step forward in dealing with data whose dissemination and user community is unknown. And it is important to remember that the larger LOD community is concerned with quality in data publishing in general, and not solely with descriptive metadata, for which objectivity may no longer be of much value.  For that reason, it would be natural to expect the larger community to place greater weight on objectivity in their quality criteria than the library community feels that it can, with a strong preference for quantitative assessment wherever possible.  Librarians and others concerned with data that involves human judgment are theoretically more likely to be concerned with issues of provenance, particularly as they concern who has created and handled the data.  And indeed that is the case.

The new quality criteria, and how they stack up

Here is a simplified comparison of our 2004 criteria with three views taken from the LOD community.

Bruce & Hillmann Dodds, McDonald Flemming
Completeness Completeness
Amount of data
Provenance History
Accuracy Accuracy
Validity of documents
Conformance to expectations Modeling correctness
Modeling granularity
Logical consistency and coherence Directionality
Modeling correctness
Internal consistency
Referential correspondence
Timeliness Currency Timeliness
Accessibility Intelligibility
Accessibility (technical)
Performance (technical)

Placing the “new” criteria into our framework was no great challenge; it appears that we were, and are, talking about many of the same things. A few explanatory remarks:

  • Boundedness has roughly the same relationship to completeness that precision does to recall in information-retrieval metrics. The data is complete when we have everything we want; its boundedness shows high quality when we have only what we want.
  • Flemming’s amount of data criterion talks about numbers of triples and links, and about the interconnectedness and granularity of the data.  These seem to us to be largely completeness criteria, though things to do with linkage would more likely fall under “Logical coherence” in our world. Note, again, a certain preoccupation with things that are easy to count.  In this case it is somewhat unsatisfying; it’s not clear what the number of triples in a triplestore says about quality, or how it might be related to completeness if indeed that is what is intended.
  • Everyone lists criteria that fit well with our notions about provenance. In that connection, the most significant development has been a great deal of work on formalizing the ways in which provenance is expressed.  This is still an active level of research, with a lot to be decided.  In particular, attempts at true domain independence are not fully successful, and will probably never be so.  It appears to us that those working on the problem at DCMI are monitoring the other efforts and incorporating the most worthwhile features.
  • Dodds’ typing criterion — which basically says that dereferenceable URIs should be preferred to string literals  – participates equally in completeness and accuracy categories.  While we prefer URIs in our models, we are a little uneasy with the idea that the presence of string literals is always a sign of low quality.  Under some circumstances, for example, they might simply indicate an early stage of vocabulary evolution.
  • Flemming’s verifiability and validity criteria need a little explanation, because the terms used are easily confused with formal usages and so are a little misleading.  Verifiability bundles a set of concerns we think of as provenance.  Validity of documents is about accuracy as it is found in things like class and property usage.  Curiously, none of Flemming’s criteria have anything to do with whether the information being expressed by the data is correct in what it says about the real world; they are all designed to convey technical criteria.  The concern is not with what the data says, but with how it says it.
  • Dodds’ modeling correctness criterion seems to be about two things: whether or not the model is correctly constructed in formal terms, and whether or not it covers the subject domain in an expected way.  Thus, we assign it to both “Community expectations” and “Logical coherence” categories.
  • Isomorphism has to do with the ability to join datasets together, when they describe the same things.  In effect, it is a more formal statement of the idea that a given community will expect different models to treat similar things similarly. But there are also some very tricky (and often abused) concepts of equivalence involved; these are just beginning to receive some attention from Semantic Web researchers.
  • Licensing has become more important to everyone. That is in part because Linked Data as published in the private sector may exhibit some of the proprietary characteristics we saw as access barriers in 2004, and also because even public-sector data publishers are worried about cost recovery and appropriate-use issues.  We say more about this in a later section.
  • A number of criteria listed under Accessibility have to do with the reliability of data publishing and consumption apparatus as used in production.  Linked Data consumers want to know that the endpoints and triple stores they rely on for data are going to be up and running when they are needed.  That brings a whole set of accessibility and technical performance issues into play.  At least one website exists for the sole purpose of monitoring endpoint reliability, an obvious concern of those who build services that rely on Linked Data sources. Recently, the LII made a decision to run its own mirror of the DrugBank triplestore to eliminate problems with uptime and to guarantee low latency; performance and accessibility had become major concerns. For consumers, due diligence is important.

For us, there is a distinctly different feel to the examples that Dodds, Flemming, and others have used to illustrate their criteria; they seem to be looking at a set of phenomena that has substantial overlap with ours, but is not quite the same.  Part of it is simply the fact, mentioned earlier, that data publishers in distinct domains have distinct biases. For example, those who can’t fully believe in objectivity are forced to put greater emphasis on provenance. Others who are not publishing descriptive data that relies on human judgment feel they can rely on more  “objective” assessment methods.  But the biggest difference in the “new quality” is that it puts a great deal of emphasis on technical quality in the construction of the data model, and much less on how well the data that populates the model describes real things in the real world.  

There are three reasons for that.  The first has to do with the nature of the discussion itself. All quality discussions, simply as discussions, seem to neglect notions of factual accuracy because factual accuracy seems self-evidently a Good Thing; there’s not much to talk about.  Second, the people discussing quality in the LOD world are modelers first, and so quality is seen as adhering primarily to the model itself.  Finally, the world of the Semantic Web rests on the assumption that “anyone can say anything about anything”, For some, the egalitarian interpretation of that statement reaches the level of religion, making it very difficult to measure quality by judging whether something is factual or not; from a purist’s perspective, it’s opinions all the way down.  There is, then, a tendency to rely on formalisms and modeling technique to hold back the tide.

In 2004, we suggested a set of metadata-quality indicators suitable for managers to use in assessing projects and datasets.  An updated version of that table would look like this:

Quality Measure Quality Criteria
Completeness Does the element set completely describe the objects?
Are all relevant elements used for each object?
Does the data contain everything you expect?
Does the data contain only what you expect?
Provenance Who is responsible for creating, extracting, or transforming the metadata?
How was the metadata created or extracted?
What transformations have been done on the data since its creation?
Has a dedicated provenance vocabulary been used?
Are there authenticity measures (eg. digital signatures) in place?
Accuracy Have accepted methods been used for creation or extraction?
What has been done to ensure valid values and structure?
Are default values appropriate, and have they been appropriately used?
Are all properties and values valid/defined?
Conformance to expectations Does metadata describe what it claims to?
Does the data model describe what it claims to?
Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
Are compromises documented and in line with community expectations?
Logical consistency and coherence Is data in elements consistent throughout?
How does it compare with other data within the community?
Is the data model technically correct and well structured?
Is the data model aligned with other models in the same domain?
Is the model consistent in the direction of relations?
Timeliness Is metadata regularly updated as the resources change?
Are controlled vocabularies updated when relevant?
Accessibility Is an appropriate element set for audience and community being used?
Is the data and its access methods well-documented, with exemplary queries and URIs?
Do things have human-readable labels?
Is it affordable to use and maintain?
Does it permit further value-adds?
Does it permit republication?
Is attribution required if the data is redistributed?
Are human- and machine-readable licenses available?
Accessibility — technical Are reliable, performant endpoints available?
Will the provider guarantee service (eg. via a service level agreement)?
Is the data available in bulk?
Are URIs stable?


The differences in the example questions reflect the differences of approach that we discussed earlier. Also, the new approach separates criteria related to technical accessibility from questions that relate to intellectual accessibility. Indeed, we suspect that “accessibility” may have been too broad a notion in the first place. Wider deployment of metadata systems and a much greater, still-evolving variety of producer-consumer scenarios and relationships have created a need to break it down further.  There are as many aspects to accessibility as there are types of barriers — economic, technical, and so on.

As before, our list is not a checklist or a set of must-haves, nor does it contain all the questions that might be asked.  Rather, we intend it as a list of representative questions that might be asked when a new Linked Data source is under consideration.  They are also questions that should inform policy discussion around the uses of Linked Data by consuming libraries and publishers.  

That is work that can be formalized and taken further. One intriguing recent development is work toward a Data Quality Management Vocabulary.   Its stated aims are to

  • support the expression of quality requirements in the same language, at web scale;
  • support the creation of consensual agreements about quality requirements
  • increase transparency around quality requirements and measures
  • enable checking for consistency among quality requirements, and
  • generally reduce the effort needed for data quality management activities


The apparatus to be used is a formal representation of “quality-relevant” information.   We imagine that the researchers in this area are looking forward to something like automated e-commerce in Linked Data, or at least a greater ability to do corpus-level quality assessment at a distance.  Of course, “fitness-for-use” and other criteria that can really only be seen from the perspective of the user will remain important, and there will be interplay between standardized quality and performance measures (on the one hand) and audience-relevant features on the other.   One is rather reminded of the interplay of technical specifications and “curb appeal” in choosing a new car.  That would be an important development in a Semantic Web industry that has not completely settled on what a car is really supposed to be, let alone how to steer or where one might want to go with it.


Libraries have always been concerned with quality criteria in their work as a creators of descriptive metadata.  One of our purposes here has been to show how those criteria will evolve as libraries become publishers of Linked Data, as we believe that they must. That much seems fairly straightforward, and there are many processes and methods by which quality criteria can be embedded in the process of metadata creation and management.

More difficult, perhaps, is deciding how these criteria can be used to construct policies for Linked Data consumption.  As we have said many times elsewhere, we believe that there are tremendous advantages and efficiencies that can be realized by linking to data and descriptions created by others, notably in connecting up information about the people and places that are mentioned in legislative information with outside information pools.   That will require care and judgement, and quality criteria such as these will be the basis for those discussions.  Not all of these criteria have matured — or ever will mature — to the point where hard-and-fast metrics exist.  We are unlikely to ever see rigid checklists or contractual clauses with bullet-pointed performance targets, at least for many of the factors we have discussed here. Some of the new accessibility criteria might be the subject of service-level agreements or other mechanisms used in electronic publishing or database-access contracts.  But the real use of these criteria is in assessments that will be made long before contracts are negotiated and signed.  In that setting, these criteria are simply the lenses that help us know quality when we see it.



Thomas R. Bruce is the Director of the Legal Information Institute at the Cornell Law School.

Diane Hillmann is a principal in Metadata Management Associates, and a long-time collaborator with the Legal Information Institute.  She is currently a member of the Advisory Board for the Dublin Core Metadata Initiative (DCMI), and was co-chair of the DCMI/RDA Task Group.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

There have been a series of efforts to create a national legislative data standard – one master XML format to which all states will adhere for bills, laws, and regulations.Those efforts have gone poorly.

Few states provide bulk downloads of their laws. None provide APIs. Although nearly all states provide websites for people to read state laws, they are all objectively terrible, in ways that demonstrate that they were probably pretty impressive in 1995. Despite the clear need for improved online display of laws, the lack of a standard data format and the general lack of bulk data has enabled precious few efforts in the private sector. (Notably, there is Robb Schecter’s, which provides vastly improved experiences for the laws of California, Oregon, and New York. There was also a site built experimentally by Ari Hershowitz that was used as a platform for last year’s California Laws Hackathon.)

A significant obstacle to prior efforts has been the perceived need to create a single standard, one that will accommodate the various textual legal structures that are employed throughout government. This is a significant practical hurdle on its own, but failure is all but guaranteed by also engaging major stakeholders and governments to establish a standard that will enjoy wide support and adoption.

What if we could stop letting the perfect be the enemy of the good? What if we ignore the needs of the outliers, and establish a “good enough” system, one that will at first simply work for most governments? And what if we completely skip the step of establishing a standard XML format? Wouldn’t that get us something, a thing superior to the nothing that we currently have?

The State Decoded
This is the philosophy behind The State Decoded. Funded by the John S. and James L. Knight Foundation, The State Decoded is a free, open source program to put legal codes online, and it does so by simply skipping over the problems that have hampered prior efforts. The project does not aspire to create any state law websites on its own but, instead, to provide the software to enable others to do so.

Still in its development (it’s at version 0.4), The State Decoded leaves it to each implementer to gather up the contents of the legal code in question and interface it with the program’s internal API. This could be done via screen-scraping off of an existing state code website, modifying the parser to deal with a bulk XML file, converting input data into the program’s simple XML import format, or by a few other methods. While a non-trivial task, it’s something that can be knocked out in an afternoon, thus avoiding the need to create a universal data format and to persuade Wexis to provide their data in that format.

The magic happens after the initial data import. The State Decoded takes that raw legal text and uses it to populate a complete, fully functional website for end-users to search and browse those laws. By packaging the Solr search engine and employing some basic textual analysis, every law is cross-referenced with other laws that cite it and laws that are textually similar. If there exists a repository of legal decisions for the jurisdiction in question, that can be incorporated, too, displaying a list of the court cases that cite each section. Definitions are detected, loaded into a dictionary, and make the laws self-documenting. End users can post comments to each law. Bulk downloads are created, letting people get a copy of the entire legal code, its structural elements, or the automatically assembled dictionary. And there’s a REST-ful, JSON-based API, ready to be used by third parties. All of this is done automatically, quickly, and seamlessly. The time elapsed varies, depending on server power and the length of the legal code, but it generally takes about twenty minutes from start to finish.

The State Decoded is a free program, released under the GNU Public License. Anybody can use it to make legal codes more accessible online. There are no strings attached.

It has already been deployed in two states, Virginia and Florida, despite not actually being a finished project yet.

State Variations
The striking variations in the structures of legal codes within the U.S. required the establishment of an appropriately flexible system to store and render those codes. Some legal codes are broad and shallow (e.g., Louisiana, Oklahoma), while others are narrow and deep (e.g., Connecticut, Delaware). Some list their sections by natural sort order, some in decimal, a few arbitrarily switch between the two. Many have quirks that will require further work to accommodate.

For example, California does not provide a catch line for their laws, but just a section number. One must read through a law to know what it actually does, rather than being able to glance at the title and get the general idea. Because this is a wildly impractical approach for a state code, the private sector has picked up the slack – Westlaw and LexisNexis each write their own titles for those laws, neatly solving the problem for those with the financial resources to pay for those companies’ offerings. To handle a problem like this, The State Decoded either needs to be able to display legal codes that lack section titles, or pointedly not support this inferior approach, and instead support the incorporation of third-party sources of title. In California, this might mean mining the section titles used internally by the California Law Revision Commission, and populating the section titles with those. (And then providing a bulk download of that data, allowing it to become a common standard for California’s section titles.)

Many state codes have oddities like this. The State Decoded combines flexibility with open source code to make it possible to deal with these quirks on a case-by-case basis. The alternative approach is too convoluted and quixotic to consider.

There is strong interest in seeing this software adapted to handle regulations, especially from cash-strapped state governments looking to modernize their regulatory delivery process. Although this process is still in an early stage, it looks like rather few modifications will be required to support the storage and display of regulations within The State Decoded.

More significant modifications would be needed to integrate registers of regulations, but the substantial public benefits that would provide make it an obvious and necessary enhancement. The present process required to identify the latest version of a regulation is the stuff of parody. To select a state at random, here are the instructions provided on Kansas’s website:

To find the latest version of a regulation online, a person should first check the table of contents in the most current Kansas Register, then the Index to Regulations in the most current Kansas Register, then the current K.A.R. Supplement, then the Kansas Administrative Regulations. If the regulation is found at any of these sequential steps, stop and consider that version the most recent.

If Kansas has electronic versions of all this data, it seems almost punitive not to put it all in one place, rather than forcing people to look in four places. It seems self-evident that the current Kansas Register, the Index to Regulations, the K.A.R. Supplement, and the Kansas Administrative Regulations should have APIs, with a common API atop all four, which would make it trivial to present somebody with the current version of a regulation with a single request. By indexing registers of regulations in the manner that The State Decoded indexes court opinions, it would at least be possible to show people all activity around a given regulation, if not simply show them the present version of it, since surely that is all that most people want.

A Tapestry of Data
In a way, what makes The State Decoded interesting is not anything that it actually does, but instead what others might do with the data that it emits. By capitalizing on the program’s API and healthy collection of bulk downloads, clever individuals will surely devise uses for state legal data that cannot presently be envisioned.

The structural value of state laws is evident when considered within the context of other open government data.

Major open government efforts are confined largely to the upper-right quadrant of this diagram – those matters concerned with elections and legislation. There is also some excellent work being done in opening up access to court rulings, indexing scholarly publications, and nascent work in indexing the official opinions of attorneys general. But the latter group cannot be connected to the former group without opening up access to state laws. Courts do not make rulings about bills, of course – it is laws with which they concern themselves. Law journals cite far more laws than they do bills. To weave a seamless tapestry of data that connects court decisions to state laws to legislation to election results to campaign contributions, it is necessary to have a source of rich data about state laws. The State Decoded aims to provide that data.

Next Steps
The most important next step for The State Decoded is to complete it, releasing a version 1.0 of the software. It has dozens of outstanding issues – both bug fixes and new features – so this process will require some months. In that period, the project will continue to work with individuals and organizations in states throughout the nation who are interested in deploying The State Decoded to help them get started.

Ideally, The State Decoded will be obviated by states providing both bulk data and better websites for their codes and regulations. But in the current economic climate, neither are likely to be prioritized within state budgets, so unfortunately there’s liable to remain a need for the data provided by The State Decoded for some years to come. The day when it is rendered useless will be a good day.

Waldo Jaquith is a website developer with the Miller Center at the University of Virginia in Charlottesville, Virginia. He is a News Challenge Fellow with the John S. and James L. Knight Foundation and runs Richmond Sunlight, an open legislative service for Virginia. Jaquith previously worked for the White House Office of Science and Technology Policy, for which he developed, and is now a member of the White House Open Data Working Group.
[Editor's Note: For topic-related VoxPopuLII posts please see: Ari Hershowitz & Grant Vergottini, Standardizing the World's Legal Information - One Hackathon At a Time; Courtney Minick, Universal Citation for State Codes; John Sheridan,; and Robb Schecter, The Recipe for Better Legal Information Services. ]

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

In 1494, Luca Pacioli published “Particularis de Computis et Scripturis,” which is widely regarded as the first written treatise on bookkeeping. In the 600+ years since that event, we have become completely accustomed to the concepts of ledgers, journals and double-entry bookkeeping. Like all profound ideas, the concept of a transaction ledger nowadays seems to be completely natural, as if always existed as some sort of natural law.

Whenever there is a need for detailed, defensible records of how a financial state of affairs (such as a company balance sheet or a profit and loss statement) came to be, we employ Pacioli’s concepts without even thinking about them any more. Of course you need ledgers of transactions as the raw material from which to derive the financial state of affairs at whatever chosen point in time is of interest. How else could you possibly do it?

Back in Pacioli’s day, there was nothing convenient about ledgers. After all, back then, all ledger entries had to be painstakingly made by hand into paper volumes. Care was needed to pair up the debits and credits. Ledger page totals and sub-totals had to checked and re-checked. Very labor intensive stuff. But then computers came along and lo! all the benefits of ledgers in terms of the rigorous audit trail could be enjoyed without all the hard labor.

Doubtless, somewhere along the line in the early days of the computerization of financial ledgers, it occurred to somebody that ledger entries need not be immutable. That is to say, there is no technical reason to carry forward the “limitation” that pen and ink imposes on ledger writers, that an entry – once made – cannot be changed without leaving marks on the page that evidence the change. Indeed, bookkeeping has long had the concept of a “contra-entry” to handle the immutability of pen and ink. For example, if a debit of a dollar is made to a ledger by mistake, then another ledger entry is made – this time a credit – for a dollar to counter-balance the mistake while preserving the completeness of the audit-trail.

Far from being a limitation of the paper-centric world, the concept of an “append-only” ledger turns out, in my opinion, to be the key to the trustworthiness and transparency of financial statements. Accountants and auditors can take different approaches to how information from the ledgers is grouped/treated, but the ledgers are the ledgers are the ledgers. Any doubt that the various summations accurately reflect the ledgers can readily be checked.

Now let us turn to the world of law. Well, law is so much more complicated! Laws are not simple little numerical values that fit nicely into transaction rows either in paper ledgers or in database tables. True, but does it follow that the many benefits of the ledger-centric approach cannot be enjoyed in our modern day digital world where we do not have the paper-centric ledger limitations of fixed size lines to fit our information into? Is the world of legal corpus management really so different from the world of financial accounting?

What happens if we look at, say, legal corpus management in a prototypical U.S. legislature, from the perspective of an accountant? What would an accountant see? Well, there is this asset called the statute. That is the “opening balance” inventory of the business in accounting parlance. There is a time concept called a Biennium which is an “accounting period”. All changes to the statute that happen in the accounting period are recorded in the form of bills. bills are basically accounting transactions. The bills are accumulated into a form of ledger typically known as Session Laws. At the end of the accounting period – the Biennium – changes to the statute are rolled forward from the Session Laws into the statute. In accounting parlance, this is the period-end accounting culminating in a new set of opening balances (statute), for the start of the next Biennium. At the start of the Biennium, all the ledger transactions are archived off and a fresh set of ledgers is created; that is, bill numbers/session law numbers are reset, the active Biennium name changes etc.

I could go on and on extending the analogy (chamber journals are analogous to board of directors meeting minutes; bill status reporting is analogous to management accounting, etc.) but you get the idea. Legal corpus management in a legislature can be conceptualized in accounting terms. Is it useful to do so? I would argue that it is incredibly useful to do so. Thanks to computerization, we do not have to limit the application of Luca Pacioli’s brilliant insight to things that fit neatly into little rows of boxes in paper ledgers. We can treat bills as transactions and record them architecturally as 21st century digital ledger transactions. We can manage statute as a “balance” to be carried forward to the next Biennium. We can treat engrossments of bills and statute alike as forms of trail balance generation and so on.

Now I am not for a moment suggesting that a digital legislative architecture be based on any existing accounting system. What I am saying is that the concepts that make up an accounting system can – and I would argue should – be used. A range of compelling benefits accrue from this. A tremendous amount of the back-office work that goes on in many legislatures can be traced back to work-in-progress (WIP) reporting and period-end accounting of what is happening with the legal corpus. Everything from tracking bill status to the engrossment of committee reports becomes significantly easier once all the transactions are recorded in legislative ledgers. The ledgers then becomes the master repository from which all reports are generated. The reduction in overall IT moving parts, reduction in human effort, reduction in latency and the increase in information consistency that can be achieved by doing this is striking.

For many hundreds of years we have had ledger-based accounting. For hundreds of years the courts have taken the view that, for example, a company cannot simply announce a Gross Revenue figure to tax officials or to investors, without having the transaction ledgers to back it up. Isn’t in interesting that we do not do the same for the legal corpus? We have all sorts of publishers in the legal world, from public bodies to private sector, who produce legislative outputs that we have to trust because we do not have any convenient form of access to the transaction ledgers. Somewhere along the line, we seem to have convinced ourselves that the level of rigorous audit trail routinely applied in financial accounting cannot be applied to law. This is simply not true.

We can and should fix that. The prize is great, the need is great and the time is now. The reason the time is now is that all around us, I see institutions that are ceasing to produce paper copies of critical legal materials in the interests of saving costs and streamlining workflows. I am all in favour of both of these goals, but I am concerned that many of the legal institutions going fully paperless today are doing so without implementing a ledger-based approach to legal corpus management. Without that, the paper versions of everything from registers to regulations to session laws to chamber journals to statute books – for all their flaws – are the nearest thing to an immutable set of ledgers that exist. Take away what little audit trail we have and replace it will a rolling corpus of born digital documents without a comprehensive audit trail of who changed what and when?…Not good.

Once an enterprise-level ledger-based approach is utilised, another great prize can be readily won; namely, the creation of a fully digital yet fully authenticated and authoritative corpus of law. To see why, let us step back into the shoes of the accountant for a moment. When computers came along and the financial paper ledgers were replaced with digital ledgers, the world of accounting did not find itself in a crisis concerning authenticity in the way the legal world has. Why so?

I would argue that the reason for this is that ledgers – Luca Pacioli’s great gift to the world – are the true source of authenticity for any artifact derived from the ledgers. Digital authenticity of balance sheets or Statute sections does not come from digital signatures or thumb-print readers or any of the modern high tech gadgetry of the IT security landscape. Authenticity come from knowing that what you are looking at was mechanically and deterministically derived from a set of ledgers and that those ledgers are available for inspection. What do financial auditors do for living? They check authenticity of financial statements. How do they do it? They do it by inspecting the ledgers. Why is authenticity of legal materials such a tough nut to crack? Because there are typically no ledgers!

From time to time we hear an outburst of emotion about the state of the legal corpus. From time to time we hear how some off-the-shelf widget will fix the problem. Technology absolutely holds the solutions, but it can only work, in my opinion, when the problem of legal corpus management is conceptualized as ledger-centric problem where we put manic focus on the audit trail. Then, and only then, can we put the legal corpus on a rigorous digital footing and move forward to a fully paperless world with confidence.

From time to time, we hear an outburst of enthusiasm to create standards for legal materials and solve our problems that way. I am all in favour of standards but we need to be smart about what we standardize. Finding common ground in the industry for representing legislative ledgers would be an excellent place to start, in my opinion.

Is this something that some standards body such as OASIS or NIEM might take on? I would hope so and hopeful that it will happen at some point. Part of why I am hopeful is that I see an increasing recognition of the value of ledger-based approaches in the broader world of GRC (Governance, Risk and Compliance). For too long now, the world of law has existed on the periphery of the information sciences. It can, and should be, an exemplar of how a critical piece of societal infrastructure has fully embraced what it means to be “born digital”. We have known conceptually how to do it since 1494. The technology all exists today to make it happen. A number of examples already exist in production use in legislatures in Europe and in the USA. What is needed now, is for the idea to spread like wildfire the same way that Pacioli’s ideas spread like wildfire into the world of finance all those years ago.

Perhaps some day, when the ledger-centric approach to legal corpus management had removed doubts about authenticity/reliability, we will look back and think digital law was always done with ledgers, just as today we think that accounting was always done that way.

Sean McGrath is co-founder and CTO of Propylon, based in Lawrence, Kansas. He has 30 years of experience in the IT industry, most of it in the legal and regulatory publishing space. He holds a first class honors degree in Computer Science from Trinity College Dublin and served as an invited expert to the W3C special interest group that created the XML standard in 1996. He is the author of three books on markup languages published by Prentice Hall in the Dr Charles F. Goldfarb Series on Open Information Management. He is a regular speaker at industry conferences and runs a technology-oriented blog at


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Ending up in legal informatics was probably more or less inevitable for me, as I wanted to study both law and electrical engineering from early on, and I just hoped that the combination would start making some sense sooner or later. ICT law (which I still pursue sporadically) emerged as an obvious choice, but AI and law seemed a better fit for my inner engineer.

dada fuzzy teacupThe topic for my (still ongoing-ish) doctoral project just sort of emerged. Reading through the books recommended by my master’s thesis supervisor (professor Peter Blume in Copenhagen) a sentence in Cecilia Magnusson Sjöberg‘s dissertation caught my eye: “According to Bench-Capon and Sergot, fuzzy logic is unsuitable for modelling vagueness in law.” (translation ar) Having had some previous experiences with fuzzy control, this seemed like an interesting question to study in more detail. To me, vagueness and uncertainty did indeed seem like good places to use fuzzy logic, even in the legal domain.

After going through loads of relevant literature, I started looking for an example domain to do some experiments. The result was MOSONG, a fairly simple model of trademark similarity that used Type-2 fuzzy logic to represent both vagueness and uncertainty at the same time. Testing MOSONG yielded perfect results on the first validation set as well, which to me seemed more suspicious than positive. If the user/coder could decide the cases correctly without the help of the system, would it not affect the coding process as well? As a consequence I also started testing the system on a non-expert population (undergraduates, of course), and the performance started to conform better to my expectations.

My original idea for the thesis was to look at different aspects of legal knowledge by building a few working prototypes like MOSONG and then explaining them in terms of established legal theory (the usual suspects, starting from Hart, Dworkin, and Ross). Testing MOSONG had, however, made me perhaps more attuned to the perspective of an extremely naive reasoner, certainly a closer match for an AI system than a trained professional. From this perspective I found conventional legal theory thoroughly lacking, and so I turned to the more general psychological literature on reasoning and decision-making. After all, there is considerable overlap between cognitive science and artificial intelligence as multidisciplinary ventures. Around this time, the planned title of my thesis also received a subtitle, thus becoming Fuzzy Systems and Legal Knowledge: Prolegomena to a Cognitive Theory of Law, and a planned monograph morphed into an article-based dissertation instead.

Illustration by David PlunkettOne particularly useful thing I found was the dual-process theory of cognition, on which I presented a paper at IVR-2011 just a couple of months before Daniel Kahneman’s Thinking, Fast and Slow came out and everyone started thinking they understood what System 1 and System 2 meant. In my opinion, the dual-process theory has important implications for AI and law, and also explains why it has struggled to create widespread systems of practical utility. Representing legal reasoning only in classically rational System 2 terms may be adequate for expert human reasoners (and simple prototype systems), but AI needs to represent the ecological rationality (as opposed to the cognitive biases) of System 1 as well, and to do this properly, different methods are needed, and on a different scale. Hello, Big Dada!

In practice this means that the ultimate way to properly test one’s theories of legal reasoning computationally is through a full-scale R&D process of an AI system that hopefully does something useful. In an academic setting, doing the R part is no problem, but the D part is a different matter altogether, both because much of the work required can be fairly routine and too uninteresting from a publication standpoint, and because the muchness itself makes the project incompatible with normal levels of research funding. Instead, typically, an interested external recipient is required in order to get adequate funding. A relevant problem domain and a base of critical test users should also follow as a part of the bargain.

In the case of legal technology, the judiciary and the public administration are obvious potential recipients. Unfortunately, there are at least two major obstacles for this. One is attitudinal, as exemplified by the recent case of a Swedish candidate judge whose career path was cut short after creating a more usable IR system for case law on his own initiative. The other one is structural, with public sector software procurement in general in a state of crisis due to both a limited understanding of how to successfully develop software systems that result in efficiency rather than frustration, and the constraints of procurement law and associated practices which make such projects almost impossible to carry out successfully even if the required will and know-how were there.

The private sector is of course the other alternative. With law firms, the prevailing business model based on hourly billing offers no financial incentives for technological innovation, as most notably pointed out by Richard Susskind, and the attitudinal problems may not be all that different. Legal publishers are generally not much better, either. And overall, in large companies the organizational culture is usually geared towards an optimal execution of plans from above, making it too rigid to properly foster innovation, and for established small companies the required investment and the associated financial risk are too great.

So what is the solution? To all early-stage legal informatics researchers out there: Find yourselves a start-up! Either start one yourself (with a few other people with complementary skillsets) or find an existing one that is already trying to do something where your skills and knowledge should come in handy, maybe just on a consultancy basis. In the US, there are already over a hundred start-ups in the legal technology field. The number of start-ups doing intelligent legal technology (and European start-ups in the legal field in general) is already much smaller, so it should not be too difficult to gain a considerable advantage over the competition with the right idea and a solid implementation. I myself am fortunate enough to have found a way to leverage all the work I have done on MOSONG by co-founding Onomatics earlier this year.

This is not to say that just any idea, even one that is good enough to be the foundation for a doctoral thesis, will make for a successful business. This is indeed a common pitfall with the commercialization of academic research in general. Just starting with an existing idea, a prototype or even a complete system and then trying to find problems it (as such) could solve is a proven way to failure. If all you have is a hammer, all your problems start to look like nails. This is also very much the case with more sophisticated tools. A better approach is to first find a market need and then start working towards a marketable technological solution for it, of course using all one’s existing knowledge and technology whenever applicable, but without being constrained by them, when other methods work better.

Testing one’s theories by seeing whether they can actually be used to solve real-world problems is the best way forward towards broader relevance for one’s own work. Doing so typically involves considerable amounts of work that is neither scientifically interesting nor economically justifiable in an academic context, but which all the same is necessary to see if things work as they should. Because of this, such real-world integration is more feasible when done on a commercial basis. In this lies a considerable risk for the findings of this type of applied research to remain entirely confidential and proprietary as trade secrets, rather than becoming published at least to some degree, thus fuelling future research also in the broader research community and not just the individual company. To avoid this, active cooperation between the industry and academia should be encouraged.

Anna Ronkainen is currently working as the Chief Scientist of Onomatics, Inc., a legal technology start-up of which she is a co-founder. Previously she has worked with language technology both commercially and academically for over fifteen years. She is a serial dropout with (somehow) a LL.M. from the University of Copenhagen, and she expects to defend her LL.D. thesis Fuzzy Systems and Legal Knowledge: Prolegomena to a Cognitive Theory of Law at the University of Helsinki during the 2013/14 academic year. She blogs at (with Anniina Huttunen) and


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

At my organization, the Sunlight Foundation, we follow the rules. I don’t just mean that we obey the law — we literally track the law from inception to enactment to enforcement. After all, we are a non-partisan advocacy group dedicated to increasing government transparency, so we have to do this if we mean to serve one of our main functions: creating and guarding good laws, and stopping or amending bad ones.

Freedom of InformationOne of the laws we work to protect is the Freedom of Information Act. Last year, after a Supreme Court ruling provided Congress with motivation to broaden the FOIA’s exemption clauses, we wanted to catch any attempts to do this as soon as they were made. As many reading this blog will know, one powerful way to watch for changes to existing law is to look for mentions of where that law has been codified in the United States Code. In the case of the FOIA, it’s placed at 5 U.S.C. § 552. So, what we wanted was a system that would automatically sift through the full text of all legislation, as soon as it was introduced or revised, and email us if such a citation appeared.

With modern web technology, and the fact that the Government Printing Office publishes nearly every bill in Congress in XML, this was actually a fairly straightforward thing to build internally. In fact, it was so straightforward that the next question felt obvious: why not do this for more kinds of information, and make it available freely to the public?

That’s why we built Scout, our search and notification system for government action. Scout searches the bills and speeches of Congress, and every federal regulation as they’re drafted and proposed. Through the awe-tacular power of our Open States project, Scout also tracks legislation as it emerges in statehouses all over the country. It offers simple and advanced search operators, and any search can be turned into an email alert or an RSS feed. If your search turns up a bill worth following, you can subscribe to bill-specific alerts, like when a vote on it is coming up.

This has practical applications for, really, just about everyone. If you care about an issue, be it as an environmental activist, a hunting enthusiast, a high (or low) powered lawyer, or a government affairs director for a company – finding needles in the giant haystack of government is a vital function. Since launching, Scout’s been used by thousands of people from a wide variety of backgrounds, by professionals and ordinary citizens alike.

Scout search for 5 USC 601Search and notifications are simple stuff, but simple can be powerful. Soon after Scout was operational, our original FOIA exemption alerts, keyed to mentions of 5 U.S.C. § 552, tipped us off to a proposal that any information a government passed to the Food and Drug Administration be given blanket immunity to FOIA if the passing government requested it.

If that sounds crazily broad, that’s because it is, and when we in turn passed this information onto the public interest groups who’d helped negotiate the legislation, they too were shocked. As is so often the case, the bill had been negotiated for 18 months behind closed doors, the provision was inserted immediately and anonymously before formal introduction, and was scheduled for a vote as soon as Senate processes would allow.

Because of Scout’s advance warning, there was just barely enough time to get the provision amended to something far narrower, through a unanimous floor vote hours before final passage. Without it, it’s entirely possible the provision would not have been noticed, much less changed.

This is the power of information; it’s why many newspapers, lobbying shops, law firms, and even government offices themselves pay good money for services like this. We believe everyone should have access to basic political intelligence, and are proud to offer something for free that levels the playing field even a little.

Of particular interest to the readers of this blog is that, since we understand the value of searching for legal citations, we’ve gone the extra mile to make US Code citation searches extra smart. If you search on Scout for a phrase that looks like a citation, such as “section 552 of title 5″, we’ll find and highlight that citation in any form, even if it’s worded differently or referencing a subsection (such as “5 U.S.C. 552(b)(3)”). If you’re curious about how we do this, check out our open source citation extraction engine – and feel free to help make it better!

It’s worth emphasizing that all of this is possible because of publicly available government information. In 2012, our legislative branch (particularly GPO and the House Clerk) and executive branch (particularly the Federal Register) provide a wealth of foundational information, and in open, machine-readable formats. Our code for processing it and making it available in Scout is all public and open source.

Anyone reading this blog is probably familiar with how easily legal information, even when ostensibly in the public domain, can be held back from public access. The judicial branch is particularly badly afflicted by this, where access to legal documents and data is dominated by an oligopoly of pay services both official (PACER) and private-sector (Westlaw, LexisNexis).

It’s easy to argue that legal information is arcane and boring to the everyday person, and that the only people who actually understand the law work at a place with the money to buy access to it. It’s also easy to see that as it stands now, this is a self-fulfilling prophecy. If this information is worth this much money, services that gate it amplify the political privilege and advantage that money brings.

The Sunlight Foundation stands for the idea that when government information is made public, no matter how arcane, it opens the door for that information to be made accessible and compelling to a broader swathe of our democracy than any one of us imagines. We hope that through Scout, and other projects like Open States and Capitol Words, we’re demonstrating a few important reasons to believe that.

Eric Mill


Eric Mill is a software developer and international program officer for the Sunlight Foundation. He works on a number of Sunlight’s applications and data services, including Scout and the Congress app for Android.


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

[Editor’s Note: For topic-related VoxPopuLII posts please see, among others: Nick Holmes, Accessible Law; Matt Baca & Olin Parker, Collaborative, Open Democracy with LexPop; and John Sheridan,

The growing usage of apps meant it was only a matter of time until they would find their way into legal education. Following up on a previously published article on LaaS – Law as a Service, this post discusses different ways that apps can be included into the law degree curriculum.

1 Changing legal education through the use of apps

There are different ways in which apps can be used in legal education in order to better prepare students for the legal profession. In this post we suggest three different possibilities for the usage of apps, reflecting different pedagogical styles and learning outcomes. What each of the suggestions has in common is to bring legal education closer to the real-life work of lawyers.

Through identifying aspects in which we perceive legal education as lacking quality or quantity, we apply and implement these to our suggestions for changed legal education. The aspects we view as lacking are: identifying and managing risks, the interaction between different areas of law, and proactive problem-based learning. To take each of these briefly in turn:

  • managing risks is something that practicing lawyers and other legal service professionals must do on a daily basis. Law is not only about applying legal rules but also about weighing options, estimating possible outcomes and deciding upon which risks to accept. Legal education has not traditionally included this in the curriculum, and students have arguably very little experience of such training in their studies.
  • interaction between different areas of law is often hard to incorporate in legal studies, which follow a block or module structure. Each course provides students with in-depth knowledge of that particular legal area. However, the interaction between such modules is lacking, with teachers often unaware of the content of preceding or succeeding courses. For students, a problem with this module structure can be that they forget the content of a course studied at an earlier stage in their education.
  • problem-based learning is generally encouraged and applied in legal education. However, most problem-based learning (PBL) is reactive, asking students to evaluate the legal consequences of a scenario that has already played out, instead of training students purely in after-the-fact solutions, in other words “clearing up the legal mess.” PBL should be made more proactive, aiming to train students in identifying and counteracting problems before they arise. This can also be viewed as an implementation of the first aspect, managing risks.

In aiming to include these aspects in legal education, we view technology as playing an important role. Perhaps ideally, the whole legal education could be re-structured in order to include such practical aspects that reflect the current legal profession; however, such change is perhaps too complex and viewed as somewhat unnecessary by those who are able to make such changes – if it ain’t broke, don’t fix it, as the saying goes. An app is not necessarily the sole possible implementation method, but it serves as an example of how these aspects can relatively easily be brought within legal education.

The first teaching approach looks at legal aspects of apps themselves, where the apps are viewed as objects within law. Here students are provided with a set problem and are encouraged to consider how different areas of law may apply to the app in question, and how the various areas impact on each other. This approach implements both PBL and the interaction of different legal fields. Proactiveness may also be included by asking students to identify legal risks of the app, and how such risks could by reduced through the use of law.

The second approach brings together technology and law, and is as such a suitable suggestion for inclusion in a legal informatics course, or as part of more general jurisprudence. Students are given the task of developing a legal service app, and thus must implement law through a technological tool. Students must first identify a need for a service within an area of law of their choosing, and then develop an app which provides the service. This approach implements both PBL and proactiveness, and can also require students to consider both legal and technical risks.

The third approach aims to add value to the legal education as a whole, by making available an app to students to be used alongside teaching, complementing the existing education. Students are provided with the opportunity to test their knowledge, and combine different areas of study through interactive learning. Depending on the design of the app, this approach has the possibility of implementing all aspects: PBL, interaction of legal fields, and proactiveness or risk management.

2 Legal aspects of apps

Legal education in many countries around the world is set up as linear blocks of different legal fields and subject areas. As law is often divided into various sub-fields–such as private law, public law, administrative law, environmental law, or information technology law–it appears only natural to discuss and teach the subjects one by one. The amount of material to be learned by the student would otherwise be overwhelming. While in some countries, exams might encompass multiple fields of law, subjects are being taught in a consecutive order.

Though the pedagogical reasons for the linearity in legal education are convincing, some improvements are still possible. One idea that we would like to discuss here are legal aspects of apps that intertwine different legal fields and challenge the students to analyze one particular phenomena from various different legal angles. We are not suggesting any particular fora for this exercise; these might stretch from traditional in-class seminars to online e-learning platforms to a mixture of the two and be included in law school curricula either as compulsory or selective modules.

Apps and information communication technologies, in general, do not adhere to geographical, physical or time related boundaries. They inherently challenge the traditional legal system based on bricks and mortar. In this regard they are, therefore, well-suited for legal analysis.

Another reason to use apps as the object for analysis by students is their popularity among the younger (and older) generation and therefore the close relationship students have to them to start with. As an example, one can compare it to using Facebook when discussing privacy, as opposed to showing a large company’s employee database.

In order to reflect the real-life experience of the exercise even more, the students would be allocated a certain expert area. As at law firms, one student would be an expert in intellectual property rights, another in contract law, another in privacy, international law, consumer rights issues, etc.

The students would –from the perspective of their expert area–firstly investigate possible legal issues with a specific gaming app, for example. They would analyze the application of the rules and norms within their field and identify potential conflicts or loopholes within these rules. Their investigation would include testing the app itself, as well as looking at possible end-user agreements and other applicable contractual agreements between the user, the app store and the developer of the app.

The next step would be to identify and discuss possible overlaps, discrepancies and conflicts between the different areas of law in relation to the app. The exercise should result in a written and/or oral report of the different legal issues involved and solutions to potential conflicts between the law and the app.

Adding another layer of real-life scenario, each group could be asked to present their findings to an imaginative client who is the producer of the app. This simulation would allow the students not only to develop a legal analysis based on correlating fields of law but also to present the analysis to non-lawyers, translating legal jargon into understandable everyday language.

The exercise–analyzing an existing app–very much fits into the idea often conveyed in legal education that law is applied after an incident occurs. In order to add a level of proactivity, students could be asked to analyze an app under production, before it is launched. This would guarantee more proactive thinking by the students asking them to foresee potential conflicts and avoid them, rather than discussing legal issues after they have arisen.

While the exercise as such might not be a revolutionary idea, we think that the increased inclusion of such exercises in legal education would contribute to better preparation of students for their life as young lawyers.

3 Law’s implementation in apps

While the previous exercise fits well within the traditional legal education by asking students to deliver a legal analysis, a topic less discussed in undergraduate legal studies is how to employ technology for delivering law. With a few exceptions, students generally focus on analyzing the law rather than implementing law in technology.Change Priorities

Until several years ago legal analysis was the main business for lawyers, so legal education well reflected the profession. In the last few years, however, legal services delivered via and as technology have increased and opened up a new market for lawyers and legal professionals. This change should be reflected in legal education in order to prepare students for their future.

While the idea is not to replace lawyers with apps or software, an app or another technology could either help lawyers in their working tasks or deliver law as a valuable service for consumers, citizens, companies or organizations. Examples of such apps, both for lawyers and end-users, are mentioned in a post at iinek’s blog and Slaw; shorter lists can be found on iinek’s Delicious page and the iPad4lawyers blog.

In the exercise, students would look at law from a different perspective, i.e. how legal regulations affect the individual or organization. Going away from a linear text approach, students would have to translate law into a format that users or apps can read. In other words, law would have to suit the user/app, and not the other way around. Students would, therefore, have to go beyond text and translate rules into flowcharts, diagrams, mind maps and other visual tools in order for the app to be able to follow the law’s instructions.

Implementing legal rules into technology, therefore, not only encourages students to think proactively but it also motivates them to identify solutions for the application of the law and how rules could be transformed into practice. From a pedagogical point of view the exercise would allow the students to think about different aspects of law beyond the traditional case or contract. It would also encourage a wider viewpoint of law as a tool in society.

Again, how the exercise is included in the curriculum is a matter of taste. Technical assistance is of importance, in order for students to know what aspects to take into account and what schematics developers need in order to be able to create an app. The exercise could be set up as a competition (Georgetown Law SchoolIron Tech Lawyer) with an expert jury consisting of practicing lawyers and developers.

4 Legal education as an app

Talking about legal education as an app can have different meanings. While legal apps (for lawyers and individuals) and educational apps are rather common these days, legal educational apps are not so developed, yet.Puzzle

Legal education, as mentioned, is traditionally taught in blocks or modules, with very few references and links between them. This setup clearly has its benefits, not least logistically. There are clear arguments in favor of such an approach; planning and studying becomes easier for teachers and students alike, time limitations mean that implementing an approach that makes connections between each subject is hard. This is where we believe that technology has the potential to play an important role. Technology is not bound to physical classrooms and attendance requirements of students or teachers. It has the ability to be accessed at a time of the student’s choosing, without placing additional demands on instructors.

A legal education app could provide the key in aiding students to make connections between their study areas; it could be made to fit alongside a law degree, assuming a student’s knowledge in sync with their level of study, by including content from both current and past courses. The app would offer an easy way to implement an interactive, problem-based learning approach. It could provide additional content, quizzes, exercises, social media functions etc. complementing the education and enabling a holistic perspective.

Although no teacher-student relationship is required here, clearly pedagogical thinking would need to play a strong role so that a worthwhile learning environment for the individual could be created. Much time and effort would need to be invested in planning, and the application itself would need to be flexible to adjust to different study plans and so forth. Another issue is, of course, who would make the app. As curricula vary from law school to law school, and jurisdiction to jurisdiction, such an app is ideally built by those who know the curriculum. Such “in-house” expertise also means that potential bias from outside factors should be avoided.

Legal apps have already been introduced to help lawyers study for qualifying exams, e.g. BarMax. (These are often, however, still very topic-specific.) Implementing the same kind of thinking at the educational level would start to prepare students for their future workplace, allowing them to be better prepared for helping clients with real-world scenarios dealing with complex and interrelating legal issues. If students begin such thinking at the beginning of their legal studies, it becomes normal, arguably allowing for better educated graduates.

This last approach is perhaps a little future-oriented (although not as much as, for example, grading by technology), and it is of course not easy to implement at the university level; academics must work together with app developers to produce a tool of real value to students. However, even a slimmed-down version of such an app can be a tool for helping students prepare for exams, test their knowledge of legal areas, or simply make sure that they have understood concepts covered in teaching. Some examples of such implementations in legal education are shown here.

5 Conclusions

There is no doubt that apps are the future for legal services. To what extent they will be included in legal education is yet to be decided. Here we have shown three differing approaches that could help in this regard. Implementation of any or all of these would bring in aspects that are currently lacking in legal education.

Rather often discussions on technology and legal education focus on e-learning and online teaching environments. In our opinion, traditional offline exercises and their pedagogical value should not be underestimated, with technology offering an excellent platform as an object, tool or companion during legal education and life as a lawyer.

6 Sources

Christine KirchbergerChristine Kirchberger is a doctoral candidate & lecturer in legal informatics at the Swedish Law and Informatics Research Institute (IRI). Her research focuses on legal information retrieval, the concept of legal information within the framework of the doctrine of legal sources and also examines the information-seeking behaviour of lawyers. Christine blogs at and can be found as @iinek on Twitter.

Pam StorrPam Storr is a lecturer at the Swedish Law and Informatics Research Institute (IRI), and course director for the Master Programme in Law and Information Technology at Stockholm University. Her main areas of interest are within information technology and intellectual property law. Pam is the editor for IRI’s blog, Blawblaw, and can be found as @pamstorr on Twitter.


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.

When I started to write a post about THOMAS and its place in open government about three months ago, I was feeling apologetic. I was going make a heavy-handed, literal comparison of the opening hours of the Law Library of Congress (one of the maintainers of THOMAS, and my former employer, whose views are not at all represented here) with open government. I planned to wax sympathetic on the history of THOMAS, and how little has changed since it was first built. But, that post would not have added anything new to the #opengovdata conversation, or really mentioned data at all.

Just over one month ago, #freeTHOMAS reached a fever pitch surrounding the passage of H.R. 5882, the Legislative Branch Appropriations Act for FY2013. Just before passage, H.Rpt.511 directed official conversation on open legislative data for the coming fiscal year by saying, “let’s talk.” In a section about the Government Printing Office, the House Appropriations Committee expressed concerns about authentication and open legislative data, but called for a task force “composed of staff representatives of the Library of Congress, the Congressional Research Service, the Clerk of the House, the Government Printing Office, and such other congressional offices as may be necessary,” to look in to the matter.

Opengovdata was disappointed in government. The tone of the House Report suggested that government had been dismissive of opengovdata–and in all fairness, others were beginning to be dismissive of opengovdata as well.

But, a clear classification problem was emerging. Inspired by Lawrence Lessig’s Freedom To Connect speech at the AFI in late May, I had a very librarian moment on organizational hierarchies.

#OpenGov is a really big tent.

People who want a more open environment for government to communicate with the governed want data, increased transparency, plain language legislation, open court filings, access to government funded research, silly walks, etc. Accountability through transparency often dominates the conversation, thanks to well-funded non-profits and high profile projects. However, there’s more to the movement. In that spirit…

Let’s be transparent about transparency.

When the goal of an open government project is legislative transparency through freely accessible data, let’s focus on that. When the goal is something else, let’s focus on that too. We hear about government accountability through data because the voices calling for it are loud. But data can do much more than bring about a more transparent lawmaking process.

In the words of a wise man, make transparency your Number Two.

If you haven’t had the chance to watch this aforementioned Freedom To Connect talk, and you’ve got half an hour to spare, I highly recommend it. The subject is community broadband, but it’s hard not to be inspired to frame other issues smarter, with transparency ever-present in the background.

Let’s focus on Number One.

If the THOMAS data, for example, were open right now, this instant, you couldn’t watch it on TV. You couldn’t read it on your Kindle. It’s mere presence would not increase transparency. Someone would have to do something with it.  Number One is the thing you do with the data to reach your own goal–and that goal might not be legislative transparency.

As a public law librarian to a broad constituency, my goals are different than those of a non-profit think tank, or a law firm, or a law school, or even a non-law public library. In a climate of doing more with less, of needing to show much return for little investment, we each have to frame specific, measurable, achievable Number Ones tailored to the needs of our institutions. Without these Number Ones (goals, mission statements, benchmarks, or whatever management word your organization uses), we flounder off-mission, lose focus, and potentially lose funding.

Librarians are foot soldiers for the First Amendment–we like open information, we place a high value on the freedom to know. However, we’re among the first to be cut in tight budget situations, and we’re all too familiar with the perils of asking for something that’s overly broad, or asking for something that you can’t show narrowly tailored value for later on.

With respect to open gov data: government accountability is not unimportant to me as a voter. However, as a law librarian, I need to focus on Number Ones with more specific, smaller-scale goals than transparency, that will create measurable outcomes, allowing me to show concrete value to my institution. The big picture of how information is available, and the relationship between the government and the governed is important, but it doesn’t always get you funding, and it can’t always answer the question of the patron in front of you.

 What’s your Number One?
There’s plenty of data out there. What are you doing with it? How can you manipulate raw free resources into something good for your institution? There is much to be said for information for the sake of information. I can’t imagine needing to convince most library-types of that. That said, we library-types, we information professionals, we decision makers, and perhaps we citizens need to narrow open gov to make it work for us. Data is good, but a real-time interactive civics education program based on THOMAS data for K-12 students is better. Let transparency folks fight the good fight, and don’t forget their work. But while you’ve got your librarian hat on, focus on a Number One that works for you.

Meg Lulofs is an information professional at large, blogging at, editing Pimsleur’s Checklists of Basic American Legal Publications, and making mischief. She earned a J.D. from the University of Baltimore, and a M.L.I.S. from Catholic University. She welcomes feedback at You can follow her on Twitter @librarylulu, or on Facebook at


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.

[Editor's Note] For topic-related VoxPopuLII posts please see: Nick Holmes, Accessible Law; John Sheridan,; David Moore, Researching U.S. State Legislation.