AT4AM – Authoring Tool for Amendments – is a web editor provided to Members of the European Parliament (MEPs) that has greatly improved the drafting of amendments at the European Parliament since its introduction in 2010.

The tool, developed by the Directorate-General for Innovation and Technological Support of the European Parliament (DG ITEC), has replaced a system based on a collection of macros developed in MS Word and specific ad hoc templates.

Why move to a web editor?

The need to replace a traditional desktop authoring tool came from the increasing complexity of layout rules combined with a need to automate several processes of the authoring/checking/translation/distribution chain.

In fact, drafters not only faced complex rules and had to search among hundreds of templates in order to get the right one, but the drafting chain for all amendments relied on layout to transmit information down the different processes. Bold / Italic notation or specific tags were used to transmit specific information on the meaning of the text between the services in charge of subsequent revision and translation.

Over the years, an editor that was initially conceived mainly to support the printing of documents was often used to convey information in an unsuitable manner. During the drafting activity, documents transmitted between different services included a mix of content and layout, where the layout sometimes carried information about the business process that should rather have been transmitted by other means.

Moreover, encapsulating in one single file all the amendments drafted in 23 languages was a severe limitation for subsequent revisions and translations carried out by linguistic sectors. Experts in charge of legal and linguistic revision of drafted amendments, who need to work in parallel on one document grouping multilingual amendments, were severely hampered in their work.

All the needs listed above justified the EP undertaking a new project to improve the drafting of amendments. The concept was soon extended to the drafting, revision, translation and distribution of the entire legislative content in the European Parliament, and after some months the eParliament Programme was initiated to cover all projects of the parliamentary XML-based drafting chain.

It was clear from the beginning that, in order to provide an advanced web editor, the original proposal to be amended had to be converted into a structured format. After an extensive search, the Akoma Ntoso XML format was chosen, because it is the format that best covers the requirements for drafting legislation. Currently it is possible to export amendments produced via AT4AM in Akoma Ntoso. It is planned to apply the Akoma Ntoso schema to the entire legislative chain within the eParliament Programme. This will enable the EP to publish legislative texts in an open data format.

What distinguishes the approach taken by the EP from that of other legislative actors who handle XML documents is the fact that the EP decided to use XML to feed the legislative chain rather than just converting existing documents into XML for distribution. This aspect is fundamental because requirements are much stricter when the result of XML conversion is used as the first step of the legislative chain. In fact, the proposal coming from the European Commission is first converted into XML and then loaded into AT4AM. Because the tool relies on the XML content, it is important to guarantee a valid structure and coherence between the language versions. The same articles, paragraphs, points and subpoints must appear at the correct position in all 23 language versions of the same text.
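The coherence requirement described above can be illustrated with a minimal Python sketch. It is not AT4AM's actual validation code: the element names, the `id` attributes and the three sample documents are simplified assumptions made for illustration, and real Akoma Ntoso documents are namespaced and far richer. The idea is to reduce each language version to its structural skeleton and compare it with a reference version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified language versions of the same proposal.
# Element and attribute names are illustrative, not the real schema.
VERSIONS = {
    "en": "<act><article id='a1'><paragraph id='a1-p1'><point id='a1-p1-a'/></paragraph></article></act>",
    "fr": "<act><article id='a1'><paragraph id='a1-p1'><point id='a1-p1-a'/></paragraph></article></act>",
    "de": "<act><article id='a1'><paragraph id='a1-p1'/></article></act>",
}


def skeleton(xml_text):
    """Return the document structure as an ordered list of (depth, tag, id)."""
    root = ET.fromstring(xml_text)
    items = []

    def walk(element, depth):
        items.append((depth, element.tag, element.get("id")))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return items


def incoherent_versions(versions, reference="en"):
    """List the languages whose structure differs from the reference version."""
    expected = skeleton(versions[reference])
    return sorted(
        lang for lang, text in versions.items()
        if skeleton(text) != expected
    )


print(incoherent_versions(VERSIONS))
```

In this sample the German version lacks the point that the other two versions contain, so it is the one reported. Comparing skeletons rather than text keeps the check independent of the language of the content.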

What is the situation now?

After two years of intensive usage, Members of the European Parliament have drafted 285,000 amendments via AT4AM. The tool is also used daily by the staff of the secretariat in charge of receiving tabled amendments, checking linguistic and legal accuracy and producing voting lists. Today more than 2,300 users access the system regularly, and no one wants to go back to the traditional methods of drafting. Why?

Because it is much simpler and faster to draft and manage amendments via an editor that takes care of everything, thus allowing drafters to concentrate on their essential activity: modifying the text.

Soon after the introduction of AT4AM, the secretariat’s staff who manage drafted amendments breathed a sigh of relief, because errors like wrong position references, which were the cause of major headaches, no longer occurred.

What is better than a tool that guides drafters through the amending activity by adding all the surrounding information and taking care of all the metadata necessary for subsequent treatment, while letting the drafter focus on the text amendments and produce well-formatted output with track changes?
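To give a feel for what automatic change marking involves, here is a minimal Python sketch using the standard `difflib` module. It is an illustration only, not the algorithm AT4AM uses: the function name `mark_changes` and the `<ins>`/`<del>` markers are assumptions, standing in for whatever notation the final output applies to added and deleted text.

```python
import difflib


def mark_changes(original, amended):
    """Return the amended text with word-level changes marked.

    Inserted words are wrapped in <ins>...</ins> and deleted words in
    <del>...</del>; these tags are placeholders for the notation that
    the output format would actually use.
    """
    old_words = original.split()
    new_words = amended.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(" ".join(old_words[i1:i2]))
        else:
            if i1 != i2:  # words removed from the original text
                parts.append("<del>" + " ".join(old_words[i1:i2]) + "</del>")
            if j1 != j2:  # words added by the amendment
                parts.append("<ins>" + " ".join(new_words[j1:j2]) + "</ins>")
    return " ".join(parts)


# Invented sample sentence, for illustration only.
marked = mark_changes(
    "Member States shall adopt the measures by 2015",
    "Member States shall adopt the necessary measures by 2016",
)
print(marked)
```

The drafter only edits the text; the marking is derived from the difference between the original and the amended wording, which is why it cannot drift out of sync with the amendment itself.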

After some months of usage, it was clear that not only had the time needed to draft, check and translate amendments been drastically reduced, but the quality of amendments had also increased.

The slogan that best describes the strength of this XML editor is: “You are always just two clicks away from tabling an amendment!”



Web editor versus desktop editor: is it an acceptable compromise?

One of the criticisms that users often raise against web editors is that they are limited when compared with a traditional desktop rich editor. The experience at the European Parliament has demonstrated that what users lose in terms of editing features is highly compensated by the gains of getting a tool specifically designed to support drafting activity. Moreover, recent technologies enable programmers to develop rich web WYSIWYG (What You See Is What You Get) editors that include many of the traditional features plus new functions specific to a “networking” tool.

What’s next?

The experience of the EP was so positive and so well received by other Parliaments that in May 2012, at the opening of the international workshop “Identifying benefits deriving from the adoption of XML-based chains for drafting legislation”, Vice President Wieland announced the launch of a new project aimed at providing an open source version of the AT4AM code.

On 19 March 2013, in a video conference with the United Nations Department for General Assembly and Conference Management from New York, the UN/DESA’s Africa i-Parliaments Action Plan from Nairobi and the Senate of Italy from Rome, Vice President Wieland announced the availability of AT4AM for All, which is the name given to this open source version, for any parliament and institution interested in taking advantage of this well-oiled IT tool that has made the life of MEPs much easier.

The code has been released under the EUPL (European Union Public Licence), an open source licence provided by the European Commission that is compatible with major open source licences like GNU GPLv2, with the advantage of being available in the 22 official languages of the European Union.

AT4AM for All is provided with all the important features of the amendment tool used in the European Parliament and can manage all types of legislative content provided in the XML format Akoma Ntoso. This XML standard, developed through the UN/DESA’s initiative Africa i-Parliaments Action Plan, is currently undergoing certification at OASIS, a non-profit consortium that drives the development, convergence and adoption of open standards for the global information society. Those who are interested may have a look at the committee in charge of the certification: LegalDocumentML

Currently the Documentation Division, Department for General Assembly and Conference Management of the United Nations, is evaluating the software for possible integration into its tools for managing UN resolutions.

The ambition of the EP is that other Parliaments with fewer resources may take advantage of this development to improve their legislative drafting chain. Moreover, the adoption of such tools allows a Parliament to move towards an XML-based legislative chain. The distribution of legislative content in open document formats like XML allows other parties to process the legislation produced efficiently.

Thanks to the efforts of the European Parliament, any parliament in the world is now able to use the advanced features of AT4AM to support the drafting of amendments. AT4AM will serve as a useful tool for all those interested in moving towards open data solutions and more democratic transparency in the legislative process.

On the AT4AM for All website it is possible to follow the status of the work and run a sample editor with several document types. Any interested Parliament can go to the repository and download the code.

Claudio Fabiani is Project Manager at the Directorate-General for Innovation and Technological Support of the European Parliament. After several years of experience in the private sector as an IT consultant, he started his career as a civil servant at the European Commission in 2001, where he has managed several IT developments. Since 2008 he has been responsible for the AT4AM project, and more recently he has managed the implementation of AT4AM for All, the open source version.



VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.


Maybe it’s a bit late for a summer reading list, or maybe you’re just now starting to pack for your vacation, deep in a Goodreads list that you don’t ever expect to dig your way out of. Well, let us add to your troubles with a handful of books your editors are currently enjoying.

A Clearing in the Forest: Law, Life, and Mind, by Steven L. Winter. A 2001 cognitive science argument for studying and developing law. Perhaps a little heavy for poolside, one of your editors finds it perfect for multi-day midwestern summer rainstorms, along with a pot of tea. Review by Lawrence Solan in the Brooklyn Law Review, as part of a symposium.

Digital Disconnect: How Capitalism is Turning the Internet Against Democracy, by Robert W. McChesney.

“In Digital Disconnect, Robert McChesney offers a groundbreaking critique of the Internet, urging us to reclaim the democratizing potential of the digital revolution while we still can.”

This is currently playing on my work commute.

The Cognitive Style of Power Point: Pitching Out Corrupts Within, by Edward Tufte. Worth re-reading every so often, especially heading into conference/teaching seasons.

Delete: The Virtue of Forgetting in a Digital Age, by VoxPopuLII contributor Viktor Mayer-Schonberger. Winner of the 2010 Marshall McLuhan Award for Outstanding Book in Media ecology, Media Ecology Association; Winner of the 2010 Don K. Price Award for Best Book in Science and Technology Politics, Section on Science, Technology, and Environmental Politics (STEP) by the American Political Science Association. Review at the Times Higher Education.

Piracy: The Intellectual Property Wars from Gutenberg to Gates, by Adrian Johns (2010). A historian’s view of Intellectual Property — or, this has all happened before. Reviews at the Washington Post and the Electronic Frontier Foundation. From the latter, “Radio arose in the shadow of a patent thicket, became the province of tinkers, and posed a puzzle for a government worried that “experimenters” would ruin things by mis-adjusting their sets and flooding the ether with howling oscillation. Many will immediately recognize the parallels to modern controversies about iPhone “jailbreaking,” user innovation, and the future of the Internet.”

The Master Switch: The Rise and Fall of Information Empires, by Tim Wu (2010). A history of communications technologies, and the cyclical (or not) trends of their openness, and a theory on the fate of the Internet. Nice reviews on Ars Technica and The Guardian.

Too Big to Know: Rethinking Knowledge Now That the Facts Aren’t the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room, by David Weinberger (author of the Cluetrain Manifesto). For more, check out this excerpt by Weinberger in The Atlantic and

You are not so smart, by David McRaney. Examines the myth of being intelligent — a very refreshing read for the summer. A review of the book can be found at Brainpickings, which by the way is an excellent blog and definitely worth a look.

On a rainy day you can always check out the BBC series “QI” with a new take on what we think we know but don’t know. Hosted by Stephen Fry. Comedians share their intelligence with witty humour and you will learn a thing or two along the way. The TV show has also led to a few books, e.g. Qi: the Book of General Ignorance (Q1), by John Lloyd


Sparing the cheesy beach reads, here’s a fiction set that you may find interesting.

The Ware Tetralogy: Ware #1-4 , by Rudy Rucker (currently $6.99 for the four-pack)

Rucker’s four Ware novels–Software (1982), Wetware (1988), Freeware (1997), and Realware (2000)–form an extraordinary cyberweird future history with the heft of an epic fantasy novel and the speed of a quantum processor. Still exuberantly fresh despite their age, they primarily follow two characters (and their descendants): Cobb Anderson, who instigated the first robot revolution and is offered immortality by his grateful “children,” and stoner Sta-Hi Mooney, who (against his impaired better judgment) becomes an important figure in robot-human relations. Over several generations, humans, robots, and society evolve, but even weird drugs and the wisdom gathered from interstellar signals won’t stop them from making the same old mistakes in new ways. Rucker is both witty and serious as he combines hard science and sociology with unrelentingly sharp observations of all self-replicating beings. — Publisher’s Weekly

Happy reading! We’ll return mid-August with a feature on AT4AM.


In March, Mike Lissner wrote for this blog about the troubling state of access to case law – noting with dismay that most of the US corpus is not publicly available. While a few states make official cases available, most still do not, and neither does the federal government. At Ravel Law we’re building a new legal research platform and, like Mike, we’ve spent substantial time troubleshooting access to law issues. Here, we will provide some more detail about how official case law is created and share our recommendations for making it more available and usable. We focus in particular on FDsys – the federal judiciary’s effort in this space – but the ideas apply broadly.

The Problem

If you ask a typical federal court clerk, such as our friend Rose, about the provenance of case opinions you will only learn half the story. Rose can tell you that after she and her judge finish an opinion it gets sent to a permanent court staffer. After that the story that Rose knows basically ends. The opinion at this stage is in its “slip” opinion state, and only some time later will Rose see the “official” version – which will have a citation number, copy edits, and perhaps other alterations. Yet, it is only this new “official” version that may be cited in court. For Mike Lissner, for Ravel, and for many others, the crux of the access challenge lies in steps beyond Rose’s domain, beyond the individual court’s in fact – when a slip becomes an official opinion.

For years the federal government has outsourced the creation of official opinions, relying on Westlaw and Lexis to create and publish them. These publishers are handed slip opinions by court staff, provide some editing, assign citations and release official versions through their systems. As a result, access to case law has been de facto privatized, and restricted.


Of late, however, courts are making some strides to change the nature of this system. The federal judiciary’s primary effort in this regard is FDsys (and also see the 9th Circuit’s recent moves). But FDsys’s present course gives reason to worry that its goals have been too narrowly conceived to achieve serious benefit. This discourages the program’s natural supporters and endangers its chances of success.

We certainly count ourselves amongst FDsys’s strongest supporters, and we applaud the Judicial Conference for its quick work so far. And, as friends of the program, we want to offer feedback about how it might address the substantial skepticism it faces from those in the legal community who want the program to succeed but fear for its ultimate success and usability.

Our understanding is that FDsys’s primary goal is to provide free public access to court opinions. Its strategy for doing so (as inexpensively and as seamlessly as possible) seems to be to fully implement the platform at all federal courts before adding more functionality. This last point is especially critical. Because FDsys only offers slip opinions, which can’t be cited in court, its current usefulness for legal professionals is quite limited; even if every court used FDsys it would only be of marginal value. As a result, the legal community lacks incentive to lend its full, powerful, support to the effort. This support would be valuable in getting courts to adopt the system and in providing technology that could further reduce costs and help to overcome implementation hurdles.

Setting Achievable Goals

We believe that there are several key goals FDsys can accomplish, and that by doing so it will win meaningful support from the legal community and increase its end value and usage. With loftier goals (some modest, others ambitious), FDsys would truly become a world-class opinion publishing system. The following are the goals we suggest, along with metrics that could be used to assess them.



1. Comprehensive Access to Opinions
  - Does every federal court release every published and unpublished opinion?
  - Are the electronic records comprehensive in their historic reach?
2. Opinions that can be Cited in Court
  - Are the official versions of cases provided, not just the slip opinions?
  - And/or, can the version released by FDsys be cited in court?
3. Vendor-Neutral Citations
  - Are the opinions provided with a vendor-neutral citation (using, e.g., paragraph numbers)?
4. Opinions in File Formats that Enable Innovation
  - Are opinions provided in both human and machine-readable formats?
5. Opinions Marked with Meta-Data
  - Is a machine-readable language such as XML used to tag information like case date, title, citation, etc?
  - Is additional markup of information such as sectional breaks, concurrences, etc. provided?
6. Bulk Access to Opinions
  - Are cases accessible via bulk access methods such as FTP or an API?
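As a rough illustration of goals 3 to 5, the following Python sketch builds a small XML record for one opinion, with tagged metadata and numbered paragraphs, and then reads fields back from it. Everything in it is an assumption made for illustration: the element names, the `build_opinion` helper, the sample case and the citation format are invented and do not reflect any schema used by FDsys or the courts.

```python
import xml.etree.ElementTree as ET


def build_opinion(court, date, title, citation, paragraphs):
    """Build an XML record for one opinion with metadata and numbered paragraphs."""
    opinion = ET.Element("opinion")
    meta = ET.SubElement(opinion, "meta")
    for name, value in (("court", court), ("date", date),
                        ("title", title), ("citation", citation)):
        ET.SubElement(meta, name).text = value
    body = ET.SubElement(opinion, "body")
    for number, text in enumerate(paragraphs, start=1):
        # Paragraph numbers allow pinpoint citation without vendor page numbers.
        para = ET.SubElement(body, "para", n=str(number))
        para.text = text
    return opinion


# Invented sample data, for illustration only.
record = build_opinion(
    court="Example Court of Appeals",
    date="2013-05-01",
    title="Doe v. Roe",
    citation="2013 EX 42",
    paragraphs=["First paragraph of the opinion.", "Second paragraph."],
)

xml_text = ET.tostring(record, encoding="unicode")

# Any consumer can read the fields back without parsing a PDF.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("meta/citation"), parsed.find("body/para[@n='2']").text)
```

Once opinions exist in a form like this, a citation to a specific paragraph stays stable regardless of who republishes the text, and downstream tools can extract dates, titles and citations with a few lines of code.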


The first three goals are the basic building blocks necessary to achieve meaningful open-access to the law. As Professor Martin of Cornell Law and others have chronicled, the open-access community has converged around these goals in recent years, and several states (such as Oklahoma) have successfully implemented them with very positive results.

Goals 3-6 involve the electronic format and storage medium used, and are steps that would be low-cost enablers of massive innovation. If one intention of the FDsys project is to support the development of new legal technologies, the data should be made accessible in ways that allow efficient computer processing. Word documents and PDFs do not accomplish this. PDFs, for example, are a fine format for archival storage and human reading, but computers don’t easily read them and converting PDFs into more usable forms is expensive and imperfect.

In contrast, publishing cases at the outset in a machine-readable format is easy and comes at virtually no additional cost. It can be done in addition to publishing in PDF. Courts and the GPO already have electronic versions of cases and with a few mouse clicks could store them in a format that would inspire innovation rather than hamper it. The legal technology community stands ready to assist with advice and development work on all of these issues.

We believe that FDsys is a commendable step toward comprehensive public access to law, and toward enabling innovation in the legal space. Left to its current trajectory, however, it is certain to fall short of its potential. With some changes now, the program could be a home run for the entire legal community, ensuring that clerks like Rose can rest assured that the law as interpreted by her judge is accessible to everyone.


Daniel Lewis and Nik Reed are graduates of Stanford Law School and the co-founders of Ravel Law, a legal search, analytics, and collaboration platform. In 2012, Ravel spun out of a Stanford University Law School, Computer Science Department, and Design School collaborative research effort focused on legal citation networks and information design. The Ravel team includes software engineers and data scientists from Stanford, MIT, and Georgia Tech. You can follow them on Twitter @ravellaw

For decades, words have been lawyers’ tools of trade. Today, we should no longer let tradition force us to think inside the text-only box. Apart from words, there are other means available.

It is no longer enough (if it ever was) to offer more information or to enhance access alone: the real challenge is the understandability of the content. We might have access to information, but still be unable to decode it or realize its importance. It is already painfully clear that the general public does not understand legalese, and that communication is becoming more and more visual and rapid. There is a growing literature about style and typography for legal documents and contracts, yet the use of visual and non-textual elements has been so far omitted for the most part. Perhaps images do not seem “official”, “legal”, or trustworthy enough for all.

Last year, in Sean McGrath’s post on Digital Law, we were alerted to what lawyers need to learn from accountants. In this post, we present another profession as our role model, one with a considerably shorter history than that of accountants: information designers.

Focus on users and good communication

Lawyers are communication professionals, even though we do not tend to think about ourselves in these terms. Most of us give advice and produce content and documents to deliver a specific message. In many cases a document — such as a piece of legislation or a contract — in itself is not the goal; its successful implementation is. Implementation, in turn, means adoption and action, often a change of behavior, on the part of the intended individuals and organizations.

Law school does not teach us how to enhance the effectiveness of our message. While many lawyers are known to be good communicators, most have had to learn the hard way. It is easy to forget that our colleagues, members of the legal community, are not the only users of our work. When it comes to other users of our content and documents, we can benefit from starting to think about 1) who these users are, 2) what they want or need to know, 3) what they want to achieve, 4) in which situation, and 5) how we can make our content and documents as clear, engaging and accessible as possible.

These questions are deeply rooted in the discipline of information design. The work of information designers is about organizing and displaying information in a way that maximizes its clarity and understandability. It focuses on the needs of the users and the context in which they need to find and apply information. When the content is complex, readers need to grasp both the big picture and the details and often switch between these two views. This is where visualization — here understood as adding graphs, icons, tables, charts and images to supplement text — enters the picture. Visualization can help in navigating text, opening up its meaning and reinforcing its message, even in the field of law. And information design is not about visualization only: it is also about many other useful things such as language, readability, typography, layout, color coding, and white space.

Want to see examples? Look no further!


Figure 1: Excerpt from Vendor Power! – a visual guide to the rights and duties for street vendors in New York City. © 2009 The Center for Urban Pedagogy.

A convincing example of visualizing legal rules is “Vendor Power!”, a work carried out by a collaboration of the Center for Urban Pedagogy, the designer Candy Chang, and the advocacy organization the Street Vendor Project. After noting that the “rulebook [of legal code] is intimidating and hard to understand by anyone, let alone someone whose first language isn’t English”, the project prepared Vendor Power!, a visual Street Vendor Guide that makes city regulations accessible and understandable (Figure 1). The Guide presents key information using short sentences in five languages along with diagrams illustrating vendors’ rights and the rules that are most commonly violated.

In the UK, the TDL London team turned recent changes in the rules related to obtaining a UK motorcycle licence into an interactive diagram that helps its viewers understand which motorcycles they are entitled to ride and how to go about obtaining a motorcycle licence. In Canada in 2000, recognizing the need for new ways to improve public access to the law, the Government commissioned a White Paper proposing a new format for legislation. The author, communication designer David Berman, introduced graphic design methods and the concept of using diagrams to help describe laws. While creating a flowchart diagram, Berman’s team revealed inconsistencies not accounted for in the legislation, suggesting that if visualization was used in the drafting process, the resulting legislation could be improved. One of the authors (the designer) can confirm this “logical auditing” power of visualization, as similar information gaps were promptly revealed by visualizing through flowcharts the Finnish General Terms of Public Procurement in Service Contracts, during the PRO2ACT research project.

Not only have designers applied their talent to legal information; some lawyers, like Susanne Hoogwater of Legal Visuals and Olivia Zarcate of Imagidroit, and future lawyers, like Margaret Hagan of Open Law Lab, have turned into designers themselves, with some remarkable results that you can find on their websites.

Legal visualization may deal with data, information, or knowledge. While the former two require software tools and coding expertise in order to generate images that represent complex data structures (an example is the work of Oliver Bieh-Zimmert, who visualized the network of paragraphs and the structure of the German Civil Code), knowledge visualization tends to use a more ‘handcrafted’ approach, similar to how graphic designers rather than programmers work. The authors of this post have relied on the latter when enhancing contract usability and user experience through visualization, utilizing simple yet effective visualizations such as “metro maps” (Figure 2), timelines, flowcharts, icons and graphs. More examples of the work, carried out in the FIMECC research program User Experience & Usability in Complex Systems (UXUS), are available here, while our most recent paper, Transforming Contracts from Legal Rules to User-centered Communication Tools, published in 2013 Volume I Issue III of Communication Design Quarterly Review, discusses how greatly visualization can contribute to the user-centeredness of contracts.

Figure 2. Example of a “metro map” that explains the process of availability testing, as described in an agreement on the purchase of industrial machinery and equipment. © 2012 Aalto University. Author: Stefania Passera.

When teaching cross-border contract law to business managers and students, one of the authors (the lawyer) has also experimented with graphic facilitation and real-time visualization, with the aim of curing contract phobia, changing attitudes, and making contracts’ invisible (implied) terms visible. Examples of images by Annika Varjonen of Visual Impact are available here and, dating back from 1997, here.

The Wolfram Demonstrations Project illustrates a library of visual and interactive demonstrations, including one contributed by Seth Chandler on the Battle of Forms that describes the not-uncommon situation where one company makes an offer using a pre-printed form containing its standard terms, and the other party responds with its own form and set of standard terms. The demonstration allows users to choose various details of the case, with the output showing the most likely finding as to whether a contract exists and the terms of that contract, together with the arguments that can be advanced in support of that finding.

In the digital world, Creative Commons licenses use simple, recognizable icons which can be clicked on to reveal a plain-language version of the relevant text. If additional information is required, the full text is also available and just one click away. The information is layered: there is what the authors call the traditional Legal Code (the “lawyer readable” version), the Commons Deed (the “human readable” version, acting as a user-friendly interface to the Legal Code), and the “machine readable” version of the license. A compilation made by Pär Lannerö in the context of the Common Terms project reveals a number of projects that have looked into the simplification of online terms, conditions and policies. An experiment involving icons was carried out by Aza Raskin for Mozilla. The set of Privacy Icons developed by Raskin can be used by websites to clarify the ways in which users of the website are agreeing to allow their personal data to be used (Figure 3).

Figure 3. Examples of icons used for the rapid communication of complex content on the Web: Mozilla Privacy Icons by Aza Raskin. Source: Image released under a Creative Commons licence CC BY-NC 2.0

In Australia, Michael Curtotti and Eric McCreath have worked on enhancing the online visualization of legislation, and work is currently in progress on the development of software-based tools for reading and writing law. This work has grown out of experience in contract drafting and the drafters’ needs for practical software tools. Already in 2001, in their ACCA Docket article Doing deals with flowcharts, Henry W. (Hank) Jones and Michael Oswald recognized this need and discussed the technology tools available to help lawyers and others to use flowcharts to clarify contractual information. They showed examples of how the logic of contract structure, the actors involved, and clauses such as contract duration and indemnification can be visualized, as well as explaining why this should be done.

In the United States, the State Decoded (State codes, for humans) is a platform that develops new ways to display state codes, court decisions, and information from legislative tracking services. With typography, embedded definitions of legal terms and other means, this project aims to make the law more easily understandable. The first two state sites, Virginia and Florida, are currently being tested.

Recently, visual elements have even made their way into court decisions: In Sweden, a 2009 judgment of the Court of Appeal for Western Sweden includes two timeline images showing the chain of events that is crucial to understanding the facts of the case. This judgment won the Plain Swedish Crystal 2010, a plain language award. In the United States, an Opinion by Judge Richard Posner of the Chicago-based 7th U.S. Circuit Court of Appeals uses the ostrich metaphor to criticize lawyers who ignore court precedent. Two photos are included in this opinion: one of an ostrich with its head buried in the sand, another of a man in a suit with his head buried in the sand.

Want to learn more and explore? Read this – or join one of our Design Jams!

In Central Europe, the visualization of legal information has developed into a research field in its own right. In German-speaking countries, the terms legal visualization (Rechtsvisualisierung), visual legal communication, visual law and multisensory law have been used to describe this growing field of research and practice. The pioneer, Colette R. Brunschwig, defended her doctoral thesis on the topic in 2001, and has since published widely on related topics. She is the leader of the Multisensory Law & Visual Law Community at beck-community.

In his doctoral research related to legal risks in the context of contracts at the Faculty of Law in the University of Oslo, Tobias Mahler used icons and diagrams illustrating legal risk and developed a graphical modeling language. In a case study he conducted, a group of lawyers, managers, and engineers were asked to use the method to analyze the risks connected with a contract proposal. The results showed that the diagrams were perceived as very helpful in communicating risk.

At the Nordic Conference of Law and IT, “Internationalisation of law in the digital information society” in Stockholm in November 2012, visualization of law was one of the three main topics. The proceedings, which include visual law related papers by Colette R. Brunschwig, Tobias Mahler and Helena Haapio, will be published in the forthcoming Nordic Yearbook of Legal Informatics (Svantesson & Greenstein, eds., Ex Tuto Publishing 2013).

Furthermore, the use of visualizations has been studied, for example, in the context of improving comprehension of jury instructions and in facilitating the making of complex decisions connected with dispute resolution. Visualization has also been observed in the role of a persuasion tool in a variety of settings, from the courtroom to the boardroom. After Richard Sherwin debuted Visual Persuasion in the Law at New York Law School and launched the Visual Persuasion Project website, it has become easier for law schools to teach their students about visual evidence and visual advocacy. It is no longer unusual for law teachers or students to use flowcharts and decision trees, and the list goes on. A Google search will reveal the growing number of such applications in law.

If instead of reading you prefer learning by doing, there are some great opportunities later this year. The Simplification Centre and the University of the Aegean will run an international summer/autumn Course on Information Design 30 September to 4 October 2013 in Syros, Greece. Provided that there is enough interest, we plan to arrange 1) special sessions on merging contract/legal design with information design and visualization; and 2) a Legal Design Jam, modeled on hackathons, with a small committed group of interested people, including legal and other practitioners and graphic designers, aiming at giving an extreme visual makeover to a chosen text or document (piece of legislation, contract, license, terms and conditions, …). If you are interested, please contact the organizers at info(at)

On 8 October 2013, the International Association for Contract & Commercial Management (IACCM) will hold its Academic Forum in Phoenix, Arizona. The conference topics include legal visualization as it relates to commercial and contract management. If you are interested in submitting a proposal for a presentation or a paper, there is still time to do so: the deadline is 1 July 2013. Please see the Call for Papers for details. If we can find a host and a group of committed professionals, scholars and graphic designers, we are also planning to put together a Design Jam right before or after the IACCM Americas Forum at a US location to be agreed. The candidate document for redesign is still to be decided, so please send us your suggestions! If you are interested in hosting or participating, please contact either of us at the email address below to express interest, ask questions, or give suggestions.

What does the future hold?

We see these steps as just the beginning. Once the visual turn has begun, we do not think it can be stopped; the benefits are just too many. As lawyers, we have a lot to learn and we could do our job better in so many respects if we indeed started to get into the mode of thinking and acting like a designer and not just like a lawyer. This applies not only to purely legal information, but everything else we produce: contracts, memos, corporate governance materials, policies, manuals, employee handbooks, and guidance.

Legal information tends to be complex, and information design(ers) can help us make it easier to understand and act upon. The goal is accomplishing the writer’s goals by meeting the readers’ needs. We can start to radically transform legal information following the footsteps of Rob Waller’s team at the Simplification Centre by applying What makes a good document to legal documents.

With new tools and services being developed, it will become easier to convey our content and documents in more usable and more engaging ways. As the work progresses and new tools and apps appear, we are likely to see a major change in the legal industry. Meanwhile, let us know your views and ideas and what you are doing or interested in doing with visuals.

Helena Haapio is International Contract Counsel for Lexpert Ltd based in Helsinki, Finland. Before founding Lexpert she served for several years as in-house legal counsel. She holds a Diploma in Legal Studies (University of Cambridge) and an LL.M. (Turku). She does research on proactive contracting, user-centered contract design and visualization as means to enhance companies’ ease of doing business and to simplify contracting processes and documents as part of her Ph.D. at the University of Vaasa, where she teaches strategic business law. She also acts as arbitrator. Her recent books include A Short Guide to Contract Risk (Gower 2013) and Proactive Law for Managers (Gower 2011), co-authored with Professor George Siedel. Through visualization, she seeks to revolutionize the way contracts and the law are communicated, taught, and perceived. Helena can be contacted at Helena.Haapio(at)

Stefania Passera is a researcher in MIND Research Group, a multidisciplinary research team at Aalto University School of Science, Helsinki, Finland. She holds an MA in graphic design (Aalto University School of Art, Design and Architecture), and has been doing research on the usability and user experience of information visualizations in contracts as part of her Ph.D. The leitmotiv of her work is to explore how design and designers can contribute to new multidisciplinary endeavors and what value their way of thinking and doing brings to the mix. Stefania has been collaborating with private and public organizations in Finland on the development of user-centered visual contract documents since 2011.
Stefania can be contacted at stefania.passera(at)


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Take a look at your bundle of tags on Delicious. Would you ever believe you’re going to change the law with a handful of them?

You’re going to change the way you research the law. The way you apply it. The way you teach it and, in doing so, shape the minds of future lawyers.

Do you think I’m going too far? Maybe.

But don’t overlook the way taxonomies have changed the law and shaped lawyers’ minds so far. Taxonomies? Yeah, taxonomies.

We, the lawyers, have used taxonomies extensively through the years; Civil lawyers in particular have proved especially prone to them. We’ve used taxonomies for three reasons: to help legal research, to help memorization and teaching, and to apply the law.


Taxonomies help legal research.

First, taxonomies help us retrieve what we’ve stored (rules and case law).

Are you looking for a rule about a sales contract? Dive deep into the “Obligations” category and the corresponding book (Recht der Schuldverhältnisse, Obbligazioni, Des contrats ou des obligations conventionnelles en général, you name it).

If you are a Common Lawyer, and ignore the perverse pleasure of browsing through Civil Code taxonomy, you’ll probably know Westlaw’s classification and its key numbering system. It has much more concrete categories and therefore much longer lists than the Civilians’ classification.
Legal taxonomies are there to help users find the content they’re looking for.

However, taxonomies sometimes don’t reflect the way the users reason; when this happens, you just won’t find what you’re looking for.

The problem with legal taxonomies.

If you are a German lawyer, you’ll probably be searching the “Obligations” book for rules concerning marriage; indeed in the German lawyer’s frame of mind, marriage is a peculiar form of contract. But if you are Italian, like I am, then you will most probably start looking in the “Persons” book; marriage rules are simply there, and we have been taught that marriage is not a contract but an agreement with no economic content (we have been trained to overlook the patrimonial shade in deference to the sentimental one).

So if I, the Italian, look for rules about marriage in the German civil code, I won’t find anything in the “Persons” book.
In other words, taxonomies work when they’re used by someone who reasons like the creator or, as happens with lawyers inside a given legal system, when users are trained to use the same taxonomy; and lawyers are trained at length.

But let’s take my friend Tim; he doesn’t have a legal education. He’s navigating Westlaw’s key number system looking for some relevant case law on car crashes. By chance he knows he should look below “torts,” but where? Is this injury and damage from act (k439)? Is this injury to a person in general (k425)? Is this injury to property or right of property in general (k429)? Wait, should he look below “crimes” (he is unclear on the distinction between torts and crimes)? And so on. Do these questions sound silly to you, the lawyers? Consider this: the titles we mentioned give no hint of the content, unless you already know what’s in there.

Because Law, complex as it is, needs a map. Lawyers have been trained to use the map. But what about non-lawyers?

In other words, the problems with legal taxonomies occur when the creators and the users don’t share the same frame of mind. And this is most likely to happen when the creators of the taxonomy are lawyers and the users are not lawyers.
Daniel Dabney wrote something similar some time ago. Let’s imagine that I buy a dog, take the little pooch home and find out that it’s mangy. Let’s imagine I’m that kind of aggressively unsatisfied customer and want to sue the seller, but know nothing about law. I go to the library and what will I look for? Rules on dog sales? A book on dog law? I’m lucky, there’s one, actually: “Dog law”, a book that gathers all laws regarding dogs and dog owners.
But of course, that’s just luck, and if I had to browse through the legal categories in Westlaw’s index, I would never find anything regarding “dogs”. I would never find the word “dog”, which is nonetheless the first word a person without legal training would think of. A savvy lawyer would look for rules regarding sales and warranties: general categories I may not know of (or think of) if I’m not a lawyer. If I’m not a lawyer I may not know that “the sale of arguably defective dogs are to be governed by the same rules that apply to other arguably defective items, like leaky fountain pens”. Dogs are like pens for a lawyer, but they are just dogs for a dog owner: so a dog owner will look for rules about dogs, not rules about sales and warranties (or at least he will look for the sale of dogs). And dog law, a user-aimed, object-oriented category, would probably fit his needs.

Observation #1: To make legal content available to everyone we must change the information architecture through which legal information is presented.

Will folksonomies do a better job?
Let’s come to folksonomies now. Here, the mismatch between creators (lawyers) and users’ way of reasoning is less likely to occur. The very same users decide which category to create and what to put into it. Moreover, more tags can overlap; that is, the same object can be tagged more than once. This allows the user to consider the same object from different perspectives. Take Delicious. If you search for “Intellectual property” on the Delicious search engine, you find a page about Copyright definition on Wikipedia. It was tagged mainly with “copyright.” But many users also tagged it with “wikipedia,” “law” and “intellectual-property” and even “art”. Maybe it was the non-lawyers out there who found it more useful to tag it with the “law” tag (a lawyer’s tag would have been more specific); maybe it was the lawyers who massively tagged it with “art” (there are a few “art” tags in their libraries). Or was it the other way around? The thing is, it’s up to users to decide where to classify it.

People also tag laws on Delicious using different labels that may or may not be related to law, because Delicious is a general-use website. But instead, let’s take a crowdsourced legal content website like Docracy. Here, people upload and tag their contracts, so it’s only legal content, and they tag them using only legal categories.

On Docracy, I found that a whole category of documents was dedicated to Terms of Service. Terms of Service is not a traditional legal category (like torts, property, and contracts), but it was a particularly useful category for Docracy users.

Docracy: WordPress Terms of Service are tagged with "TOS" but also with "Website".

If I browse some more, I see that the WordPress TOS are also tagged with “website.” Right, it makes sense; that is, if I’m a web designer looking for the legal stuff I need to know before deploying my website. If I start looking just from “website,” I’ll find TOS, but also “contract of works for web design” or “standard agreements for design services” from AIGA.

You got it? What legal folksonomies bring us is:

  1. User-centered categories
  2. Flexible categorization systems. Many items can be tagged more than once and so be put into different categories. Legal stuff can be retrieved through different routes but also considered under different lights.
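The two properties above can be pictured with a very small data structure. What follows is only a sketch under stated assumptions: the `Folksonomy` class and its method names are hypothetical, the document titles are illustrative, and nothing here describes how Docracy or Delicious actually store tags.

```python
from collections import defaultdict


class Folksonomy:
    """Toy tag index (hypothetical): any user may attach any tag to any item."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)
        self._tags_by_item = defaultdict(set)

    def tag(self, item, *tags):
        # Tags are normalized so that "TOS" and "tos" land in the same category.
        for t in tags:
            key = t.strip().lower()
            self._items_by_tag[key].add(item)
            self._tags_by_item[item].add(key)

    def items(self, tag):
        # One route into the collection: everything filed under a given tag.
        return sorted(self._items_by_tag.get(tag.strip().lower(), set()))

    def tags(self, item):
        # The different "lights" under which users have considered one item.
        return sorted(self._tags_by_item.get(item, set()))


docs = Folksonomy()
docs.tag("WordPress Terms of Service", "TOS", "website")
docs.tag("Standard agreement for design services", "website", "contract")

print(docs.items("website"))
print(docs.tags("WordPress Terms of Service"))
```

The same document is reachable from “TOS” and from “website”, which is the flexibility that a single-parent taxonomy does not offer.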

Will this enhance findability? I think it will, especially if the users are non-lawyers. And services that target the low-end of the legal market usually target non-lawyers.

Alright, I know what you’re thinking. You’re thinking, oh no, yet another naive folksonomy supporter! And then you say: “Folksonomy structures are too flat to constitute something useful for legal research!” and “Law is too specific a sector, with highly technical vocabulary and structure. Users without legal training would just tag wrongly”.

Let me quickly address these issues.

Objection 1: Folksonomies are too flat to constitute something useful for legal research

Let’s start from a premise: we have no studies on legal folksonomies yet. Docracy is not a full folksonomy yet (users can tag, but tags are pre-determined by administrators). But we do have examples of folksonomies tout court, so my argument moves analogically from them. Folksonomies do work. Take the Library of Congress Flickr project. Like an old grandmother, the Library gathered thousands of pictures that no one ever had the time to review and categorize. So pictures were uploaded on Flickr and left for the users to tag and comment. They did it en masse, mostly by using descriptive or topical tags (non-subjective) that were useful for retrieval. If folksonomies work for pictures (Flickr), books (Goodreads), questions and answers (Quora), and basically everything else (Delicious), why shouldn’t they work for law?

Given that premise, let’s move to the first objection: folksonomies are flat. Wrong. As folksonomies evolve, we find out that they can have two, three and even more levels of categories. Take a look at the Quora hierarchy.

That’s not flat. Look, there are at least four levels in the screenshot: Classical Musicians & Composers > Pianists > Jazz Pianists > Ray Charles > What’d I Say. Right, Jazz pianists are not classical musicians: but mistakes do occur and the good point in folksonomies is that users can freely correct them.

Second point: findability doesn’t depend only on hierarchies. You can browse the folksonomy’s categories but you can also use free text search to dig into it.  In this case, users’ tags are metadata and so findability is enhanced because the search engine retrieves what users have tagged–not what admins have tagged.


Objection 2: Non-legal people will use the wrong tags

Uhm, yes, you’re right. They will tag a criminal law document with “tort” and a tort case involving a car accident with “car crash”. And so? Who cares? What if the majority of users find it useful? We forget too often that law is a social phenomenon, not a tool for technicians. And language is a social phenomenon too. If users consistently tag a legal document with the “wrong” tag X instead of the “right” tag Y, it means that they usually name that legal document with X. So most of them, when looking for that document, will look for X. And they’ll retrieve it, and be happy with that.

Of course, legally savvy people would like to search by typical legal words (like, maybe, “chattel”?) or by using the legal categories they know so well. Do we want to compromise? The fact is, in a system where there is only user-generated content, it goes without saying that a traditional top-down taxonomy would not work. But if we imagine a system where content is not user-generated, like a legislation or case law database, a compromise becomes possible. There could be, for instance, a mixed taxonomy-folksonomy system where the taxonomy is built with traditional legal terms and schemes, whereas the folksonomy is built by the users, who are free to tag. Search, in the end, can be done by browsing the taxonomy, by browsing the folksonomy, or by means of a search engine that fishes for content relying both on metadata chosen by system administrators and on metadata chosen by the users who tagged the content.
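To make the mixed system concrete, here is a minimal sketch of a search that draws on both kinds of metadata. Everything in it is an assumption made for illustration: the `Document` fields, the `search` and `browse_taxonomy` helpers, and the two sample cases are invented and do not describe any existing database.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    taxonomy_path: tuple                           # chosen by editors (top-down)
    user_tags: set = field(default_factory=set)    # chosen freely by users


def search(docs, query):
    """Match the query against the text, the editors' categories, or the users' tags."""
    q = query.lower()
    hits = []
    for d in docs:
        in_text = q in d.text.lower()
        in_taxonomy = any(q == c.lower() for c in d.taxonomy_path)
        in_tags = q in {t.lower() for t in d.user_tags}
        if in_text or in_taxonomy or in_tags:
            hits.append(d.title)
    return hits


def browse_taxonomy(docs, *path):
    """Traditional route: list everything filed under a category path."""
    prefix = tuple(p.lower() for p in path)
    return [d.title for d in docs
            if tuple(c.lower() for c in d.taxonomy_path[:len(prefix)]) == prefix]


corpus = [
    Document("Case A", "The defendant's vehicle struck the plaintiff at an intersection.",
             ("Torts", "Negligence"), {"car crash"}),
    Document("Case B", "The seller warranted that the goods were free of defects.",
             ("Contracts", "Sales"), {"dog", "refund"}),
]

print(search(corpus, "car crash"))           # reachable only through a user tag
print(search(corpus, "torts"))               # reachable only through the taxonomy
print(browse_taxonomy(corpus, "Contracts"))
```

In this sketch the lawyer who thinks “torts” and the non-lawyer who thinks “car crash” both arrive at Case A, each through their own vocabulary.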

This may seem like an imaginary system–but it’s happening already. Amazon uses traditional categories and leaves the users free to tag. The BBC website followed a similar pattern, moving from a full taxonomy system to a hybrid taxonomy-folksonomy one. Resilience, resilience, as Andrea Resmini and Luca Rosati put it in their seminal book on information architecture. Folksonomies and taxonomies can coexist. But this is not what this article is about, so sorry for the digression and let’s move to the first prediction.

Prediction #1: Folksonomies will provide the right information architecture for non-legal users.

Taxonomies and folksonomies help legal teaching.

Secondly, taxonomies help us memorize rules and case law. Put all the things in a box and group them on the basis of a common feature, and you’ll easily remember where they are. For this reason, taxonomies have played a major role in legal teaching. I’ll tell you a little story. Civil lawyers know very well the story of Gaius, the ancient Roman jurist who created a successful taxonomy for his law handbook, the Institutiones. His taxonomy was threefold: all law can be divided into persons, things, and actions. Five centuries later (five centuries!) Emperor Justinian transferred the very same taxonomy into his own Institutiones, a handbook aimed at youth “craving for legal knowledge” (cupidae legum iuventuti). Why? Because it worked! How powerful, both the slogan and the taxonomy! Indeed, more than 1000 years later, we find it again, with a few changes, in the German, French, Italian, and Spanish Civil Codes, and in a whole bunch of nutshells that explain private law following the taxonomy of the Codes.

And now, consider what the taxonomies have done to lawyers’ minds.

Taxonomies have shaped their way of considering facts. Think. Put something into a category and you will lose all the other points of view on the same thing. The category shapes and limits the way we look at that particular thing.

Have you ever noticed how civil lawyers and common lawyers have a totally different way of looking at facts? Common lawyers see and take into account the details. Civil lawyers overlook them because the taxonomy they use has told them to do so.

In Rylands vs Fletcher (a UK tort case) some water escapes from a reservoir and floods a mine nearby. The owner of the reservoir could not possibly foresee the event and prevent it. However, the House of Lords states that the owner of the mine has the right to recover damages, even if there is no negligence. (“The person who for his own purpose brings on his lands and collects and keeps there anything likely to do mischief, if it escapes, must keep it in at his peril, and if he does not do so, is prima facie answerable for all the damage which is the natural consequence of its escape.”)

In Read vs Lyons, however, an employee gets injured during an explosion occurring in the ammunition factory where she is employed. The rule set in Rylands couldn’t be applied, as, according to the House of Lords, the case was very different; there is no escape.

On the contrary, for a Civil lawyer the decision would have been the same in both cases. For instance, under the Italian Civil Code (but the French and German Codes are not substantially different on this point), one would apply the general rule that grants compensation for damages caused by “dangerous activities” and requires no proof of negligence from the plaintiff (art. 2050 of the Civil Code), no matter what causes the danger (a big reservoir of water, an ammunition factory, whatever else).

Observation #2: Taxonomies are useful for legal teaching and they shape lawyers’ minds.

Folksonomies for legal teaching?

Okay, and what about folksonomies? What if the way people tag legal concepts makes its way into legal teaching?

Take Docracy’s TOS category. Have you ever thought about a course on TOS?

Another website, another example: Rocket Lawyer. Its categorization is not based on folksonomy, however; it’s purposely built around users’ needs, which have been tested over the years, so in a way the taxonomy of the website comes from its users. One category is “identity theft”, which should be quite popular if it is featured on the first page. What about teaching a course on identity theft? That would merge some material traditionally taught in privacy law, criminal law, and torts courses. Some course areas would overlap, which is good for memorization. Think again of the example of “Dog Law” by Dabney. What about a course about Dog Law, collecting material that refers to dogs across traditional legal categories?

Also, the same topic would be considered from different points of view.

What if students were trained in the flexibility of categories described above? They wouldn’t get trapped in a single way of seeing things. If folksonomies account for different levels of abstraction, students would be trained to consider details. Not only that, they would develop a very flexible frame of mind.

Prediction #2: Legal folksonomies in legal teaching would keep lawyers’ minds flexible.


Taxonomies and folksonomies SHAPE the law.

Third, taxonomies make the law apply differently. Think about it. They are the very highways that allow the law to travel down to us. And here it comes, the real revolutionary potential of legal folksonomies, if we were to make them work.

Let’s start from taxonomies, with a couple of examples.

Civil lawyers are taught that Public and Private Law are two distinct areas of law, to which different rules apply. In common law, the distinction is not that clear-cut. In Rigby vs Chief Constable of Northamptonshire (a tort case from UK case law) the police, in an attempt to catch a criminal, damage a private shop by accidentally firing a canister of gas and setting the shop ablaze. The Queen’s Bench Division establishes that the police are liable under the tort of negligence only because the plaintiff manages to prove the police’s fault; they apply a private law category to a public body.
How would the same case have been decided under, say, French law? As the division between public and private law is stricter, the category of liability without fault, which is traditionally used when damages are caused by public bodies, would apply. The State would have to indemnify the damage, no matter if there was negligence.

Remember Rylands vs Fletcher and Read vs Lyons? The presence or absence of escape was decisive, because the English taxonomy is very concrete. Civil lawyers work with taxonomies that have fewer, larger, and more abstract categories. If you cause damages by performing a risky activity, even if conducted without fault, you have to repay them. Period. Abstract taxonomy sweeps out any concrete detail. I think that Robert Berring had something like this in mind, although he referred to legal research, when he said that “classification defines the world of thinkable thoughts”. Or, as Dabney puts it, “thoughts that aren’t represented in the system had become unthinkable”.
So taxonomies make the law apply differently. In the former case, by setting a boundary between the public-private spheres; in the latter by creating a different framework for the application of more abstract or more detailed rules.


You don’t get it? All right, it’s tough, but do you have two minutes more? Let’s take this example by Dabney. The Key Number System’s taxonomy distinguishes between navigable and non-navigable waters (in the screenshot: waters and water courses). There’s a reason for that: lands under navigable waters presumptively belong to the state, because “private ownership of the land under navigable waters would (…) compromise the use of those waters for navigation and commerce”. So there are two categories because different laws apply to each. But now look at this screenshot.

Find anything strange? Yes: avulsion rules are “doubled”: they are contained in both categories. But they are the very same: rules concerning avulsion don’t change if the water is navigable or not (check the definition of avulsion if you, like me, don’t remember what it is). Dabney: “In this context,(…) there is no difference in the legal rules that are applied that depend on whether or not the water is navigable. Navigability has an effect on a wide range of issues concerning waters, but not on the accretion/avulsion issue. Here, the organization of the system needlessly separates cases from each other on the basis of an irrelevant criterion”. And you think, ok, but as long as we are aware of this error and know the rules concerning avulsion are the same, it’s no biggie. Right, but in the future?

“If searchers, over time, find cases involving navigable waters in one place and non-navigable waters in another, there might develop two distinct bodies of law.” Got it? Dabney foresees it. The way we categorize the law would shape the way we apply it.

Observation #3: Different taxonomies entail different ways to apply the law.

So, what if we substitute taxonomies with folksonomies?

And what if they had the power to shape the way judges, legal scholars, lawmakers and legal operators think?

Legal folksonomies are just starting out, and what I envisage is still yet to come. Which makes this article kind of a visionary one, I admit.

However, what Docracy is teaching us is that users—I didn’t say lawyers, but users—are generating decent legal content. Would you have bet your two cents on this, say, five years ago?
What if users started generating new legal categories (legal folksonomies)?

Berring wrote something really visionary more than ten years ago in his beautiful “Legal Research and the World of Thinkable Thoughts”. He couldn’t have had folksonomies in mind, and still, wouldn’t you think he was referring to them when he wrote: “There is simply too much stuff to sort through. No one can write a comprehensive treatise any more, and no one can read all of the new cases. Machines are sorting for us. We need a new set of thinkable thoughts. We need a new Blackstone. We need someone, or more likely a group of someones, who can reconceptualize the structure of legal information.”?

Prediction #3: Legal folksonomies will make the law apply differently.

Let’s wait and see. Let the users tag. Where this tagging is going to take us is unpredictable, yes, but if you look at where taxonomies have taken us for all these years, you may find a clue.

I have a gut feeling that folksonomies are going to change the way we search, teach, and apply the law.




Serena Manzoli is a legal architect and the founder of Wildcat, legal search for curious humans. She has been a Euro bureaucrat, a cadet, an in-house counsel, a bored lawyer. She holds an LLM from University of Bologna. She blogs at Lawyers are boring. Twitter: SquareLaw

[Editor’s Note: We are pleased to publish this piece from Qiang Lu and Jack Conrad, both of whom worked with Thomson Reuters R&D on the WestlawNext research team. Jack Conrad continues to work with Thomson Reuters, though currently on loan to the Catalyst Lab at Thomson Reuters Global Resources in Switzerland. Qiang Lu is now based at Kore Federal in the Washington, D.C. area. We read with interest their 2012 paper from the International Conference on Knowledge Engineering and Ontology Development (KEOD), “Bringing order to legal documents: An issue-based recommendation system via cluster association”, and are grateful that they have agreed to offer some system-specific context for their work in this area. Their current contribution represents a practical description of the advances that have been made between the initial and current versions of Westlaw, and what differentiates a contemporary legal search engine from its predecessors.  -sd]

In her blog on “Pushing the Envelope: Innovation in Legal Search” (2009) [1], Edinburgh Informatics Ph.D. candidate K. Tamsin Maxwell presents her perspective of the state of legal search at the time. The variations of legal information retrieval (IR) that she reviews − everything from natural language search (e.g., vector space models, Bayesian inference net models, and language models) to NLP and term weighting − refer to techniques that are now 10, 15, even 20 years old. She also refers to the release of the first natural language legal search engine by West back in 1993−WIN (Westlaw Is Natural) [2]. Adding to this on-going conversation about legal search, we would like to check back in, a full 20 years after the release of that first natural language legal search engine. The objective we hope to achieve in this posting is to provide a useful overview of state-of-the-art legal search today.

What Maxwell’s article could not have predicted, even five years ago, are some of the chief factors that distinguish state-of-the-art search engines today from their earlier counterparts. One of the most notable distinctions is that unlike their predecessors, contemporary search engines, including today’s state-of-the-art legal search engine, WestlawNext, separate the function of document retrieval from document ranking. Whereas the first retrieval function primarily addresses recall, ensuring that all potentially relevant documents are retrieved, the second and ensuing function focuses on the ideal ranking of those results, addressing precision at the highest ranks. By contrast, search engines of the past effectively treated these two search functions as one and the same. So what is the difference? The document retrieval piece may not be dramatically different from what it was when WIN was first released in 1993. What is dramatically different is the evidence considered in the ranking piece, which allows potentially dozens of weighted features to be taken into account and tracked as part of the optimal ranking process.
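The separation of retrieval from ranking can be illustrated with a deliberately tiny sketch. It is not a description of WestlawNext: the `retrieve` and `rank` functions, the two features, the weights and the three sample documents are all assumptions chosen for illustration, and a production system would use far richer features than these.

```python
def retrieve(corpus, query):
    """Stage 1 (recall): keep every document sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc["text"].lower().split())]


def rank(candidates, query, weights):
    """Stage 2 (precision): order the candidates by a weighted sum of features."""
    terms = set(query.lower().split())

    def score(doc):
        words = set(doc["text"].lower().split())
        features = {
            "term_overlap": len(terms & words) / len(terms),  # text evidence
            "cited_by": doc["cited_by"] / 100.0,              # citation evidence (toy scaling)
        }
        return sum(weights[name] * value for name, value in features.items())

    return sorted(candidates, key=score, reverse=True)


corpus = [
    {"id": "A", "text": "landlord liability for tenant injury", "cited_by": 5},
    {"id": "B", "text": "landlord duty to repair", "cited_by": 80},
    {"id": "C", "text": "maritime salvage claims", "cited_by": 40},
]

query = "landlord liability"
candidates = retrieve(corpus, query)

text_first = rank(candidates, query, {"term_overlap": 1.0, "cited_by": 0.5})
citations_first = rank(candidates, query, {"term_overlap": 1.0, "cited_by": 1.0})

print([d["id"] for d in text_first])
print([d["id"] for d in citations_first])
```

Changing the weights reorders the results without changing which documents were retrieved, which is the point of keeping the two functions apart.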

Figure 1. The set of evidence (views) that can be used by modern legal search engines.

In traditional search, the principal evidence considered was the main text of the document in question. In the case of traditional legal search, those documents would be cases, briefs, statutes, regulations, law reviews and other forms of primary and secondary (a.k.a. analytical) legal publications. This textual set of evidence can be termed the document view of the world. In the case of legal search engines like Westlaw, there also exists the ability to exploit expert-generated annotations or metadata. These annotations come in the form of attorney-editor generated synopses, points of law (a.k.a. headnotes), and attorney-classifier assigned topical classifications that rely on a legal taxonomy such as West’s Key Number System [3]. The set of evidence based on such metadata can be termed the annotation view. Furthermore, in a manner loosely analogous to today’s World Wide Web and the lattice of inter-referencing documents that reside there, today’s legal search can also exploit the multiplicity of both out-bound (cited) sources and in-bound (citing) sources with respect to a document in question, and, frequently, the granularity of these citations is not merely at a document-level but at the sub-document or topic level. Such a set of evidence can be termed the citation network view. More sophisticated engines can examine not only the popularity of a given cited or citing document based on citation frequency, but also the polarity and scope of the arguments it makes.

In addition to the “views” described thus far, a modern search engine can also harness what has come to be known as aggregated user behavior. While individual users and their individual behavior are not considered, in instances where there is sufficient accumulated evidence, the search function can consider document popularity thanks to a user view. That is to say, in addition to a document being returned in a result set for a certain kind of query, the search provider can also tabulate how often a given document was opened for viewing, how often it was printed, or how often it was checked for its legal validity (e.g., through citator services such as KeyCite [4]). (See Figure 1) This form of marshaling and weighting of evidence only scratches the surface, for one can also track evidence between two documents within the same research session, e.g., noting that when one highly relevant document appears in result sets for a given query-type, another document typically appears in the same result sets. In summary, such a user view represents a rich and powerful additional means of leveraging document relevance as indicated through professional user interactions with legal corpora such as those mentioned above.

It is also worth noting that today’s search engines may factor in a user’s preferences, for example, by knowing what jurisdiction a particular attorney-user practices in, and what kinds of sources that user has historically preferred, over time and across numerous result sets.

While the materials or data relied upon in the document view and citation network view are authored by judges, law clerks, legislators, attorneys and law professors, the summary data present in the annotation view is produced by attorney-editors. By contrast, the aggregated user behavior data represented in the user view is produced by the professional researchers who interact with the retrieval system. The result of this rich and diverse set of views is that the power and effectiveness of a modern legal search engine comes not only from its underlying technology but also from the collective intelligence of all of the domain expertise represented in the generation of its data (documents) and metadata (citations, annotations, popularity and interaction information). Thus, the legal search engine offered by WestlawNext (WLN) represents an optimal blend of advanced artificial intelligence techniques and human expertise [5].

Given this wealth of diverse material representing various forms of relevance information and tractable connections between queries and documents, the ranking function executed by modern legal search engines can be optimized through a series of training rounds that “teach” the machine what forms of evidence make the greatest contribution for certain types of queries and available documents, along with their associated content and metadata. In other words, the re-ranking portion of the machine learns how to weigh the “features” representing this evidence in a manner that will produce the best (i.e., highest precision) ranking of the documents retrieved.
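
As a rough illustration of what such training rounds do, the sketch below learns feature weights from pairs of documents in which one is known to be more relevant than the other. It uses a simple perceptron-style pairwise update, chosen here only for brevity; the feature names and the toy training pairs are assumptions, and nothing here is taken from the actual WestlawNext ranker.

```python
def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn feature weights from (better, worse) pairs of feature vectors."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair is mis-ordered: nudge weights toward `better`
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score(w, features):
    """Weighted sum of a document's feature values."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Hypothetical features: [text match, citation frequency, user popularity]
pairs = [
    ([0.9, 0.2, 0.1], [0.4, 0.8, 0.3]),
    ([0.8, 0.5, 0.6], [0.3, 0.4, 0.2]),
    ([0.7, 0.1, 0.4], [0.2, 0.9, 0.1]),
]
weights = train_pairwise(pairs, n_features=3)
```

After training on this toy data, `score(weights, better)` exceeds `score(weights, worse)` for every training pair. With three pairs the learned weights say little about which signals matter in practice; the point is only that the weights come from judged examples rather than being set by hand.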

Nevertheless, a search engine is still highly influenced by the user queries it has to process, and for some legal research questions, an independent set of documents grouped by legal issue would be a tremendous complementary resource for the legal researcher, one at least as effective as trying to assemble the set of relevant documents through a sequence of individual queries. For this reason, WLN offers in parallel a complement to search entitled “Related Materials” which in essence is a document recommendation mechanism. These materials are clustered around the primary, secondary and sometimes tertiary legal issues in the case under consideration.

Legal documents are complex and multi-topical in nature. By detecting the top-level legal issues underlying the original document and delivering recommended documents grouped according to these issues, a modern legal search engine can provide a more effective research experience to a user when providing such comprehensive coverage [6,7]. Illustrations of some of the approaches to generating such related material are discussed below.

Take, for example, an attorney who is running a set of queries that seeks to identify a group of relevant documents involving “attractive nuisance” for a party that witnessed a child nearly drown in a swimming pool. After a number of attempts using several different key terms in her queries, the attorney selects the “Related Materials” option that subsequently provides access to the spectrum of “attractive nuisance”-related documents. Such sets of issue-based documents can represent a mother lode of relevant materials. In this instance, pursuing this navigational path rather than a query-based one turns out to be a good choice. Indeed, the query-based approach could take time and would lead to a gradually evolving set of relevant documents. By contrast, harnessing the cluster of documents produced for “attractive nuisance” may turn out to be the most efficient approach to total recall and the desired degree of relevance.

To further illustrate the benefit of a modern legal search engine, we will conclude our discussion with an instructive search using WestlawNext, and its subsequent exploration by way of this recommendation resource available through “Related Materials.”

The underlying legal issue in this example is “church support for specific candidates”, and a corresponding query is issued in the search box. Figure 2 provides an illustration of the top cases retrieved.


Figure 2: Search result from WestlawNext

Let’s assume that the user decides to closely examine the first case. By clicking the link to the document, the content of the case is rendered, as in Figure 3. Note that on the right-hand side of the panel, the major legal issues of the case “Canyon Ferry Road Baptist Church … v. Unsworth” have been automatically identified and presented with hierarchically structured labels, such as “Freedom of Speech / State Regulation of Campaign Speech” and “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee,” … By presenting these closely related topics, a user is empowered with the ability to dive deep into the relevant cases and other relevant documents without explicitly crafting any additional or refined queries.


Figure 3: A view of a case and complementary materials from WestlawNext

By selecting these sets of relevant topics, a set of recommended cases will be rendered under that particular label. Figure 4, for example, shows the related topic view of the case under the label of “Freedom of Speech / View of Federal Election Campaign Act / Definition of Political Committee.” Note that this process can be repeated based on the particular needs of a user, starting with a document in the original results set.


Figure 4: Related Topic view of a case

In summary, by utilizing the combination of human expert-generated resources and sophisticated machine-learning algorithms, modern legal search engines bring the legal research experience to an unprecedented and powerful new level. For those seeking the next generation in legal search, it’s no longer on the horizon. It’s already here.


[1] K. Tamsin Maxwell, “Pushing the Envelope: Innovation in Legal Search,” in VoxPopuLII, Legal Information Institute, Cornell University Law School, 17 Sept. 2009.
[2] Howard Turtle, “Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance,” In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 1994) (Dublin, Ireland), Springer-Verlag, London, pp. 212-220, 1994.
[3] West’s Key Number System:
[4] West’s KeyCite Citator Service:
[5] Peter Jackson and Khalid Al-Kofahi, “Human Expertise and Artificial Intelligence in Legal Search,” in Structuring of Legal Semantics, A. Geist, C. R. Brunschwig, F. Lachmayer, G. Schefbeck Eds., Festschrift ed. for Erich Schweighofer, Editions Weblaw, Bern, pp. 417-427, 2011.
[6] On Cluster definition and population: Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, “Legal Document Clustering with Built-in Topic Segmentation,” In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, pp. 383-392, 2011.
[7] On Cluster association with individual documents: Qiang Lu and Jack G. Conrad, “Bringing order to legal documents: An Issue-based Recommendation System via Cluster Association,” In Proceedings of the 4th International Conference on Knowledge Engineering and Ontology Development (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Jack G. Conrad currently serves as Lead Research Scientist with the Catalyst Lab at Thomson Reuters Global Resources in Baar, Switzerland. He was formerly a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research areas fall under a broad spectrum of Information Retrieval, Data Mining and NLP topics. Some of these include e-Discovery, document clustering and deduplication for knowledge management systems. Jack has researched and implemented key components for WestlawNext, West’s next-generation legal search engine, and PeopleMap, a very large scale Public Record aggregation system. Jack completed his graduate studies in Computer Science at the University of Massachusetts–Amherst and in Linguistics at the University of British Columbia–Vancouver.

Qiang Lu was a Senior Research Scientist with the Thomson Reuters Corporate Research & Development department. His research interests include data mining, text mining, information retrieval, and machine learning. He has extensive experience applying NLP technologies to a range of data sources, such as news, legal, financial, and law enforcement data. Qiang was a key member of the WestlawNext research team. He has a Ph.D. in computer science and engineering from the State University of New York at Buffalo. He is now a managing associate at Kore Federal in the Washington, D.C. area.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

At CourtListener, we are making a free database of court opinions with the ultimate goal of providing the entire U.S. case-law corpus to the world for free and combining it with cutting-edge search and research tools. We–like most readers of this blog–believe that for justice to truly prevail, the law must be open and equally accessible to everybody.

It is astonishing to think that the entire U.S. case-law corpus is not currently available to the world at no cost. Many have started down this path and stopped, so we know we’ve set a high goal for a humble open source project. From time to time it’s worth taking a moment to reflect on where we are and where we’d like to go in the coming years.

The current state of affairs

We’ve created a good search engine that can provide results based on a number of characteristics of legal cases. Our users can search for opinions by the case name, date, or any text that’s in the opinion, and can refine by court, by precedential status or by citation. The results are pretty good, but are limited based on the data we have and the “relevance signals” that we have in place.

A good legal search engine will use a number of factors (a.k.a. “relevance signals”) to promote documents to the top of its listings. Things like:

  • How recent is the opinion?
  • How many other opinions have cited it?
  • How many journals have cited it?
  • How long is it?
  • How important is the court that heard the case?
  • Is the case in the jurisdiction of the user?
  • Is the opinion one that the user has looked at before?
  • What was the subsequent treatment of the opinion?

And so forth. All of the above help to make search results better, and we’ve seen paid legal search tools make great strides in their products by integrating these and other factors. At CourtListener, we’re using a number of the above, but we need to go further. We need to use as many factors as possible, we need to learn how the factors interact with each other, which ones are the most important, and which lead to the best results.

A different problem we’re working to solve at CourtListener is getting primary legal materials freely onto the Web. What good is a search engine if the opinion you need isn’t there in the first place? We currently have about 800,000 federal opinions, including West’s second and third Federal Reporters, F.2d and F.3d, and the entire Supreme Court corpus. This is good and we’re very proud of the quality of our database–we think it’s the best free resource there is. Every day we add the opinions from the Circuit Courts in the federal system and the U.S. Supreme Court, nearly in real-time. But we need to go further: we need to add state opinions, and we need to add not just the latest opinions but all the historical ones as well.

This sounds daunting, but it’s a problem that we hope will be solved in the next few years. Although it’s taking longer than we would like, in time we are confident that all of the important historical legal data will make its way to the open Internet. Primary legal sources are already in the public domain, so now it’s just a matter of getting it into good electronic formats so that anyone can access it and anyone can re-use it. If an opinion only exists as unsearchable scanned versions, in bound books, or behind a pricey pay wall, then it’s closed to many people that should have access to it. As part of our citation identification project, which I’ll talk about next, we’re working to get the most important documents properly digitized.

Our citation identification project was developed last year by U.C. Berkeley School of Information students Rowyn McDonald and Karen Rustad to identify and cross-link any citations found in our database. This is a great feature that makes all the citations in our corpus link to the correct opinions, if we have them. For example, if you’re reading an opinion that has a reference to Roe v. Wade, you can click on the citation, and you’ll be off and reading Roe v. Wade. By the way, if you’re wondering how many Federal Appeals opinions cite Roe v. Wade, the number in our system is 801 opinions (and counting). If you’re wondering what the most-cited opinion in our system is, you may be bemused: With about 10,000 citations, it’s an opinion about ineffective assistance of legal counsel in death penalty cases, Strickland v. Washington, 466 U.S. 668 (1984).
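
For readers curious what citation identification involves, here is a deliberately simplified sketch. It is not the CourtListener implementation: the regular expression covers only three reporters, and the `known` lookup table and its URL path are hypothetical.

```python
import re

# Simplified pattern: volume, reporter abbreviation, first page.
# Only a few reporters are listed; real citation formats are far more varied.
CITATION_RE = re.compile(r"\b(\d{1,4})\s+(U\.S\.|F\.2d|F\.3d)\s+(\d{1,5})\b")


def find_citations(text):
    """Return (volume, reporter, page) tuples for each citation found."""
    return [(int(v), rep, int(p)) for v, rep, p in CITATION_RE.findall(text)]


def link_citations(text, known):
    """Turn citations that resolve to a known opinion into HTML links."""
    def repl(match):
        key = (int(match.group(1)), match.group(2), int(match.group(3)))
        if key in known:
            return f'<a href="{known[key]}">{match.group(0)}</a>'
        return match.group(0)  # opinion not in the corpus: leave the text alone
    return CITATION_RE.sub(repl, text)


# Hypothetical lookup from citation to a local URL path.
known = {(466, "U.S.", 668): "/opinion/strickland-v-washington/"}
sample = "See Strickland v. Washington, 466 U.S. 668 (1984), and 410 U.S. 113."
linked = link_citations(sample, known)
```

Citations that cannot be resolved, such as `410 U.S. 113` in this sample, are left untouched, which matches the “if we have them” behavior described above.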

A feature we’ll be working on soon will tie into our citation system to help us close any gaps in our corpus. Once the feature is done, whenever an opinion is cited that we don’t yet have, our users will be able to pay a small amount–one or two dollars–to sponsor the digitization of that opinion. We’ll do the work of digitizing it, and after that point the opinion will be available to the public for free.

This brings us to the next big feature we added last year: bulk data. Because we want to assist academic researchers and others who might have a use for a large database of court opinions, we provide free bulk downloads of everything we have. Like Carl Malamud’s, (to whom we owe a great debt for his efforts to collect opinions and provide them to others for free and for his direct support of our efforts) we have giant files you can download that provide thousands of opinions in computer-readable format. These downloads are available by court and date, and include thousands of fixes to the corpus. They also include something you can’t find anywhere else: the citation network. As part of the metadata associated with each opinion in our bulk download files, you can look and see which opinions it cites as well as which opinions cite it. This provides a valuable new source of data that we are very eager for others to work with. Of course, as new opinions are added to our system, we update our downloads with the new citations and the new information.
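
As a sketch of what a researcher might do with that network, the snippet below inverts a “cites” map into a “cited by” map and finds the most-cited opinion. The dictionary layout and the opinion identifiers are assumptions made for illustration; the actual bulk files may be organized differently.

```python
from collections import defaultdict


def build_cited_by(cites):
    """Invert {opinion: [opinions it cites]} into {opinion: [citing opinions]}."""
    cited_by = defaultdict(list)
    for citing, cited_list in cites.items():
        for cited in cited_list:
            cited_by[cited].append(citing)
    return dict(cited_by)


def most_cited(cited_by):
    """Return the opinion with the most in-bound citations."""
    return max(cited_by, key=lambda opinion: len(cited_by[opinion]))


# Hypothetical identifiers standing in for opinions in a bulk file.
cites = {
    "opinion-1": ["strickland", "roe"],
    "opinion-2": ["strickland"],
    "opinion-3": ["strickland", "opinion-1"],
}
cited_by = build_cited_by(cites)
top = most_cited(cited_by)
```

Counting in-bound citations this way is also the simplest form of the “how many other opinions have cited it?” relevance signal listed earlier.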

Finally, we would be remiss if we didn’t mention our hallmark feature: daily, weekly and monthly email alerts. For any query you put into CourtListener, you can request that we email you whenever there are new results. This feature was the first one we created, and one that we continue to be excited about. This year we haven’t made any big innovations to our email alerts system, but its popularity has continued to grow, with more than 500 alerts run each day. Next year, we hope to add a couple small enhancements to this feature so it’s smoother and easier to use.

The future

I’ve hinted at a lot of our upcoming work in the sections above, but what are the big-picture features that we think we need to achieve our goals?

We do all of our planning in the open, but we have a few things cooking in the background that we hope to eventually build. Among them are ideas for adding oral argument audio, case briefs, and data from PACER. Adding these new types of information to CourtListener is a must if we want to be more useful for research purposes, but doing so is a long-term goal, given the complexity of doing them well.

We also plan to build an opinion classifier that could automatically, and without human intervention, determine the subsequent treatment of opinions. Done right, this would allow our users to know at a glance if the opinion they’re reading was subsequently followed, criticized, or overruled, making our system even more valuable to our users.

In the next few years, we’ll continue building out these features, but as an open-source and open-data project, everything we do is in the open. You can see our plans on our feature tracker, our bugs in our bug tracker, and can get in touch in our forum. The next few years look to be very exciting as we continue building our collection and our platform for legal research. Let’s see what the new year brings!


Michael Lissner is the co-founder and lead developer of CourtListener, a project that works to make the law more accessible to all. He graduated from U.C. Berkeley’s School of Information, and when he’s not working on CourtListener he develops search and eDiscovery solutions for law firms. Michael is passionate about bringing greater access to our primary legal materials, about how technology can replace old legal models, and about open source, community-driven approaches to legal research.


Brian W. Carver is Assistant Professor at the U.C. Berkeley School of Information where he does research on and teaches about intellectual property law and cyberlaw. He is also passionate about the public’s access to the law. In 2009 and 2010 he advised an I School Masters student, Michael Lissner, on the creation of, an alert service covering the U.S. federal appellate courts. After Michael’s graduation, he and Brian continued working on the site and have grown the database of opinions to include over 750,000 documents. In 2011 and 2012, Brian advised I School Masters students Rowyn McDonald and Karen Rustad on the creation of a legal citator built on the CourtListener database.


VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.


Van Winkle wakes

In this post, we return to a topic we first visited in a book chapter in 2004.  At that time, one of us (Bruce) was an electronic publisher of Federal court cases and statutes, and the other (Hillmann, herself a former law cataloger) was working with large, aggregated repositories of scientific papers as part of the National Science Digital Library project.  Then, as now, we were concerned that little attention was being paid to the practical tradeoffs involved in publishing high quality metadata at low cost.  There was a tendency to design metadata schemas that said absolutely everything that could be said about an object, often at the expense of obscuring what needed to be said about it while running up unacceptable costs.  Though we did not have a name for it at the time, we were already deeply interested in least-cost, use-case-driven approaches to the design of metadata models, and that naturally led us to wonder what “good” metadata might be.  The result was “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”, published as a chapter in an ALA publication, Metadata in Practice.

In that chapter, we attempted to create a framework for talking about (and evaluating) metadata quality.  We were concerned primarily with metadata as we were then encountering it: in aggregations of repositories containing scientific preprints, educational resources, and in caselaw and other primary legal materials published on the Web.   We hoped we could create something that would be both domain-independent and useful to those who manage and evaluate metadata projects.  Whether or not we succeeded is for others to judge.

The Original Framework

At that time, we identified seven major components of metadata quality. Here, we reproduce a part of a summary table that we used to characterize the seven measures. We suggested questions that might be used to draw a bead on the various measures we proposed:

Quality Measure / Quality Criteria

Completeness
  • Does the element set completely describe the objects?
  • Are all relevant elements used for each object?

Provenance
  • Who is responsible for creating, extracting, or transforming the metadata?
  • How was the metadata created or extracted?
  • What transformations have been done on the data since its creation?

Accuracy
  • Have accepted methods been used for creation or extraction?
  • What has been done to ensure valid values and structure?
  • Are default values appropriate, and have they been appropriately used?

Conformance to expectations
  • Does metadata describe what it claims to?
  • Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
  • Are compromises documented and in line with community expectations?

Logical consistency and coherence
  • Is data in elements consistent throughout?
  • How does it compare with other data within the community?

Timeliness
  • Is metadata regularly updated as the resources change?
  • Are controlled vocabularies updated when relevant?

Accessibility
  • Is an appropriate element set for audience and community being used?
  • Is it affordable to use and maintain?
  • Does it permit further value-adds?
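
Some of these questions lend themselves to simple automated checks. As one hedged illustration, the second completeness question (“Are all relevant elements used for each object?”) could be approximated by counting how many required elements are actually populated. The element names and sample records below are hypothetical, and the ratio is only one of many possible ways to put a number on completeness.

```python
def is_filled(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is not None and str(value).strip() != ""


def completeness(records, required):
    """Share of required elements that are populated, across all records."""
    if not records or not required:
        return 0.0
    filled = sum(
        1 for rec in records for element in required if is_filled(rec.get(element))
    )
    return filled / (len(records) * len(required))


# Hypothetical records using Dublin Core-style element names.
records = [
    {"title": "Roe v. Wade", "date": "1973-01-22", "court": "scotus"},
    {"title": "Strickland v. Washington", "date": "", "court": "scotus"},
]
required = ["title", "date", "court", "creator"]
ratio = completeness(records, required)
print(ratio)
```

A count like this says nothing about whether the element set itself describes the objects well, which is the first completeness question and still calls for human judgment.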


There are, of course, many possible elaborations of these criteria, and many other questions that help get at them.  Almost nine years later, we believe that the framework remains both relevant and highly useful, although (as we will discuss in a later section) we need to think carefully about whether and how it relates to the quality standards that the Linked Open Data (LOD) community is discovering for itself, and how it and other standards should affect library and publisher practices and policies.

… and the environment in which it was created

Our work was necessarily shaped by the environment we were in.  Though we never really said so explicitly, we were looking for quality not only in the data itself, but in the methods used to organize, transform and aggregate it across federated collections.  We did not, however, anticipate the speed or scale at which standards-based methods of data organization would be applied.  Commonly-used standards like FOAF, models such as those contained in, and lightweight modelling apparatus like SKOS are all things that have emerged into common use since, and of course the use of Dublin Core — our main focus eight years ago — has continued even as the standard itself has been refined.  These days, an expanded toolset makes it even more important that we have a way to talk about how well the tools fit the job at hand, and how well they have been applied. An expanded set of design choices accentuates the need to talk about how well choices have been made in particular cases.

Although our work took its inspiration from quality standards developed by a government statistical service, we had not really thought through the sheer multiplicity of information services that were available even then.  We were concerned primarily with work that had been done with descriptive metadata in digital libraries, but of course there were, and are, many more people publishing and consuming data in both the governmental and private sectors (to name just two).  Indeed, there was already a substantial literature on data quality that arose from within the management information systems (MIS) community, driven by concerns about the reliability and quality of  mission-critical data used and traded by businesses.  In today’s wider world, where work with library metadata will be strongly informed by the Linked Open Data techniques developed for a diverse array of data publishers, we need to take a broader view.  

Finally, we were driven then, as we are now, by managerial and operational concerns. As practitioners, we were well aware that metadata carries costs, and that human judgment is expensive.  We were looking for a set of indicators that would spark and sustain discussion about costs and tradeoffs.  At that time, we were mostly worried that libraries were not giving costs enough attention, and were designing metadata projects that were unrealistic given the level of detail or human intervention they required.  That is still true.  The world of Linked Data requires well-understood metadata policies and operational practices simply so publishers can know what is expected of them and consumers can know what they are getting. Those policies and practices in turn rely on quality measures that producers and consumers of metadata can understand and agree on.  In today’s world — one in which institutional resources are shrinking rather than expanding —  human intervention in the metadata quality assessment process at any level more granular than that of the entire data collection being offered will become the exception rather than the rule.   

While the methods we suggested at the time were self-consciously domain-independent, they did rest on background assumptions about the nature of the services involved and the means by which they were delivered. Our experience had been with data aggregated by communities where the data producers and consumers were to some extent known to one another, using a fairly simple technology that was easy to run and maintain.  In 2013, that is not the case; producers and consumers are increasingly remote from each other, and the technologies used are both more complex and less mature, though that is changing rapidly.

The remainder of this blog post is an attempt to reconsider our framework in that context.

The New World

The Linked Open Data (LOD) community has begun to consider quality issues; there are some noteworthy online discussions, as well as workshops resulting in a number of published papers and online resources.  It is interesting to see where the work that has come from within the LOD community contrasts with the thinking of the library community on such matters, and where it does not.  

In general, the material we have seen leans toward the traditional data-quality concerns of the MIS community.  LOD practitioners seem to have started out by putting far more emphasis than we might on criteria that are essentially audience-dependent, and on operational concerns having to do with the reliability of publishing and consumption apparatus.   As it has evolved, the discussion features an intellectual move away from those audience-dependent criteria, which are usually expressed as “fitness for use”, “relevance”, or something of the sort (we ourselves used the phrase “community expectations”). Instead, most realize that both audience and usage  are likely to be (at best) partially unknown to the publisher, at least at system design time.  In other words, the larger community has begun to grapple with something librarians have known for a while: future uses and the extent of dissemination are impossible to predict.  There is a creative tension here that is not likely to go away.  On the one hand, data developed for a particular community is likely to be much more useful to that community; thus our initial recognition of the role of “community expectations”.  On the other, dissemination of the data may reach far past the boundaries of the community that develops and publishes it.  The hope is that this tension can be resolved by integrating large data pools from diverse sources, or by taking other approaches that result in data models sufficiently large and diverse that “community expectations” can be implemented, essentially, by filtering.

For the LOD community, the path that began with  “fitness-for-use” criteria led quickly to the idea of maintaining a “neutral perspective”. Christian Fürber describes that perspective as the idea that “Data quality is the degree to which data meets quality requirements no matter who is making the requirements”.  To librarians, who have long since given up on the idea of cataloger objectivity, a phrase like “neutral perspective” may seem naive.  But it is a step forward in dealing with data whose dissemination and user community is unknown. And it is important to remember that the larger LOD community is concerned with quality in data publishing in general, and not solely with descriptive metadata, for which objectivity may no longer be of much value.  For that reason, it would be natural to expect the larger community to place greater weight on objectivity in their quality criteria than the library community feels that it can, with a strong preference for quantitative assessment wherever possible.  Librarians and others concerned with data that involves human judgment are theoretically more likely to be concerned with issues of provenance, particularly as they concern who has created and handled the data.  And indeed that is the case.

The new quality criteria, and how they stack up

Here is a simplified comparison of our 2004 criteria with three views taken from the LOD community.

Bruce & Hillmann Dodds, McDonald Flemming
Completeness Completeness
Amount of data
Provenance History
Accuracy Accuracy
Validity of documents
Conformance to expectations Modeling correctness
Modeling granularity
Logical consistency and coherence Directionality
Modeling correctness
Internal consistency
Referential correspondence
Timeliness Currency Timeliness
Accessibility Intelligibility
Accessibility (technical)
Performance (technical)

Placing the “new” criteria into our framework was no great challenge; it appears that we were, and are, talking about many of the same things. A few explanatory remarks:

  • Boundedness has roughly the same relationship to completeness that precision does to recall in information-retrieval metrics. The data is complete when we have everything we want; its boundedness shows high quality when we have only what we want.
  • Flemming’s amount of data criterion talks about numbers of triples and links, and about the interconnectedness and granularity of the data.  These seem to us to be largely completeness criteria, though things to do with linkage would more likely fall under “Logical coherence” in our world. Note, again, a certain preoccupation with things that are easy to count.  In this case it is somewhat unsatisfying; it’s not clear what the number of triples in a triplestore says about quality, or how it might be related to completeness if indeed that is what is intended.
  • Everyone lists criteria that fit well with our notions about provenance. In that connection, the most significant development has been a great deal of work on formalizing the ways in which provenance is expressed.  This is still an active area of research, with a lot to be decided.  In particular, attempts at true domain independence are not fully successful, and will probably never be so.  It appears to us that those working on the problem at DCMI are monitoring the other efforts and incorporating the most worthwhile features.
  • Dodds’ typing criterion — which basically says that dereferenceable URIs should be preferred to string literals  — participates equally in completeness and accuracy categories.  While we prefer URIs in our models, we are a little uneasy with the idea that the presence of string literals is always a sign of low quality.  Under some circumstances, for example, they might simply indicate an early stage of vocabulary evolution.
  • Flemming’s verifiability and validity criteria need a little explanation, because the terms used are easily confused with formal usages and so are a little misleading.  Verifiability bundles a set of concerns we think of as provenance.  Validity of documents is about accuracy as it is found in things like class and property usage.  Curiously, none of Flemming’s criteria have anything to do with whether the information being expressed by the data is correct in what it says about the real world; they are all designed to convey technical criteria.  The concern is not with what the data says, but with how it says it.
  • Dodds’ modeling correctness criterion seems to be about two things: whether or not the model is correctly constructed in formal terms, and whether or not it covers the subject domain in an expected way.  Thus, we assign it to both “Community expectations” and “Logical coherence” categories.
  • Isomorphism has to do with the ability to join datasets together, when they describe the same things.  In effect, it is a more formal statement of the idea that a given community will expect different models to treat similar things similarly. But there are also some very tricky (and often abused) concepts of equivalence involved; these are just beginning to receive some attention from Semantic Web researchers.
  • Licensing has become more important to everyone. That is in part because Linked Data as published in the private sector may exhibit some of the proprietary characteristics we saw as access barriers in 2004, and also because even public-sector data publishers are worried about cost recovery and appropriate-use issues.  We say more about this in a later section.
  • A number of criteria listed under Accessibility have to do with the reliability of data publishing and consumption apparatus as used in production.  Linked Data consumers want to know that the endpoints and triple stores they rely on for data are going to be up and running when they are needed.  That brings a whole set of accessibility and technical performance issues into play.  At least one website exists for the sole purpose of monitoring endpoint reliability, an obvious concern of those who build services that rely on Linked Data sources. Recently, the LII made a decision to run its own mirror of the DrugBank triplestore to eliminate problems with uptime and to guarantee low latency; performance and accessibility had become major concerns. For consumers, due diligence is important.
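
The first remark above, on boundedness and completeness, can be made concrete with a small sketch. It assumes that a dataset's properties can be treated as equally weighted members of a set; the function name and the sample property names are hypothetical, and this is an illustration of the precision/recall analogy rather than a measure proposed by Dodds or Flemming.

```python
def completeness_and_boundedness(expected, present):
    """Return (completeness, boundedness) for a described object.

    completeness is recall-like: the share of expected properties that are present.
    boundedness is precision-like: the share of present properties that were expected.
    """
    expected, present = set(expected), set(present)
    wanted_and_present = expected & present
    completeness = len(wanted_and_present) / len(expected) if expected else 1.0
    boundedness = len(wanted_and_present) / len(present) if present else 1.0
    return completeness, boundedness


# Hypothetical example: what a community expects vs. what a record carries.
expected = {"title", "creator", "date", "subject"}
present = {"title", "creator", "date", "internal_note", "scratch_field"}
completeness, boundedness = completeness_and_boundedness(expected, present)
print(f"completeness={completeness:.2f} boundedness={boundedness:.2f}")
```

Here the record is missing one expected property (lower completeness) and carries two unexpected ones (lower boundedness), which is the distinction the remark draws.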

For us, there is a distinctly different feel to the examples that Dodds, Flemming, and others have used to illustrate their criteria; they seem to be looking at a set of phenomena that has substantial overlap with ours, but is not quite the same.  Part of it is simply the fact, mentioned earlier, that data publishers in distinct domains have distinct biases. For example, those who can’t fully believe in objectivity are forced to put greater emphasis on provenance. Others who are not publishing descriptive data that relies on human judgment feel they can rely on more  “objective” assessment methods.  But the biggest difference in the “new quality” is that it puts a great deal of emphasis on technical quality in the construction of the data model, and much less on how well the data that populates the model describes real things in the real world.  

There are three reasons for that.  The first has to do with the nature of the discussion itself. All quality discussions, simply as discussions, seem to neglect notions of factual accuracy because factual accuracy seems self-evidently a Good Thing; there’s not much to talk about.  Second, the people discussing quality in the LOD world are modelers first, and so quality is seen as adhering primarily to the model itself.  Finally, the world of the Semantic Web rests on the assumption that “anyone can say anything about anything”. For some, the egalitarian interpretation of that statement reaches the level of religion, making it very difficult to measure quality by judging whether something is factual or not; from a purist’s perspective, it’s opinions all the way down.  There is, then, a tendency to rely on formalisms and modeling technique to hold back the tide.

In 2004, we suggested a set of metadata-quality indicators suitable for managers to use in assessing projects and datasets.  An updated version of that table would look like this:

Quality Measure, followed by its Quality Criteria:

Completeness
  • Does the element set completely describe the objects?
  • Are all relevant elements used for each object?
  • Does the data contain everything you expect?
  • Does the data contain only what you expect?

Provenance
  • Who is responsible for creating, extracting, or transforming the metadata?
  • How was the metadata created or extracted?
  • What transformations have been done on the data since its creation?
  • Has a dedicated provenance vocabulary been used?
  • Are there authenticity measures (eg. digital signatures) in place?

Accuracy
  • Have accepted methods been used for creation or extraction?
  • What has been done to ensure valid values and structure?
  • Are default values appropriate, and have they been appropriately used?
  • Are all properties and values valid/defined?

Conformance to expectations
  • Does metadata describe what it claims to?
  • Does the data model describe what it claims to?
  • Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
  • Are compromises documented and in line with community expectations?

Logical consistency and coherence
  • Is data in elements consistent throughout?
  • How does it compare with other data within the community?
  • Is the data model technically correct and well structured?
  • Is the data model aligned with other models in the same domain?
  • Is the model consistent in the direction of relations?

Timeliness
  • Is metadata regularly updated as the resources change?
  • Are controlled vocabularies updated when relevant?

Accessibility
  • Is an appropriate element set for audience and community being used?
  • Is the data and its access methods well-documented, with exemplary queries and URIs?
  • Do things have human-readable labels?
  • Is it affordable to use and maintain?
  • Does it permit further value-adds?
  • Does it permit republication?
  • Is attribution required if the data is redistributed?
  • Are human- and machine-readable licenses available?

Accessibility (technical)
  • Are reliable, performant endpoints available?
  • Will the provider guarantee service (eg. via a service level agreement)?
  • Is the data available in bulk?
  • Are URIs stable?
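
A few of the questions in this table lend themselves to mechanical checks. The following is a minimal sketch, assuming the data is available as plain (subject, property, value) tuples, of how two of them ("Do things have human-readable labels?" and "Are all properties and values valid/defined?") might be approximated. The function names, the example.org URIs, and the sample data are all hypothetical; only the rdfs:label URI is a real identifier.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def label_coverage(triples):
    """Fraction of distinct subjects that carry at least one rdfs:label."""
    subjects = {s for s, _, _ in triples}
    labeled = {s for s, p, _ in triples if p == RDFS_LABEL}
    return len(labeled) / len(subjects) if subjects else 1.0


def undefined_properties(triples, known_properties):
    """Properties used in the data that the declared vocabulary does not define."""
    return sorted({p for _, p, _ in triples} - set(known_properties))


# Hypothetical sample data; "sponser" is a deliberately misspelled property.
EX = "http://example.org/"
triples = [
    (EX + "act1", RDFS_LABEL, "Example Act One"),
    (EX + "act1", EX + "enactedIn", "1990"),
    (EX + "act2", EX + "enactedIn", "1995"),
    (EX + "act2", EX + "sponser", "unknown"),
]
known = {RDFS_LABEL, EX + "enactedIn", EX + "sponsor"}

coverage = label_coverage(triples)
unknown = undefined_properties(triples, known)
print(coverage, unknown)
```

Checks like these only address the countable questions; most of the table still calls for human judgment.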


The differences in the example questions reflect the differences of approach that we discussed earlier. Also, the new approach separates criteria related to technical accessibility from questions that relate to intellectual accessibility. Indeed, we suspect that “accessibility” may have been too broad a notion in the first place. Wider deployment of metadata systems and a much greater, still-evolving variety of producer-consumer scenarios and relationships have created a need to break it down further.  There are as many aspects to accessibility as there are types of barriers — economic, technical, and so on.

As before, our list is not a checklist or a set of must-haves, nor does it contain all the questions that might be asked.  Rather, we intend it as a list of representative questions that might be asked when a new Linked Data source is under consideration.  They are also questions that should inform policy discussion around the uses of Linked Data by consuming libraries and publishers.  

That is work that can be formalized and taken further. One intriguing recent development is work toward a Data Quality Management Vocabulary.   Its stated aims are to

  • support the expression of quality requirements in the same language, at web scale;
  • support the creation of consensual agreements about quality requirements;
  • increase transparency around quality requirements and measures;
  • enable checking for consistency among quality requirements; and
  • generally reduce the effort needed for data quality management activities.


The apparatus to be used is a formal representation of “quality-relevant” information.   We imagine that the researchers in this area are looking forward to something like automated e-commerce in Linked Data, or at least a greater ability to do corpus-level quality assessment at a distance.  Of course, “fitness-for-use” and other criteria that can really only be seen from the perspective of the user will remain important, and there will be interplay between standardized quality and performance measures (on the one hand) and audience-relevant features on the other.   One is rather reminded of the interplay of technical specifications and “curb appeal” in choosing a new car.  That would be an important development in a Semantic Web industry that has not completely settled on what a car is really supposed to be, let alone how to steer or where one might want to go with it.


Libraries have always been concerned with quality criteria in their work as creators of descriptive metadata.  One of our purposes here has been to show how those criteria will evolve as libraries become publishers of Linked Data, as we believe that they must. That much seems fairly straightforward, and there are many processes and methods by which quality criteria can be embedded in the process of metadata creation and management.

More difficult, perhaps, is deciding how these criteria can be used to construct policies for Linked Data consumption.  As we have said many times elsewhere, we believe that there are tremendous advantages and efficiencies that can be realized by linking to data and descriptions created by others, notably in connecting up information about the people and places that are mentioned in legislative information with outside information pools.   That will require care and judgement, and quality criteria such as these will be the basis for those discussions.  Not all of these criteria have matured — or ever will mature — to the point where hard-and-fast metrics exist.  We are unlikely to ever see rigid checklists or contractual clauses with bullet-pointed performance targets, at least for many of the factors we have discussed here. Some of the new accessibility criteria might be the subject of service-level agreements or other mechanisms used in electronic publishing or database-access contracts.  But the real use of these criteria is in assessments that will be made long before contracts are negotiated and signed.  In that setting, these criteria are simply the lenses that help us know quality when we see it.



Thomas R. Bruce is the Director of the Legal Information Institute at the Cornell Law School.

Diane Hillmann is a principal in Metadata Management Associates, and a long-time collaborator with the Legal Information Institute.  She is currently a member of the Advisory Board for the Dublin Core Metadata Initiative (DCMI), and was co-chair of the DCMI/RDA Task Group.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

There have been a series of efforts to create a national legislative data standard – one master XML format to which all states will adhere for bills, laws, and regulations. Those efforts have gone poorly.

Few states provide bulk downloads of their laws. None provide APIs. Although nearly all states provide websites for people to read state laws, they are all objectively terrible, in ways that demonstrate that they were probably pretty impressive in 1995. Despite the clear need for improved online display of laws, the lack of a standard data format and the general lack of bulk data has enabled precious few efforts in the private sector. (Notably, there is Robb Schecter’s site, which provides vastly improved experiences for the laws of California, Oregon, and New York. There was also a site built experimentally by Ari Hershowitz that was used as a platform for last year’s California Laws Hackathon.)

A significant obstacle to prior efforts has been the perceived need to create a single standard, one that will accommodate the various textual legal structures that are employed throughout government. This is a significant practical hurdle on its own, but failure is all but guaranteed by also engaging major stakeholders and governments to establish a standard that will enjoy wide support and adoption.

What if we could stop letting the perfect be the enemy of the good? What if we ignore the needs of the outliers, and establish a “good enough” system, one that will at first simply work for most governments? And what if we completely skip the step of establishing a standard XML format? Wouldn’t that get us something, a thing superior to the nothing that we currently have?

The State Decoded
This is the philosophy behind The State Decoded. Funded by the John S. and James L. Knight Foundation, The State Decoded is a free, open source program to put legal codes online, and it does so by simply skipping over the problems that have hampered prior efforts. The project does not aspire to create any state law websites on its own but, instead, to provide the software to enable others to do so.

Still in its development (it’s at version 0.4), The State Decoded leaves it to each implementer to gather up the contents of the legal code in question and interface it with the program’s internal API. This could be done via screen-scraping off of an existing state code website, modifying the parser to deal with a bulk XML file, converting input data into the program’s simple XML import format, or by a few other methods. While a non-trivial task, it’s something that can be knocked out in an afternoon, thus avoiding the need to create a universal data format and to persuade Wexis to provide their data in that format.

The magic happens after the initial data import. The State Decoded takes that raw legal text and uses it to populate a complete, fully functional website for end-users to search and browse those laws. By packaging the Solr search engine and employing some basic textual analysis, every law is cross-referenced with other laws that cite it and laws that are textually similar. If there exists a repository of legal decisions for the jurisdiction in question, that can be incorporated, too, displaying a list of the court cases that cite each section. Definitions are detected, loaded into a dictionary, and make the laws self-documenting. End users can post comments to each law. Bulk downloads are created, letting people get a copy of the entire legal code, its structural elements, or the automatically assembled dictionary. And there’s a REST-ful, JSON-based API, ready to be used by third parties. All of this is done automatically, quickly, and seamlessly. The time elapsed varies, depending on server power and the length of the legal code, but it generally takes about twenty minutes from start to finish.
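
To illustrate the cross-referencing idea described above, here is a minimal sketch of building a "cited by" index from section text. It is not The State Decoded's actual implementation: the function name, the regular expression (which assumes citations of the form "§ 18.2-32"), and the sample sections are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed citation shape: a section sign followed by a number such as "18.2-32".
CITATION = re.compile(r"§\s*([0-9]+(?:\.[0-9]+)*-[0-9]+(?:\.[0-9]+)*)")


def build_cited_by(sections):
    """Map each section number to the sorted list of sections that cite it."""
    cited_by = defaultdict(set)
    for number, text in sections.items():
        for target in CITATION.findall(text):
            # Ignore self-citations and references to sections we do not hold.
            if target != number and target in sections:
                cited_by[target].add(number)
    return {target: sorted(sources) for target, sources in cited_by.items()}


# Hypothetical three-section code.
sections = {
    "1-100": "Definitions used in this title apply to § 1-200.",
    "1-200": "A person who violates § 1-100 or § 1-300 is subject to a penalty.",
    "1-300": "Exceptions to § 1-200 are listed here.",
}
cited_by = build_cited_by(sections)
print(cited_by)
```

A real legal code would need a citation pattern tuned to that jurisdiction's numbering, which is one reason this kind of step is handled per implementation.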

The State Decoded is a free program, released under the GNU General Public License. Anybody can use it to make legal codes more accessible online. There are no strings attached.

It has already been deployed in two states, Virginia and Florida, despite not actually being a finished project yet.

State Variations
The striking variations in the structures of legal codes within the U.S. required the establishment of an appropriately flexible system to store and render those codes. Some legal codes are broad and shallow (e.g., Louisiana, Oklahoma), while others are narrow and deep (e.g., Connecticut, Delaware). Some list their sections by natural sort order, some in decimal, a few arbitrarily switch between the two. Many have quirks that will require further work to accommodate.
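
The difference between the two sort orders is easy to see in a small sketch. This is one reading of the distinction, assuming that "natural" order compares the numeric parts as whole numbers while "decimal" order reads the section number as a decimal fraction; the function names and sample section numbers are hypothetical.

```python
import re
from decimal import Decimal


def natural_key(section):
    """Compare runs of digits as integers, so '2.10' sorts after '2.9'."""
    key = []
    for part in re.split(r"(\d+)", section):
        if not part:
            continue
        # Tag each element so numbers and text never get compared directly.
        key.append((0, int(part), "") if part.isdecimal() else (1, 0, part))
    return key


def decimal_key(section):
    """Read the section number as a decimal fraction, so '2.10' sorts before '2.9'."""
    return Decimal(section)


numbers = ["2.9", "2.10", "2.15", "10.1"]
natural_order = sorted(numbers, key=natural_key)
decimal_order = sorted(numbers, key=decimal_key)
print(natural_order)
print(decimal_order)
```

A code that switches between the two conventions would need the sort key chosen per title or chapter, which is the kind of quirk a flexible storage model has to allow for.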

For example, California does not provide a catch line for their laws, but just a section number. One must read through a law to know what it actually does, rather than being able to glance at the title and get the general idea. Because this is a wildly impractical approach for a state code, the private sector has picked up the slack – Westlaw and LexisNexis each write their own titles for those laws, neatly solving the problem for those with the financial resources to pay for those companies’ offerings. To handle a problem like this, The State Decoded either needs to be able to display legal codes that lack section titles, or pointedly not support this inferior approach, and instead support the incorporation of third-party sources of title. In California, this might mean mining the section titles used internally by the California Law Revision Commission, and populating the section titles with those. (And then providing a bulk download of that data, allowing it to become a common standard for California’s section titles.)

Many state codes have oddities like this. The State Decoded combines flexibility with open source code to make it possible to deal with these quirks on a case-by-case basis. The alternative approach is too convoluted and quixotic to consider.

There is strong interest in seeing this software adapted to handle regulations, especially from cash-strapped state governments looking to modernize their regulatory delivery process. Although this process is still in an early stage, it looks like rather few modifications will be required to support the storage and display of regulations within The State Decoded.

More significant modifications would be needed to integrate registers of regulations, but the substantial public benefits that would provide make it an obvious and necessary enhancement. The present process required to identify the latest version of a regulation is the stuff of parody. To select a state at random, here are the instructions provided on Kansas’s website:

To find the latest version of a regulation online, a person should first check the table of contents in the most current Kansas Register, then the Index to Regulations in the most current Kansas Register, then the current K.A.R. Supplement, then the Kansas Administrative Regulations. If the regulation is found at any of these sequential steps, stop and consider that version the most recent.

If Kansas has electronic versions of all this data, it seems almost punitive not to put it all in one place, rather than forcing people to look in four places. It seems self-evident that the current Kansas Register, the Index to Regulations, the K.A.R. Supplement, and the Kansas Administrative Regulations should have APIs, with a common API atop all four, which would make it trivial to present somebody with the current version of a regulation with a single request. By indexing registers of regulations in the manner that The State Decoded indexes court opinions, it would at least be possible to show people all activity around a given regulation, if not simply show them the present version of it, since surely that is all that most people want.
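
The "common API atop all four" amounts to automating the sequential check that the Kansas instructions ask a person to do by hand. A minimal sketch, assuming each source can be queried by regulation number, follows; the function name, the in-memory dictionaries standing in for the four sources, and the regulation numbers and texts are all hypothetical.

```python
def latest_version(regulation_id, sources):
    """Return (source_name, text) from the first source that has the regulation.

    `sources` is an ordered list of (name, lookup) pairs, most current first,
    where each lookup maps a regulation number to its text.
    """
    for name, lookup in sources:
        text = lookup.get(regulation_id)
        if text is not None:
            return name, text
    return None


# Hypothetical stand-ins for the four places the instructions list, in order.
sources = [
    ("Kansas Register table of contents", {}),
    ("Kansas Register Index to Regulations", {"1-2-3": "text as amended this month"}),
    ("K.A.R. Supplement", {"1-2-3": "text as amended last year", "4-5-6": "supplement text"}),
    ("Kansas Administrative Regulations", {"1-2-3": "original text", "4-5-6": "original text", "7-8-9": "original text"}),
]

print(latest_version("1-2-3", sources))
print(latest_version("7-8-9", sources))
```

The first call stops at the second source, just as the instructions direct a reader to "stop and consider that version the most recent".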

A Tapestry of Data
In a way, what makes The State Decoded interesting is not anything that it actually does, but instead what others might do with the data that it emits. By capitalizing on the program’s API and healthy collection of bulk downloads, clever individuals will surely devise uses for state legal data that cannot presently be envisioned.

The structural value of state laws is evident when considered within the context of other open government data.

Major open government efforts are confined largely to the upper-right quadrant of this diagram – those matters concerned with elections and legislation. There is also some excellent work being done in opening up access to court rulings, indexing scholarly publications, and nascent work in indexing the official opinions of attorneys general. But the latter group cannot be connected to the former group without opening up access to state laws. Courts do not make rulings about bills, of course – it is laws with which they concern themselves. Law journals cite far more laws than they do bills. To weave a seamless tapestry of data that connects court decisions to state laws to legislation to election results to campaign contributions, it is necessary to have a source of rich data about state laws. The State Decoded aims to provide that data.

Next Steps
The most important next step for The State Decoded is to complete it, releasing a version 1.0 of the software. It has dozens of outstanding issues – both bug fixes and new features – so this process will require some months. In that period, the project will continue to work with individuals and organizations in states throughout the nation who are interested in deploying The State Decoded to help them get started.

Ideally, The State Decoded will be obviated by states providing both bulk data and better websites for their codes and regulations. But in the current economic climate, neither is likely to be prioritized within state budgets, so unfortunately there’s liable to remain a need for the data provided by The State Decoded for some years to come. The day when it is rendered useless will be a good day.

Waldo Jaquith is a website developer with the Miller Center at the University of Virginia in Charlottesville, Virginia. He is a News Challenge Fellow with the John S. and James L. Knight Foundation and runs Richmond Sunlight, an open legislative service for Virginia. Jaquith previously worked for the White House Office of Science and Technology Policy, for which he developed, and is now a member of the White House Open Data Working Group.
[Editor’s Note: For topic-related VoxPopuLII posts please see: Ari Hershowitz & Grant Vergottini, Standardizing the World’s Legal Information – One Hackathon At a Time; Courtney Minick, Universal Citation for State Codes; John Sheridan,; and Robb Schecter, The Recipe for Better Legal Information Services. ]
