CourtListener: Where We Are and Where We'd Like to Go

Mar 04 2013

At CourtListener, we are making a free database of court opinions with the ultimate goal of providing the entire U.S. case-law corpus to the world for free and combining it with cutting-edge search and research tools. We, like most readers of this blog, believe that for justice to truly prevail, the law must be open and equally accessible to everybody.

It is astonishing to think that the entire U.S. case-law corpus is not currently available to the world at no cost. Many have started down this path and stopped, so we know we’ve set a high goal for a humble open source project. From time to time it’s worth taking a moment to reflect on where we are and where we’d like to go in the coming years.

The current state of affairs

We’ve created a good search engine that can provide results based on a number of characteristics of legal cases. Our users can search for opinions by the case name, date, or any text that’s in the opinion, and can refine by court, by precedential status or by citation. The results are pretty good, but are limited by the data we have and the “relevance signals” that we have in place.

A good legal search engine will use a number of factors (a.k.a. “relevance signals”) to promote documents to the top of its listings. Things like:

How recent is the opinion?
How many other opinions have cited it?
How many journals have cited it?
How long is it?
How important is the court that heard the case?
Is the case in the jurisdiction of the user?
Is the opinion one that the user has looked at before?
What was the subsequent treatment of the opinion?

And so forth. All of the above help to make search results better, and we’ve seen paid legal search tools make great strides in their products by integrating these and other factors. At CourtListener, we’re using a number of the above, but we need to go further. We need to use as many factors as possible, and we need to learn how the factors interact with each other, which ones are the most important, and which lead to the best results.
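To make the idea of interacting relevance signals concrete, here is a minimal sketch of one way a few of the signals above could be blended into a single score. It is an illustration only, not a description of how CourtListener ranks results: the `Opinion` fields, the `WEIGHTS` values, and the `relevance` and `rank` functions are all hypothetical, and a simple weighted sum is just one of many possible ways to combine signals.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Opinion:
    name: str
    date_filed: date
    cited_by_count: int  # how many other opinions cite this one
    court_weight: float  # hypothetical 0..1 rating of the court's importance
    jurisdiction: str


# Hypothetical weights that sum to 1.0; real values would have to be tuned.
WEIGHTS = {
    "text": 0.50,
    "citations": 0.25,
    "recency": 0.10,
    "court": 0.10,
    "jurisdiction": 0.05,
}


def relevance(op, text_score, user_jurisdiction, today, max_citations):
    """Blend a 0..1 text-match score with four of the signals listed above."""
    age_years = max((today - op.date_filed).days, 0) / 365.25
    recency = 1.0 / (1.0 + age_years)  # 1.0 for a new opinion, decaying with age
    # Log scaling keeps a handful of very heavily cited opinions from swamping the rest.
    citations = math.log1p(op.cited_by_count) / math.log1p(max(max_citations, 1))
    in_jurisdiction = 1.0 if op.jurisdiction == user_jurisdiction else 0.0
    return (
        WEIGHTS["text"] * text_score
        + WEIGHTS["citations"] * citations
        + WEIGHTS["recency"] * recency
        + WEIGHTS["court"] * op.court_weight
        + WEIGHTS["jurisdiction"] * in_jurisdiction
    )


def rank(opinions, text_scores, user_jurisdiction, today):
    """Return opinion names ordered from most to least relevant."""
    max_citations = max((op.cited_by_count for op in opinions), default=0)
    scored = [
        (relevance(op, text_scores[op.name], user_jurisdiction, today, max_citations), op.name)
        for op in opinions
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

Even in a toy like this, the weights matter: raising the recency weight far enough lets a lightly cited recent opinion outrank an old, heavily cited one, which is exactly the kind of interaction that has to be studied rather than guessed.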

A different problem we’re working to solve at CourtListener is getting primary legal materials freely onto the Web. What good is a search engine if the opinion you need isn’t there in the first place? We currently have about 800,000 federal opinions, including West’s second and third Federal Reporters, F.2d and F.3d, and the entire Supreme Court corpus. This is good and we’re very proud of the quality of our database; we think it’s the best free resource there is. Every day we add the opinions from the Circuit Courts in the federal system and the U.S. Supreme Court, in nearly real time. But we need to go further: we need to add state opinions, and we need to add not just the latest opinions but all the historical ones as well.

This sounds daunting, but it’s a problem that we hope will be solved in the next few years. Although it’s taking longer than we would like, in time we are confident that all of the important historical legal data will make its way to the open Internet. Primary legal sources are already in the public domain, so now it’s just a matter of getting them into good electronic formats so that anyone can access them and anyone can re-use them. If an opinion only exists as an unsearchable scan, in a bound book, or behind a pricey paywall, then it’s closed to many people who should have access to it. As part of our citation identification project, which I’ll talk about next, we’re working to get the most important documents properly digitized.

Our citation identification project was developed last year by U.C. Berkeley School of Information students Rowyn McDonald and Karen Rustad to identify and cross-link any citations found in our database. This is a great feature that makes all the citations in our corpus link to the correct opinions, if we have them. For example, if you’re reading an opinion that has a reference to Roe v. Wade, you can click on the citation, and you’ll be off and reading Roe v. Wade. By the way, if you’re wondering how many Federal Appeals opinions cite Roe v. Wade, the number in our system is 801 opinions (and counting). If you’re wondering what the most-cited opinion in our system is, you may be bemused: With about 10,000 citations, it’s an opinion about ineffective assistance of legal counsel in death penalty cases, Strickland v. Washington, 466 U.S. 668 (1984).
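For readers curious about what citation identification involves, the following is a rough sketch of the general technique: find strings shaped like "volume reporter page" and look them up in an index of opinions. It is not the code written for CourtListener. The three-entry `REPORTERS` list, the function names, and the integer opinion IDs are assumptions made for illustration, and real citation formats are far more varied than this pattern admits.

```python
import re

# A deliberately tiny reporter list (an assumption for this sketch).
REPORTERS = ["U.S.", "F.2d", "F.3d"]

# Matches citations of the form "<volume> <reporter> <page>", e.g. "466 U.S. 668".
CITATION_RE = re.compile(
    r"\b(\d{1,4})\s+("
    + "|".join(re.escape(r) for r in REPORTERS)
    + r")\s+(\d{1,5})\b"
)


def find_citations(text):
    """Return every citation found in the text, normalized to single spaces."""
    return [f"{vol} {rep} {page}" for vol, rep, page in CITATION_RE.findall(text)]


def resolve(text, index):
    """Split citations into those present in the index and those missing.

    `index` is a hypothetical mapping from citation string to opinion ID.
    """
    found, missing = {}, []
    for cite in find_citations(text):
        if cite in index:
            found[cite] = index[cite]
        else:
            missing.append(cite)
    return found, missing
```

Calling `resolve` on the text of an opinion yields the citations that can be turned into links, along with a list of cited opinions that are not yet in the collection.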

A feature we’ll be working on soon will tie into our citation system to help us close any gaps in our corpus. Once the feature is done, whenever an opinion is cited that we don’t yet have, our users will be able to pay a small amount (one or two dollars) to sponsor the digitization of that opinion. We’ll do the work of digitizing it, and after that point the opinion will be available to the public for free.

This brings us to the next big feature we added last year: bulk data. Because we want to assist academic researchers and others who might have a use for a large database of court opinions, we provide free bulk downloads of everything we have. Like Carl Malamud’s Resource.org (to whom we owe a great debt for his efforts to collect opinions and provide them to others for free and for his direct support of our efforts), we have giant files you can download that provide thousands of opinions in computer-readable format. These downloads are available by court and date, and include thousands of fixes to the Resource.org corpus. They also include something you can’t find anywhere else: the citation network. As part of the metadata associated with each opinion in our bulk download files, you can look and see which opinions it cites as well as which opinions cite it. This provides a valuable new source of data that we are very eager for others to work with. Of course, as new opinions are added to our system, we update our downloads with the new citations and the new information.
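As a small illustration of what a researcher might do with a citation network, the sketch below inverts a "cites" mapping into a "cited by" mapping and lists the most-cited opinions. The input shape (a dictionary from an opinion identifier to the identifiers it cites) is an assumption made for this example; it is not the layout of the bulk files, which would need to be parsed into something like this first.

```python
from collections import defaultdict


def build_cited_by(cites):
    """Invert {citing: [cited, ...]} into {cited: {citing, ...}}."""
    cited_by = defaultdict(set)
    for citing, cited_list in cites.items():
        for cited in cited_list:
            if cited != citing:  # ignore self-references
                cited_by[cited].add(citing)
    return cited_by


def most_cited(cites, n=3):
    """Return the n most-cited opinions as (opinion, citation_count) pairs."""
    cited_by = build_cited_by(cites)
    # Sort by descending count, then by identifier so ties are stable.
    ranked = sorted(cited_by.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(opinion, len(citers)) for opinion, citers in ranked[:n]]
```

Using sets for the citing opinions means that an opinion citing the same case several times is counted once, which matches the "how many opinions cite it" reading of a citation count.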

Finally, we would be remiss if we didn’t mention our hallmark feature: daily, weekly and monthly email alerts. For any query you put into CourtListener, you can request that we email you whenever there are new results. This feature was the first one we created, and one that we continue to be excited about. This year we haven’t made any big innovations to our email alerts system, but its popularity has continued to grow, with more than 500 alerts run each day. Next year, we hope to add a couple of small enhancements to this feature so it’s smoother and easier to use.

The future

I’ve hinted at a lot of our upcoming work in the sections above, but what are the big-picture features that we think we need to achieve our goals?

We do all of our planning in the open, but we have a few things cooking in the background that we hope to eventually build. Among them are ideas for adding oral argument audio, case briefs, and data from PACER. Adding these new types of information to CourtListener is a must if we want to be more useful for research purposes, but doing so is a long-term goal, given the complexity of doing them well.

We also plan to build an opinion classifier that could automatically, and without human intervention, determine the subsequent treatment of opinions. Done right, this would allow our users to know at a glance if the opinion they’re reading was subsequently followed, criticized, or overruled, making our system even more valuable to our users.

In the next few years, we’ll continue building out these features, but as an open-source and open-data project, everything we do is in the open. You can see our plans on our feature tracker, our bugs in our bug tracker, and can get in touch in our forum. The next few years look to be very exciting as we continue building our collection and our platform for legal research. Let’s see what the new year brings!

Michael Lissner is the co-founder and lead developer of CourtListener, a project that works to make the law more accessible to all. He graduated from U.C. Berkeley’s School of Information, and when he’s not working on CourtListener he develops search and eDiscovery solutions for law firms. Michael is passionate about bringing greater access to our primary legal materials, about how technology can replace old legal models, and about open source, community-driven approaches to legal research.

Brian W. Carver is Assistant Professor at the U.C. Berkeley School of Information where he does research on and teaches about intellectual property law and cyberlaw. He is also passionate about the public’s access to the law. In 2009 and 2010 he advised an I School Masters student, Michael Lissner, on the creation of CourtListener.com, an alert service covering the U.S. federal appellate courts. After Michael’s graduation, he and Brian continued working on the site and have grown the database of opinions to include over 750,000 documents. In 2011 and 2012, Brian advised I School Masters students Rowyn McDonald and Karen Rustad on the creation of a legal citator built on the CourtListener database.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed. The information above should not be considered legal advice. If you require legal representation, please consult a lawyer.

Metadata Quality in a Linked Data Context


Jan 24 2013

Van Winkle wakes

In this post, we return to a topic we first visited in a book chapter in 2004. At that time, one of us (Bruce) was an electronic publisher of Federal court cases and statutes, and the other (Hillmann, herself a former law cataloger) was working with large, aggregated repositories of scientific papers as part of the National Science Digital Library project. Then, as now, we were concerned that little attention was being paid to the practical tradeoffs involved in publishing high-quality metadata at low cost. There was a tendency to design metadata schemas that said absolutely everything that could be said about an object, often at the expense of obscuring what needed to be said about it while running up unacceptable costs. Though we did not have a name for it at the time, we were already deeply interested in least-cost, use-case-driven approaches to the design of metadata models, and that naturally led us to wonder what “good” metadata might be. The result was “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”, published as a chapter in an ALA publication, Metadata in Practice.

In that chapter, we attempted to create a framework for talking about (and evaluating) metadata quality. We were concerned primarily with metadata as we were then encountering it: in aggregations of repositories containing scientific preprints, educational resources, and in caselaw and other primary legal materials published on the Web. We hoped we could create something that would be both domain-independent and useful to those who manage and evaluate metadata projects. Whether or not we succeeded is for others to judge.

The Original Framework

At that time, we identified seven major components of metadata quality. Here, we reproduce a part of a summary table that we used to characterize the seven measures. We suggested questions that might be used to draw a bead on the various measures we proposed:

Quality Measure	Quality Criteria
Completeness	Does the element set completely describe the objects? Are all relevant elements used for each object?
Provenance	Who is responsible for creating, extracting, or transforming the metadata? How was the metadata created or extracted? What transformations have been done on the data since its creation?
Accuracy	Have accepted methods been used for creation or extraction? What has been done to ensure valid values and structure? Are default values appropriate, and have they been appropriately used?
Conformance to expectations	Does metadata describe what it claims to? Are controlled vocabularies aligned with audience characteristics and understanding of the objects? Are compromises documented and in line with community expectations?
Logical consistency and coherence	Is data in elements consistent throughout? How does it compare with other data within the community?
Timeliness	Is metadata regularly updated as the resources change? Are controlled vocabularies updated when relevant?
Accessibility	Is an appropriate element set for audience and community being used? Is it affordable to use and maintain? Does it permit further value-adds?

There are, of course, many possible elaborations of these criteria, and many other questions that help get at them. Almost nine years later, we believe that the framework remains both relevant and highly useful, although (as we will discuss in a later section) we need to think carefully about whether and how it relates to the quality standards that the Linked Open Data (LOD) community is discovering for itself, and how it and other standards should affect library and publisher practices and policies.

… and the environment in which it was created

Our work was necessarily shaped by the environment we were in. Though we never really said so explicitly, we were looking for quality not only in the data itself, but in the methods used to organize, transform and aggregate it across federated collections. We did not, however, anticipate the speed or scale at which standards-based methods of data organization would be applied. Commonly used standards like FOAF, models such as those contained in schema.org, and lightweight modelling apparatus like SKOS are all things that have emerged into common use since, and of course the use of Dublin Core (our main focus eight years ago) has continued even as the standard itself has been refined. These days, an expanded toolset makes it even more important that we have a way to talk about how well the tools fit the job at hand, and how well they have been applied. An expanded set of design choices accentuates the need to talk about how well choices have been made in particular cases.

Although our work took its inspiration from quality standards developed by a government statistical service, we had not really thought through the sheer multiplicity of information services that were available even then. We were concerned primarily with work that had been done with descriptive metadata in digital libraries, but of course there were, and are, many more people publishing and consuming data in both the governmental and private sectors (to name just two). Indeed, there was already a substantial literature on data quality that arose from within the management information systems (MIS) community, driven by concerns about the reliability and quality of mission-critical data used and traded by businesses. In today’s wider world, where work with library metadata will be strongly informed by the Linked Open Data techniques developed for a diverse array of data publishers, we need to take a broader view.

Finally, we were driven then, as we are now, by managerial and operational concerns. As practitioners, we were well aware that metadata carries costs, and that human judgment is expensive. We were looking for a set of indicators that would spark and sustain discussion about costs and tradeoffs. At that time, we were mostly worried that libraries were not giving costs enough attention, and were designing metadata projects that were unrealistic given the level of detail or human intervention they required. That is still true. The world of Linked Data requires well-understood metadata policies and operational practices simply so publishers can know what is expected of them and consumers can know what they are getting. Those policies and practices in turn rely on quality measures that producers and consumers of metadata can understand and agree on. In today’s world — one in which institutional resources are shrinking rather than expanding — human intervention in the metadata quality assessment process at any level more granular than that of the entire data collection being offered will become the exception rather than the rule.

While the methods we suggested at the time were self-consciously domain-independent, they did rest on background assumptions about the nature of the services involved and the means by which they were delivered. Our experience had been with data aggregated by communities where the data producers and consumers were to some extent known to one another, using a fairly simple technology that was easy to run and maintain. In 2013, that is not the case; producers and consumers are increasingly remote from each other, and the technologies used are both more complex and less mature, though that is changing rapidly.

The remainder of this blog post is an attempt to reconsider our framework in that context.

The New World

The Linked Open Data (LOD) community has begun to consider quality issues; there are some noteworthy online discussions, as well as workshops resulting in a number of published papers and online resources. It is interesting to see where the work that has come from within the LOD community contrasts with the thinking of the library community on such matters, and where it does not.

In general, the material we have seen leans toward the traditional data-quality concerns of the MIS community. LOD practitioners seem to have started out by putting far more emphasis than we might on criteria that are essentially audience-dependent, and on operational concerns having to do with the reliability of publishing and consumption apparatus. As it has evolved, the discussion features an intellectual move away from those audience-dependent criteria, which are usually expressed as “fitness for use”, “relevance”, or something of the sort (we ourselves used the phrase “community expectations”). Instead, most realize that both audience and usage are likely to be (at best) partially unknown to the publisher, at least at system design time. In other words, the larger community has begun to grapple with something librarians have known for a while: future uses and the extent of dissemination are impossible to predict. There is a creative tension here that is not likely to go away. On the one hand, data developed for a particular community is likely to be much more useful to that community; thus our initial recognition of the role of “community expectations”. On the other, dissemination of the data may reach far past the boundaries of the community that develops and publishes it. The hope is that this tension can be resolved by integrating large data pools from diverse sources, or by taking other approaches that result in data models sufficiently large and diverse that “community expectations” can be implemented, essentially, by filtering.

For the LOD community, the path that began with “fitness-for-use” criteria led quickly to the idea of maintaining a “neutral perspective”. Christian Fürber describes that perspective as the idea that “Data quality is the degree to which data meets quality requirements no matter who is making the requirements”. To librarians, who have long since given up on the idea of cataloger objectivity, a phrase like “neutral perspective” may seem naive. But it is a step forward in dealing with data whose dissemination and user community are unknown. And it is important to remember that the larger LOD community is concerned with quality in data publishing in general, and not solely with descriptive metadata, for which objectivity may no longer be of much value. For that reason, it would be natural to expect the larger community to place greater weight on objectivity in their quality criteria than the library community feels that it can, with a strong preference for quantitative assessment wherever possible. Librarians and others concerned with data that involves human judgment are theoretically more likely to be concerned with issues of provenance, particularly as they concern who has created and handled the data. And indeed that is the case.

The new quality criteria, and how they stack up

Here is a simplified comparison of our 2004 criteria with three views taken from the LOD community.

Bruce & Hillmann	Dodds, McDonald	Flemming
Completeness	Completeness, Boundedness, Typing	Amount of data
Provenance	History, Attribution, Authoritative	Verifiability
Accuracy	Accuracy, Typing	Validity of documents
Conformance to expectations	Modeling correctness, Modeling granularity, Isomorphism	Uniformity
Logical consistency and coherence	Directionality, Modeling correctness, Internal consistency, Referential correspondence, Connectedness	Consistency
Timeliness	Currency	Timeliness
Accessibility	Intelligibility, Licensing, Sustainable	Comprehensibility, Versatility, Licensing
Accessibility (technical) Performance (technical)

Placing the “new” criteria into our framework was no great challenge; it appears that we were, and are, talking about many of the same things. A few explanatory remarks:

Boundedness has roughly the same relationship to completeness that precision does to recall in information-retrieval metrics. The data is complete when we have everything we want; its boundedness shows high quality when we have only what we want.
Flemming’s amount of data criterion talks about numbers of triples and links, and about the interconnectedness and granularity of the data. These seem to us to be largely completeness criteria, though things to do with linkage would more likely fall under “Logical coherence” in our world. Note, again, a certain preoccupation with things that are easy to count. In this case it is somewhat unsatisfying; it’s not clear what the number of triples in a triplestore says about quality, or how it might be related to completeness if indeed that is what is intended.
Everyone lists criteria that fit well with our notions about provenance. In that connection, the most significant development has been a great deal of work on formalizing the ways in which provenance is expressed. This is still an active area of research, with a lot to be decided. In particular, attempts at true domain independence are not fully successful, and will probably never be so. It appears to us that those working on the problem at DCMI are monitoring the other efforts and incorporating the most worthwhile features.
Dodds’ typing criterion — which basically says that dereferenceable URIs should be preferred to string literals — participates equally in completeness and accuracy categories. While we prefer URIs in our models, we are a little uneasy with the idea that the presence of string literals is always a sign of low quality. Under some circumstances, for example, they might simply indicate an early stage of vocabulary evolution.
Flemming’s verifiability and validity criteria need a little explanation, because the terms used are easily confused with formal usages and so are a little misleading. Verifiability bundles a set of concerns we think of as provenance. Validity of documents is about accuracy as it is found in things like class and property usage. Curiously, none of Flemming’s criteria have anything to do with whether the information being expressed by the data is correct in what it says about the real world; they are all designed to convey technical criteria. The concern is not with what the data says, but with how it says it.
Dodds’ modeling correctness criterion seems to be about two things: whether or not the model is correctly constructed in formal terms, and whether or not it covers the subject domain in an expected way. Thus, we assign it to both “Community expectations” and “Logical coherence” categories.
Isomorphism has to do with the ability to join datasets together, when they describe the same things. In effect, it is a more formal statement of the idea that a given community will expect different models to treat similar things similarly. But there are also some very tricky (and often abused) concepts of equivalence involved; these are just beginning to receive some attention from Semantic Web researchers.
Licensing has become more important to everyone. That is in part because Linked Data as published in the private sector may exhibit some of the proprietary characteristics we saw as access barriers in 2004, and also because even public-sector data publishers are worried about cost recovery and appropriate-use issues. We say more about this in a later section.
A number of criteria listed under Accessibility have to do with the reliability of data publishing and consumption apparatus as used in production. Linked Data consumers want to know that the endpoints and triple stores they rely on for data are going to be up and running when they are needed. That brings a whole set of accessibility and technical performance issues into play. At least one website exists for the sole purpose of monitoring endpoint reliability, an obvious concern of those who build services that rely on Linked Data sources. Recently, the LII made a decision to run its own mirror of the DrugBank triplestore to eliminate problems with uptime and to guarantee low latency; performance and accessibility had become major concerns. For consumers, due diligence is important.
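The analogy drawn above between completeness and boundedness on the one hand, and recall and precision on the other, can be written out as a short calculation. This is a sketch under a strong simplifying assumption: that "what we want" and "what we have" can each be listed as a set of items (here, element names). The function name and the sample element names are hypothetical.

```python
def completeness_and_boundedness(expected, present):
    """Score a dataset against the set of things we wanted to find in it.

    completeness (like recall):    share of expected items that are present
    boundedness  (like precision): share of present items that were expected
    """
    expected, present = set(expected), set(present)
    wanted_and_present = expected & present
    completeness = len(wanted_and_present) / len(expected) if expected else 1.0
    boundedness = len(wanted_and_present) / len(present) if present else 1.0
    return completeness, boundedness


# Four elements were wanted; the data has three of them plus two we did not ask for.
expected = {"title", "creator", "date", "subject"}
present = {"title", "creator", "date", "internal_note", "tmp"}
print(completeness_and_boundedness(expected, present))  # (0.75, 0.6)
```

As with precision and recall, the two numbers pull against each other: publishing everything available tends to raise completeness while lowering boundedness.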

For us, there is a distinctly different feel to the examples that Dodds, Flemming, and others have used to illustrate their criteria; they seem to be looking at a set of phenomena that has substantial overlap with ours, but is not quite the same. Part of it is simply the fact, mentioned earlier, that data publishers in distinct domains have distinct biases. For example, those who can’t fully believe in objectivity are forced to put greater emphasis on provenance. Others who are not publishing descriptive data that relies on human judgment feel they can rely on more “objective” assessment methods. But the biggest difference in the “new quality” is that it puts a great deal of emphasis on technical quality in the construction of the data model, and much less on how well the data that populates the model describes real things in the real world.

There are three reasons for that. The first has to do with the nature of the discussion itself. All quality discussions, simply as discussions, seem to neglect notions of factual accuracy because factual accuracy seems self-evidently a Good Thing; there’s not much to talk about. Second, the people discussing quality in the LOD world are modelers first, and so quality is seen as adhering primarily to the model itself. Finally, the world of the Semantic Web rests on the assumption that “anyone can say anything about anything”. For some, the egalitarian interpretation of that statement reaches the level of religion, making it very difficult to measure quality by judging whether something is factual or not; from a purist’s perspective, it’s opinions all the way down. There is, then, a tendency to rely on formalisms and modeling technique to hold back the tide.

In 2004, we suggested a set of metadata-quality indicators suitable for managers to use in assessing projects and datasets. An updated version of that table would look like this:

Quality Measure	Quality Criteria
Completeness	Does the element set completely describe the objects? Are all relevant elements used for each object? Does the data contain everything you expect? Does the data contain only what you expect?
Provenance	Who is responsible for creating, extracting, or transforming the metadata? How was the metadata created or extracted? What transformations have been done on the data since its creation? Has a dedicated provenance vocabulary been used? Are there authenticity measures (e.g., digital signatures) in place?
Accuracy	Have accepted methods been used for creation or extraction? What has been done to ensure valid values and structure? Are default values appropriate, and have they been appropriately used? Are all properties and values valid/defined?
Conformance to expectations	Does metadata describe what it claims to? Does the data model describe what it claims to? Are controlled vocabularies aligned with audience characteristics and understanding of the objects? Are compromises documented and in line with community expectations?
Logical consistency and coherence	Is data in elements consistent throughout? How does it compare with other data within the community? Is the data model technically correct and well structured? Is the data model aligned with other models in the same domain? Is the model consistent in the direction of relations?
Timeliness	Is metadata regularly updated as the resources change? Are controlled vocabularies updated when relevant?
Accessibility	Is an appropriate element set for audience and community being used? Are the data and its access methods well documented, with exemplary queries and URIs? Do things have human-readable labels? Is it affordable to use and maintain? Does it permit further value-adds? Does it permit republication? Is attribution required if the data is redistributed? Are human- and machine-readable licenses available?
Accessibility (technical)	Are reliable, performant endpoints available? Will the provider guarantee service (e.g., via a service-level agreement)? Is the data available in bulk? Are URIs stable?

The differences in the example questions reflect the differences of approach that we discussed earlier. Also, the new approach separates criteria related to technical accessibility from questions that relate to intellectual accessibility. Indeed, we suspect that “accessibility” may have been too broad a notion in the first place. Wider deployment of metadata systems and a much greater, still-evolving variety of producer-consumer scenarios and relationships have created a need to break it down further. There are as many aspects to accessibility as there are types of barriers — economic, technical, and so on.

As before, our list is not a checklist or a set of must-haves, nor does it contain all the questions that might be asked. Rather, we intend it as a list of representative questions that might be asked when a new Linked Data source is under consideration. They are also questions that should inform policy discussion around the uses of Linked Data by consuming libraries and publishers.

That is work that can be formalized and taken further. One intriguing recent development is work toward a Data Quality Management Vocabulary. Its stated aims are to

support the expression of quality requirements in the same language, at web scale;
support the creation of consensual agreements about quality requirements;
increase transparency around quality requirements and measures;
enable checking for consistency among quality requirements; and
generally reduce the effort needed for data quality management activities.

The apparatus to be used is a formal representation of “quality-relevant” information. We imagine that the researchers in this area are looking forward to something like automated e-commerce in Linked Data, or at least a greater ability to do corpus-level quality assessment at a distance. Of course, “fitness-for-use” and other criteria that can really only be seen from the perspective of the user will remain important, and there will be interplay between standardized quality and performance measures (on the one hand) and audience-relevant features on the other. One is rather reminded of the interplay of technical specifications and “curb appeal” in choosing a new car. That would be an important development in a Semantic Web industry that has not completely settled on what a car is really supposed to be, let alone how to steer or where one might want to go with it.

Conclusion

Libraries have always been concerned with quality criteria in their work as creators of descriptive metadata. One of our purposes here has been to show how those criteria will evolve as libraries become publishers of Linked Data, as we believe that they must. That much seems fairly straightforward, and there are many processes and methods by which quality criteria can be embedded in the process of metadata creation and management.

More difficult, perhaps, is deciding how these criteria can be used to construct policies for Linked Data consumption. As we have said many times elsewhere, we believe that there are tremendous advantages and efficiencies that can be realized by linking to data and descriptions created by others, notably in connecting up information about the people and places that are mentioned in legislative information with outside information pools. That will require care and judgement, and quality criteria such as these will be the basis for those discussions. Not all of these criteria have matured — or ever will mature — to the point where hard-and-fast metrics exist. We are unlikely to ever see rigid checklists or contractual clauses with bullet-pointed performance targets, at least for many of the factors we have discussed here. Some of the new accessibility criteria might be the subject of service-level agreements or other mechanisms used in electronic publishing or database-access contracts. But the real use of these criteria is in assessments that will be made long before contracts are negotiated and signed. In that setting, these criteria are simply the lenses that help us know quality when we see it.

References

Bruce, Thomas R., and Diane Hillmann (2004). “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”. In Metadata in Practice, Hillmann and Westbrooks, eds. Online at http://www.ecommons.cornell.edu/handle/1813/7895
DCMI Metadata Provenance Task Group, at http://dublincore.org/groups/provenance/ .
Dodds, Leigh (2010) “Quality Indicators for Linked Data Datasets”. Online posting at http://answers.semanticweb.com/questions/1072/quality-indicators-for-linked-data-datasets .
Flemming, Annika (2010) “Quality Criteria for Linked Data Sources”. Online at http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Quality_Criteria_for_Linked_Data_sources&action=history
Fürber, Christian, and Martin Hepp (2011). “Towards a Vocabulary for Data Quality Management in Semantic Web Architectures”. Presentation at the First International Workshop on Linked Web Data Management, Uppsala, Sweden. Online at http://www.slideshare.net/cfuerber/towards-a-vocabulary-for-data-quality-management-in-semantic-web-architectures .
W3C, “Provenance Vocabulary Mappings”. At http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings .

Thomas R. Bruce is the Director of the Legal Information Institute at the Cornell Law School.

Diane Hillmann is a principal in Metadata Management Associates, and a long-time collaborator with the Legal Information Institute. She is currently a member of the Advisory Board for the Dublin Core Metadata Initiative (DCMI), and was co-chair of the DCMI/RDA Task Group.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Opening Up State Legal Data

Dec 08 2012

There has been a series of efforts to create a national legislative data standard – one master XML format to which all states will adhere for bills, laws, and regulations. Those efforts have gone poorly.

Few states provide bulk downloads of their laws. None provide APIs. Although nearly all states provide websites for people to read state laws, they are all objectively terrible, in ways that demonstrate that they were probably pretty impressive in 1995. Despite the clear need for improved online display of laws, the lack of a standard data format and the general lack of bulk data has enabled precious few efforts in the private sector. (Notably, there is Robb Schecter’s WebLaws.org, which provides vastly improved experiences for the laws of California, Oregon, and New York. There was also a site built experimentally by Ari Hershowitz that was used as a platform for last year’s California Laws Hackathon.)

A significant obstacle to prior efforts has been the perceived need to create a single standard, one that will accommodate the various textual legal structures that are employed throughout government. This is a significant practical hurdle on its own, but failure is all but guaranteed by also engaging major stakeholders and governments to establish a standard that will enjoy wide support and adoption.

What if we could stop letting the perfect be the enemy of the good? What if we ignore the needs of the outliers, and establish a “good enough” system, one that will at first simply work for most governments? And what if we completely skip the step of establishing a standard XML format? Wouldn’t that get us something, a thing superior to the nothing that we currently have?

The State Decoded
This is the philosophy behind The State Decoded. Funded by the John S. and James L. Knight Foundation, The State Decoded is a free, open source program to put legal codes online, and it does so by simply skipping over the problems that have hampered prior efforts. The project does not aspire to create any state law websites on its own but, instead, to provide the software to enable others to do so.

Still in its development (it’s at version 0.4), The State Decoded leaves it to each implementer to gather up the contents of the legal code in question and interface it with the program’s internal API. This could be done via screen-scraping off of an existing state code website, modifying the parser to deal with a bulk XML file, converting input data into the program’s simple XML import format, or by a few other methods. While a non-trivial task, it’s something that can be knocked out in an afternoon, thus avoiding the need to create a universal data format and to persuade Wexis to provide their data in that format.

The magic happens after the initial data import. The State Decoded takes that raw legal text and uses it to populate a complete, fully functional website for end-users to search and browse those laws. By packaging the Solr search engine and employing some basic textual analysis, every law is cross-referenced with other laws that cite it and laws that are textually similar. If there exists a repository of legal decisions for the jurisdiction in question, that can be incorporated, too, displaying a list of the court cases that cite each section. Definitions are detected, loaded into a dictionary, and make the laws self-documenting. End users can post comments to each law. Bulk downloads are created, letting people get a copy of the entire legal code, its structural elements, or the automatically assembled dictionary. And there’s a REST-ful, JSON-based API, ready to be used by third parties. All of this is done automatically, quickly, and seamlessly. The time elapsed varies, depending on server power and the length of the legal code, but it generally takes about twenty minutes from start to finish.
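As a rough illustration of the cross-referencing idea, here is a minimal Python sketch that builds a "cited by" index from the text of each section. The section numbers, the sample text, and the citation pattern (a "§" followed by a number such as 10.1-100) are all made up for the example; this is not The State Decoded's own code, which may go about it quite differently.

```python
import re
from collections import defaultdict

# Hypothetical sample data: section number -> text of the law.
SECTIONS = {
    "10.1-100": "The Board may issue permits.",
    "10.1-200": "Permits issued under § 10.1-100 expire after one year.",
    "10.1-300": "Violations of § 10.1-100 or § 10.1-200 are punishable by a fine.",
}

# Assumed citation format: "§" followed by a number such as 10.1-100.
CITATION = re.compile(r"§\s*(\d+(?:\.\d+)*-\d+(?:\.\d+)*)")


def build_cited_by(sections):
    """Map each section to the sorted list of other sections that cite it."""
    cited_by = defaultdict(set)
    for number, text in sections.items():
        for target in CITATION.findall(text):
            # Skip self-references and citations to sections we do not hold.
            if target != number and target in sections:
                cited_by[target].add(number)
    return {target: sorted(sources) for target, sources in cited_by.items()}


CITED_BY = build_cited_by(SECTIONS)
```

A real legal code would need a pattern tuned to that state's citation style, which is one more place where the state-by-state quirks discussed below come into play.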

The State Decoded is a free program, released under the GNU General Public License. Anybody can use it to make legal codes more accessible online. There are no strings attached.

It has already been deployed in two states, Virginia and Florida, despite not actually being a finished project yet.

State Variations
The striking variations in the structures of legal codes within the U.S. required the establishment of an appropriately flexible system to store and render those codes. Some legal codes are broad and shallow (e.g., Louisiana, Oklahoma), while others are narrow and deep (e.g., Connecticut, Delaware). Some list their sections by natural sort order, some in decimal, a few arbitrarily switch between the two. Many have quirks that will require further work to accommodate.
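To make the sort-order problem concrete, here is a small Python sketch of the two orderings. The section numbers are invented, and the two key functions are only one plausible way to express "natural" and "decimal" ordering; they are not drawn from The State Decoded.

```python
import re
from decimal import Decimal


def natural_key(section):
    """Compare runs of digits as whole numbers, so 1.10 sorts after 1.9."""
    pieces = re.findall(r"\d+|\D+", section)
    return [(0, int(p)) if p.isdigit() else (1, p) for p in pieces]


def decimal_key(section):
    """Read the section number as a decimal fraction, so 1.10 sorts before 1.2."""
    return Decimal(section)


SECTIONS = ["1.10", "1.2", "1.9"]
NATURAL = sorted(SECTIONS, key=natural_key)
DECIMAL = sorted(SECTIONS, key=decimal_key)
```

Here NATURAL comes out as 1.2, 1.9, 1.10, while DECIMAL comes out as 1.10, 1.2, 1.9. A code that switches between the two conventions cannot be ordered correctly by either function alone, which is why the ordering has to be configurable per jurisdiction.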

For example, California does not provide a catch line for their laws, but just a section number. One must read through a law to know what it actually does, rather than being able to glance at the title and get the general idea. Because this is a wildly impractical approach for a state code, the private sector has picked up the slack – Westlaw and LexisNexis each write their own titles for those laws, neatly solving the problem for those with the financial resources to pay for those companies’ offerings. To handle a problem like this, The State Decoded either needs to be able to display legal codes that lack section titles, or pointedly not support this inferior approach, and instead support the incorporation of third-party sources of title. In California, this might mean mining the section titles used internally by the California Law Revision Commission, and populating the section titles with those. (And then providing a bulk download of that data, allowing it to become a common standard for California’s section titles.)

Many state codes have oddities like this. The State Decoded combines flexibility with open source code to make it possible to deal with these quirks on a case-by-case basis. The alternative approach is too convoluted and quixotic to consider.

Regulations
There is strong interest in seeing this software adapted to handle regulations, especially from cash-strapped state governments looking to modernize their regulatory delivery process. Although this process is still in an early stage, it looks like rather few modifications will be required to support the storage and display of regulations within The State Decoded.

More significant modifications would be needed to integrate registers of regulations, but the substantial public benefits that would provide make it an obvious and necessary enhancement. The present process required to identify the latest version of a regulation is the stuff of parody. To select a state at random, here are the instructions provided on Kansas’s website:

To find the latest version of a regulation online, a person should first check the table of contents in the most current Kansas Register, then the Index to Regulations in the most current Kansas Register, then the current K.A.R. Supplement, then the Kansas Administrative Regulations. If the regulation is found at any of these sequential steps, stop and consider that version the most recent.

If Kansas has electronic versions of all this data, it seems almost punitive not to put it all in one place, rather than forcing people to look in four places. It seems self-evident that the current Kansas Register, the Index to Regulations, the K.A.R. Supplement, and the Kansas Administrative Regulations should have APIs, with a common API atop all four, which would make it trivial to present somebody with the current version of a regulation with a single request. By indexing registers of regulations in the manner that The State Decoded indexes court opinions, it would at least be possible to show people all activity around a given regulation, if not simply show them the present version of it, since surely that is all that most people want.
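A common layer over the four sources need not be complicated. The Python sketch below follows the order given in the Kansas instructions and returns the first version it finds. The regulation numbers and texts are invented, and the dictionaries merely stand in for four APIs that, as far as this article knows, do not exist.

```python
# Sources in the order given by the Kansas instructions, most current first.
SOURCE_ORDER = [
    "Kansas Register table of contents",
    "Kansas Register Index to Regulations",
    "K.A.R. Supplement",
    "Kansas Administrative Regulations",
]


def latest_version(regulation, sources):
    """Check each source in order and return the first version found.

    `sources` maps a source name to a dict of regulation number -> text.
    Returns (source_name, text), or None if no source has the regulation.
    """
    for name in SOURCE_ORDER:
        text = sources.get(name, {}).get(regulation)
        if text is not None:
            return name, text
    return None


# Made-up sample data standing in for four hypothetical APIs.
SAMPLE_SOURCES = {
    "Kansas Register table of contents": {},
    "Kansas Register Index to Regulations": {"1-2-3": "Text as amended this year."},
    "K.A.R. Supplement": {"1-2-3": "Text from the supplement."},
    "Kansas Administrative Regulations": {
        "1-2-3": "Bound-volume text.",
        "7-8-9": "Bound-volume text.",
    },
}

FOUND = latest_version("1-2-3", SAMPLE_SOURCES)
```

With that in place, a single call does what the instructions ask a person to do by hand in four steps.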

A Tapestry of Data
In a way, what makes The State Decoded interesting is not anything that it actually does, but instead what others might do with the data that it emits. By capitalizing on the program’s API and healthy collection of bulk downloads, clever individuals will surely devise uses for state legal data that cannot presently be envisioned.

The structural value of state laws is evident when considered within the context of other open government data.

Major open government efforts are confined largely to the upper-right quadrant of this diagram – those matters concerned with elections and legislation. There is also some excellent work being done in opening up access to court rulings, indexing scholarly publications, and nascent work in indexing the official opinions of attorneys general. But the latter group cannot be connected to the former group without opening up access to state laws. Courts do not make rulings about bills, of course – it is laws with which they concern themselves. Law journals cite far more laws than they do bills. To weave a seamless tapestry of data that connects court decisions to state laws to legislation to election results to campaign contributions, it is necessary to have a source of rich data about state laws. The State Decoded aims to provide that data.

Next Steps
The most important next step for The State Decoded is to complete it, releasing a version 1.0 of the software. It has dozens of outstanding issues – both bug fixes and new features – so this process will require some months. In that period, the project will continue to work with individuals and organizations in states throughout the nation who are interested in deploying The State Decoded to help them get started.

Ideally, The State Decoded will be obviated by states providing both bulk data and better websites for their codes and regulations. But in the current economic climate, neither is likely to be prioritized within state budgets, so unfortunately there’s liable to remain a need for the data provided by The State Decoded for some years to come. The day when it is rendered useless will be a good day.

Waldo Jaquith is a website developer with the Miller Center at the University of Virginia in Charlottesville, Virginia. He is a News Challenge Fellow with the John S. and James L. Knight Foundation and runs Richmond Sunlight, an open legislative service for Virginia. Jaquith previously worked for the White House Office of Science and Technology Policy, for which he developed Ethics.gov, and is now a member of the White House Open Data Working Group.
[Editor’s Note: For topic-related VoxPopuLII posts please see: Ari Hershowitz & Grant Vergottini, Standardizing the World’s Legal Information – One Hackathon At a Time; Courtney Minick, Universal Citation for State Codes; John Sheridan, Legislation.gov.uk; and Robb Schecter, The Recipe for Better Legal Information Services. ]

Digital Law: What Lawyers Need to Learn from Accountants

Nov 16 2012

In 1494, Luca Pacioli published “Particularis de Computis et Scripturis,” which is widely regarded as the first written treatise on bookkeeping. In the 500+ years since that event, we have become completely accustomed to the concepts of ledgers, journals and double-entry bookkeeping. Like all profound ideas, the concept of a transaction ledger nowadays seems to be completely natural, as if it always existed as some sort of natural law.

Whenever there is a need for detailed, defensible records of how a financial state of affairs (such as a company balance sheet or a profit and loss statement) came to be, we employ Pacioli’s concepts without even thinking about them any more. Of course you need ledgers of transactions as the raw material from which to derive the financial state of affairs at whatever chosen point in time is of interest. How else could you possibly do it?

Back in Pacioli’s day, there was nothing convenient about ledgers. After all, back then, all ledger entries had to be painstakingly made by hand into paper volumes. Care was needed to pair up the debits and credits. Ledger page totals and sub-totals had to be checked and re-checked. Very labor intensive stuff. But then computers came along and lo! all the benefits of ledgers in terms of the rigorous audit trail could be enjoyed without all the hard labor.

Doubtless, somewhere along the line in the early days of the computerization of financial ledgers, it occurred to somebody that ledger entries need not be immutable. That is to say, there is no technical reason to carry forward the “limitation” that pen and ink imposes on ledger writers, that an entry – once made – cannot be changed without leaving marks on the page that evidence the change. Indeed, bookkeeping has long had the concept of a “contra-entry” to handle the immutability of pen and ink. For example, if a debit of a dollar is made to a ledger by mistake, then another ledger entry is made – this time a credit – for a dollar to counter-balance the mistake while preserving the completeness of the audit-trail.

Far from being a limitation of the paper-centric world, the concept of an “append-only” ledger turns out, in my opinion, to be the key to the trustworthiness and transparency of financial statements. Accountants and auditors can take different approaches to how information from the ledgers is grouped/treated, but the ledgers are the ledgers are the ledgers. Any doubt that the various summations accurately reflect the ledgers can readily be checked.
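For readers who think in code, here is a minimal Python sketch of an append-only ledger in which a mistake is corrected by a contra-entry rather than by altering history. The class and method names are my own inventions for illustration, not those of any real accounting package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    amount: int  # whole dollars: positive is a debit, negative is a credit
    memo: str


class Ledger:
    """An append-only list of entries; nothing is ever edited or deleted."""

    def __init__(self):
        self._entries = []

    def post(self, account, amount, memo):
        entry = Entry(account, amount, memo)
        self._entries.append(entry)
        return entry

    def reverse(self, entry):
        """Correct a mistake with a contra-entry instead of changing history."""
        return self.post(entry.account, -entry.amount, "contra: " + entry.memo)

    def balance(self, account):
        return sum(e.amount for e in self._entries if e.account == account)

    def entries(self):
        return tuple(self._entries)  # read-only copy of the audit trail


ledger = Ledger()
ledger.post("supplies", 5, "paper")
mistake = ledger.post("supplies", 1, "debit made in error")
ledger.reverse(mistake)
```

The balance is back to five dollars, yet all three entries remain on the record, so anybody checking the books can see both the mistake and its correction.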

Now let us turn to the world of law. Well, law is so much more complicated! Laws are not simple little numerical values that fit nicely into transaction rows either in paper ledgers or in database tables. True, but does it follow that the many benefits of the ledger-centric approach cannot be enjoyed in our modern day digital world where we do not have the paper-centric ledger limitations of fixed size lines to fit our information into? Is the world of legal corpus management really so different from the world of financial accounting?

What happens if we look at, say, legal corpus management in a prototypical U.S. legislature, from the perspective of an accountant? What would an accountant see? Well, there is this asset called the statute. That is the “opening balance” inventory of the business in accounting parlance. There is a time concept called a Biennium which is an “accounting period”. All changes to the statute that happen in the accounting period are recorded in the form of bills. Bills are basically accounting transactions. The bills are accumulated into a form of ledger typically known as Session Laws. At the end of the accounting period – the Biennium – changes to the statute are rolled forward from the Session Laws into the statute. In accounting parlance, this is the period-end accounting culminating in a new set of opening balances (statute), for the start of the next Biennium. At the start of the Biennium, all the ledger transactions are archived off and a fresh set of ledgers is created; that is, bill numbers/session law numbers are reset, the active Biennium name changes etc.

I could go on and on extending the analogy (chamber journals are analogous to board of directors meeting minutes; bill status reporting is analogous to management accounting, etc.) but you get the idea. Legal corpus management in a legislature can be conceptualized in accounting terms. Is it useful to do so? I would argue that it is incredibly useful to do so. Thanks to computerization, we do not have to limit the application of Luca Pacioli’s brilliant insight to things that fit neatly into little rows of boxes in paper ledgers. We can treat bills as transactions and record them architecturally as 21st century digital ledger transactions. We can manage statute as a “balance” to be carried forward to the next Biennium. We can treat engrossments of bills and statute alike as forms of trial balance generation and so on.
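The analogy can be sketched in a few lines of Python. In this deliberately simplified model, which is an assumption of mine and not a description of any production system, a bill replaces or repeals exactly one statute section, the session laws are an ordered list of such bills, and the period-end roll-forward produces the opening statute for the next Biennium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bill:
    number: int
    section: str
    new_text: Optional[str]  # None means the section is repealed


def roll_forward(statute, session_laws):
    """Apply enacted bills, in order, to the opening statute.

    Returns the closing statute, which becomes the next opening balance.
    """
    closing = dict(statute)  # the opening balance itself is left untouched
    for bill in session_laws:
        if bill.new_text is None:
            closing.pop(bill.section, None)
        else:
            closing[bill.section] = bill.new_text
    return closing


# Made-up opening balance and transactions for one Biennium.
OPENING = {"1-101": "Original text.", "1-102": "To be repealed."}
SESSION_LAWS = [
    Bill(1, "1-101", "Amended text."),
    Bill(2, "1-102", None),
    Bill(3, "1-103", "New section."),
]
CLOSING = roll_forward(OPENING, SESSION_LAWS)
```

Real bills amend text at a much finer grain than a whole section, but the shape is the same: opening balance plus ordered transactions yields closing balance, and both the opening balance and the transactions are kept.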

Now I am not for a moment suggesting that a digital legislative architecture be based on any existing accounting system. What I am saying is that the concepts that make up an accounting system can – and I would argue should – be used. A range of compelling benefits accrue from this. A tremendous amount of the back-office work that goes on in many legislatures can be traced back to work-in-progress (WIP) reporting and period-end accounting of what is happening with the legal corpus. Everything from tracking bill status to the engrossment of committee reports becomes significantly easier once all the transactions are recorded in legislative ledgers. The ledgers then become the master repository from which all reports are generated. The reduction in overall IT moving parts, reduction in human effort, reduction in latency and the increase in information consistency that can be achieved by doing this are striking.

For many hundreds of years we have had ledger-based accounting. For hundreds of years the courts have taken the view that, for example, a company cannot simply announce a Gross Revenue figure to tax officials or to investors, without having the transaction ledgers to back it up. Isn’t it interesting that we do not do the same for the legal corpus? We have all sorts of publishers in the legal world, from public bodies to private sector, who produce legislative outputs that we have to trust because we do not have any convenient form of access to the transaction ledgers. Somewhere along the line, we seem to have convinced ourselves that the level of rigorous audit trail routinely applied in financial accounting cannot be applied to law. This is simply not true.

We can and should fix that. The prize is great, the need is great and the time is now. The reason the time is now is that all around us, I see institutions that are ceasing to produce paper copies of critical legal materials in the interests of saving costs and streamlining workflows. I am all in favour of both of these goals, but I am concerned that many of the legal institutions going fully paperless today are doing so without implementing a ledger-based approach to legal corpus management. Without that, the paper versions of everything from registers to regulations to session laws to chamber journals to statute books – for all their flaws – are the nearest thing to an immutable set of ledgers that exist. Take away what little audit trail we have and replace it with a rolling corpus of born digital documents without a comprehensive audit trail of who changed what and when? Not good.

Once an enterprise-level ledger-based approach is utilised, another great prize can be readily won; namely, the creation of a fully digital yet fully authenticated and authoritative corpus of law. To see why, let us step back into the shoes of the accountant for a moment. When computers came along and the financial paper ledgers were replaced with digital ledgers, the world of accounting did not find itself in a crisis concerning authenticity in the way the legal world has. Why so?

I would argue that the reason for this is that ledgers – Luca Pacioli’s great gift to the world – are the true source of authenticity for any artifact derived from the ledgers. Digital authenticity of balance sheets or Statute sections does not come from digital signatures or thumb-print readers or any of the modern high tech gadgetry of the IT security landscape. Authenticity comes from knowing that what you are looking at was mechanically and deterministically derived from a set of ledgers and that those ledgers are available for inspection. What do financial auditors do for a living? They check authenticity of financial statements. How do they do it? They do it by inspecting the ledgers. Why is authenticity of legal materials such a tough nut to crack? Because there are typically no ledgers!
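To show what "mechanically and deterministically derived" might look like in practice, here is a small Python sketch of an audit: replay the ledger over the opening text and compare the result with what a publisher has put out. The data and function names are invented, and a real audit would of course involve far richer transactions than whole-section replacements.

```python
def derive(opening, ledger):
    """Mechanically replay (section, new_text) amendments over the opening text."""
    statute = dict(opening)
    for section, new_text in ledger:
        statute[section] = new_text
    return statute


def audit(published, opening, ledger):
    """Return the sections whose published text differs from what the ledger yields."""
    derived = derive(opening, ledger)
    sections = set(published) | set(derived)
    return sorted(s for s in sections if published.get(s) != derived.get(s))


# Made-up opening statute, ledger, and two published versions.
OPENING = {"1": "A", "2": "B"}
LEDGER = [("1", "A, as amended")]
FAITHFUL = {"1": "A, as amended", "2": "B"}
STALE = {"1": "A", "2": "B"}

FAITHFUL_FINDINGS = audit(FAITHFUL, OPENING, LEDGER)
STALE_FINDINGS = audit(STALE, OPENING, LEDGER)
```

The faithful copy yields no findings; the stale copy is caught on section 1. No signature is involved: the check works because anybody holding the ledger can repeat the derivation.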

From time to time we hear an outburst of emotion about the state of the legal corpus. From time to time we hear how some off-the-shelf widget will fix the problem. Technology absolutely holds the solutions, but it can only work, in my opinion, when the problem of legal corpus management is conceptualized as a ledger-centric problem where we put manic focus on the audit trail. Then, and only then, can we put the legal corpus on a rigorous digital footing and move forward to a fully paperless world with confidence.

From time to time, we hear an outburst of enthusiasm to create standards for legal materials and solve our problems that way. I am all in favour of standards but we need to be smart about what we standardize. Finding common ground in the industry for representing legislative ledgers would be an excellent place to start, in my opinion.

Is this something that some standards body such as OASIS or NIEM might take on? I would hope so, and I am hopeful that it will happen at some point. Part of why I am hopeful is that I see an increasing recognition of the value of ledger-based approaches in the broader world of GRC (Governance, Risk and Compliance). For too long now, the world of law has existed on the periphery of the information sciences. It can, and should be, an exemplar of how a critical piece of societal infrastructure has fully embraced what it means to be “born digital”. We have known conceptually how to do it since 1494. The technology all exists today to make it happen. A number of examples already exist in production use in legislatures in Europe and in the USA. What is needed now, is for the idea to spread like wildfire the same way that Pacioli’s ideas spread like wildfire into the world of finance all those years ago.

Perhaps some day, when the ledger-centric approach to legal corpus management has removed doubts about authenticity/reliability, we will look back and think digital law was always done with ledgers, just as today we think that accounting was always done that way.

Sean McGrath is co-founder and CTO of Propylon, based in Lawrence, Kansas. He has 30 years of experience in the IT industry, most of it in the legal and regulatory publishing space. He holds a first class honors degree in Computer Science from Trinity College Dublin and served as an invited expert to the W3C special interest group that created the XML standard in 1996. He is the author of three books on markup languages published by Prentice Hall in the Dr Charles F. Goldfarb Series on Open Information Management. He is a regular speaker at industry conferences and runs a technology-oriented blog at http://seanmcgrath.blogspot.com.

From fuzzy systems and legal knowledge to cognition and commercialization

Oct 16 2012

Ending up in legal informatics was probably more or less inevitable for me, as I wanted to study both law and electrical engineering from early on, and I just hoped that the combination would start making some sense sooner or later. ICT law (which I still pursue sporadically) emerged as an obvious choice, but AI and law seemed a better fit for my inner engineer.

The topic for my (still ongoing-ish) doctoral project just sort of emerged. As I read through the books recommended by my master’s thesis supervisor (professor Peter Blume in Copenhagen), a sentence in Cecilia Magnusson Sjöberg’s dissertation caught my eye: “According to Bench-Capon and Sergot, fuzzy logic is unsuitable for modelling vagueness in law.” (translation ar) Having had some previous experience with fuzzy control, I found this an interesting question to study in more detail. To me, vagueness and uncertainty did indeed seem like good places to use fuzzy logic, even in the legal domain.

After going through loads of relevant literature, I started looking for an example domain to do some experiments. The result was MOSONG, a fairly simple model of trademark similarity that used Type-2 fuzzy logic to represent both vagueness and uncertainty at the same time. Testing MOSONG yielded perfect results on the first validation set as well, which to me seemed more suspicious than positive. If the user/coder could decide the cases correctly without the help of the system, would it not affect the coding process as well? As a consequence I also started testing the system on a non-expert population (undergraduates, of course), and the performance started to conform better to my expectations.
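For readers unfamiliar with Type-2 fuzzy sets, here is a toy Python sketch of the interval variety, which is the simplest kind. It is emphatically not the MOSONG model: the 0-100 similarity score, the set name, and the breakpoints are all made up for illustration. The point is only that membership is graded (vagueness) and that the grade itself is given as a range rather than a single number (uncertainty).

```python
def ramp(x, start, full):
    """Membership rising linearly from 0 at `start` to 1 at `full`."""
    if x <= start:
        return 0.0
    if x >= full:
        return 1.0
    return (x - start) / (full - start)


def confusingly_similar(score):
    """Interval type-2 membership of a 0-100 similarity score.

    Returns (lower, upper): the membership grade is only known to lie
    somewhere between the two bounds.
    """
    lower = ramp(score, 50, 90)  # cautious reading of the evidence
    upper = ramp(score, 30, 70)  # generous reading of the evidence
    return lower, upper


GRADES = {score: confusingly_similar(score) for score in (20, 60, 95)}
```

A score of 60 gives the interval (0.25, 0.75): the marks are somewhat similar, and we are not sure how similar. At the extremes the interval collapses, since a score of 20 is clearly out and a score of 95 is clearly in.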

My original idea for the thesis was to look at different aspects of legal knowledge by building a few working prototypes like MOSONG and then explaining them in terms of established legal theory (the usual suspects, starting from Hart, Dworkin, and Ross). Testing MOSONG had, however, made me perhaps more attuned to the perspective of an extremely naive reasoner, certainly a closer match for an AI system than a trained professional. From this perspective I found conventional legal theory thoroughly lacking, and so I turned to the more general psychological literature on reasoning and decision-making. After all, there is considerable overlap between cognitive science and artificial intelligence as multidisciplinary ventures. Around this time, the planned title of my thesis also received a subtitle, thus becoming Fuzzy Systems and Legal Knowledge: Prolegomena to a Cognitive Theory of Law, and a planned monograph morphed into an article-based dissertation instead.

One particularly useful thing I found was the dual-process theory of cognition, on which I presented a paper at IVR-2011 just a couple of months before Daniel Kahneman’s Thinking, Fast and Slow came out and everyone started thinking they understood what System 1 and System 2 meant. In my opinion, the dual-process theory has important implications for AI and law, and also explains why it has struggled to create widespread systems of practical utility. Representing legal reasoning only in classically rational System 2 terms may be adequate for expert human reasoners (and simple prototype systems), but AI needs to represent the ecological rationality (as opposed to the cognitive biases) of System 1 as well, and to do this properly, different methods are needed, and on a different scale. Hello, Big Data!

In practice this means that the ultimate way to properly test one’s theories of legal reasoning computationally is through a full-scale R&D process of an AI system that hopefully does something useful. In an academic setting, doing the R part is no problem, but the D part is a different matter altogether, both because much of the work required can be fairly routine and too uninteresting from a publication standpoint, and because the muchness itself makes the project incompatible with normal levels of research funding. Instead, typically, an interested external recipient is required in order to get adequate funding. A relevant problem domain and a base of critical test users should also follow as a part of the bargain.

In the case of legal technology, the judiciary and the public administration are obvious potential recipients. Unfortunately, there are at least two major obstacles for this. One is attitudinal, as exemplified by the recent case of a Swedish candidate judge whose career path was cut short after creating a more usable IR system for case law on his own initiative. The other one is structural, with public sector software procurement in general in a state of crisis due to both a limited understanding of how to successfully develop software systems that result in efficiency rather than frustration, and the constraints of procurement law and associated practices which make such projects almost impossible to carry out successfully even if the required will and know-how were there.

The private sector is of course the other alternative. With law firms, the prevailing business model based on hourly billing offers no financial incentives for technological innovation, as most notably pointed out by Richard Susskind, and the attitudinal problems may not be all that different. Legal publishers are generally not much better, either. And overall, in large companies the organizational culture is usually geared towards an optimal execution of plans from above, making it too rigid to properly foster innovation, and for established small companies the required investment and the associated financial risk are too great.

So what is the solution? To all early-stage legal informatics researchers out there: Find yourselves a start-up! Either start one yourself (with a few other people with complementary skillsets) or find an existing one that is already trying to do something where your skills and knowledge should come in handy, maybe just on a consultancy basis. In the US, there are already over a hundred start-ups in the legal technology field. The number of start-ups doing intelligent legal technology (and European start-ups in the legal field in general) is already much smaller, so it should not be too difficult to gain a considerable advantage over the competition with the right idea and a solid implementation. I myself am fortunate enough to have found a way to leverage all the work I have done on MOSONG by co-founding Onomatics earlier this year.

This is not to say that just any idea, even one that is good enough to be the foundation for a doctoral thesis, will make for a successful business. This is indeed a common pitfall with the commercialization of academic research in general. Just starting with an existing idea, a prototype or even a complete system and then trying to find problems it could solve as it stands is a proven path to failure. If all you have is a hammer, all your problems start to look like nails. This is also very much the case with more sophisticated tools. A better approach is to first find a market need and then start working towards a marketable technological solution for it, using all one’s existing knowledge and technology whenever applicable, but without being constrained by them when other methods work better.

Testing one’s theories by seeing whether they can actually be used to solve real-world problems is the best way forward towards broader relevance for one’s own work. Doing so typically involves considerable amounts of work that is neither scientifically interesting nor economically justifiable in an academic context, but which all the same is necessary to see if things work as they should. Because of this, such real-world integration is more feasible when done on a commercial basis. In this lies a considerable risk for the findings of this type of applied research to remain entirely confidential and proprietary as trade secrets, rather than becoming published at least to some degree, thus fuelling future research also in the broader research community and not just the individual company. To avoid this, active cooperation between the industry and academia should be encouraged.

Anna Ronkainen is currently working as the Chief Scientist of Onomatics, Inc., a legal technology start-up of which she is a co-founder. Previously she has worked with language technology both commercially and academically for over fifteen years. She is a serial dropout with (somehow) an LL.M. from the University of Copenhagen, and she expects to defend her LL.D. thesis Fuzzy Systems and Legal Knowledge: Prolegomena to a Cognitive Theory of Law at the University of Helsinki during the 2013/14 academic year. She blogs at www.legalfuturology.com (with Anniina Huttunen) and blog.onomatics.com.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

Following the Law with Scout

Citizen participation in lawmaking, free access to law, Legislative information systems, Open Government Data, open source software

Sep 30, 2012

At my organization, the Sunlight Foundation, we follow the rules. I don’t just mean that we obey the law — we literally track the law from inception to enactment to enforcement. After all, we are a non-partisan advocacy group dedicated to increasing government transparency, so we have to do this if we mean to serve one of our main functions: creating and guarding good laws, and stopping or amending bad ones.

One of the laws we work to protect is the Freedom of Information Act. Last year, after a Supreme Court ruling provided Congress with motivation to broaden the FOIA’s exemption clauses, we wanted to catch any attempts to do this as soon as they were made. As many reading this blog will know, one powerful way to watch for changes to existing law is to look for mentions of where that law has been codified in the United States Code. In the case of the FOIA, it’s placed at 5 U.S.C. § 552. So, what we wanted was a system that would automatically sift through the full text of all legislation, as soon as it was introduced or revised, and email us if such a citation appeared.

With modern web technology, and the fact that the Government Printing Office publishes nearly every bill in Congress in XML, this was actually a fairly straightforward thing to build internally. In fact, it was so straightforward that the next question felt obvious: why not do this for more kinds of information, and make it available freely to the public?

That’s why we built Scout, our search and notification system for government action. Scout searches the bills and speeches of Congress, and every federal regulation as they’re drafted and proposed. Through the awe-tacular power of our Open States project, Scout also tracks legislation as it emerges in statehouses all over the country. It offers simple and advanced search operators, and any search can be turned into an email alert or an RSS feed. If your search turns up a bill worth following, you can subscribe to bill-specific alerts, like when a vote on it is coming up.

This has practical applications for, really, just about everyone. If you care about an issue, be it as an environmental activist, a hunting enthusiast, a high (or low) powered lawyer, or a government affairs director for a company – finding needles in the giant haystack of government is a vital function. Since launching, Scout’s been used by thousands of people from a wide variety of backgrounds, by professionals and ordinary citizens alike.

Search and notifications are simple stuff, but simple can be powerful. Soon after Scout was operational, our original FOIA exemption alerts, keyed to mentions of 5 U.S.C. § 552, tipped us off to a proposal that any information a government passed to the Food and Drug Administration be given blanket immunity to FOIA if the passing government requested it.

If that sounds crazily broad, that’s because it is, and when we in turn passed this information on to the public interest groups who’d helped negotiate the legislation, they too were shocked. As is so often the case, the bill had been negotiated for 18 months behind closed doors, the provision was inserted immediately and anonymously before formal introduction, and it was scheduled for a vote as soon as Senate processes would allow.

Because of Scout’s advance warning, there was just barely enough time to get the provision amended to something far narrower, through a unanimous floor vote hours before final passage. Without it, it’s entirely possible the provision would not have been noticed, much less changed.

This is the power of information; it’s why many newspapers, lobbying shops, law firms, and even government offices themselves pay good money for services like this. We believe everyone should have access to basic political intelligence, and are proud to offer something for free that levels the playing field even a little.

Of particular interest to the readers of this blog is that, since we understand the value of searching for legal citations, we’ve gone the extra mile to make US Code citation searches extra smart. If you search on Scout for a phrase that looks like a citation, such as “section 552 of title 5”, we’ll find and highlight that citation in any form, even if it’s worded differently or referencing a subsection (such as “5 U.S.C. 552(b)(3)”). If you’re curious about how we do this, check out our open source citation extraction engine – and feel free to help make it better!
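To give a rough sense of what matching a citation "in any form" involves, here is a minimal Python sketch. It is not Scout's citation extraction engine: the two patterns and the function name `find_usc_citations` are assumptions made for this example, and they cover only a small fraction of the forms real citations take.

```python
import re

# Two illustrative patterns; real citation forms are far more varied.
# Matches "5 U.S.C. 552(b)(3)" or "5 U.S.C. § 552".
USC_STYLE = re.compile(
    r"(?P<title>\d+)\s+U\.?S\.?C\.?\s+(?:§+\s*)?"
    r"(?P<section>\d+[a-z]?)(?P<subs>(?:\(\w+\))*)"
)
# Matches "section 552 of title 5" or "section 552(b)(3) of title 5".
PROSE_STYLE = re.compile(
    r"section\s+(?P<section>\d+[a-z]?)(?P<subs>(?:\(\w+\))*)"
    r"\s+of\s+title\s+(?P<title>\d+)",
    re.IGNORECASE,
)


def find_usc_citations(text):
    """Return (title, section, subsections) tuples for citations in text."""
    found = []
    for pattern in (USC_STYLE, PROSE_STYLE):
        for m in pattern.finditer(text):
            # "(b)(3)" becomes ("b", "3"); no subsections becomes ().
            subs = tuple(re.findall(r"\((\w+)\)", m.group("subs")))
            found.append((m.group("title"), m.group("section"), subs))
    return found
```

Because both phrasings reduce to the same (title, section) pair, a search for one can be made to match documents that use the other, and a subsection reference still matches its parent section.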

It’s worth emphasizing that all of this is possible because of publicly available government information. In 2012, our legislative branch (particularly GPO and the House Clerk) and executive branch (particularly the Federal Register) provide a wealth of foundational information, and in open, machine-readable formats. Our code for processing it and making it available in Scout is all public and open source.

Anyone reading this blog is probably familiar with how easily legal information, even when ostensibly in the public domain, can be held back from public access. The judicial branch is particularly badly afflicted by this, where access to legal documents and data is dominated by an oligopoly of pay services both official (PACER) and private-sector (Westlaw, LexisNexis).

It’s easy to argue that legal information is arcane and boring to the everyday person, and that the only people who actually understand the law work at a place with the money to buy access to it. It’s also easy to see that as it stands now, this is a self-fulfilling prophecy. If this information is worth this much money, services that gate it amplify the political privilege and advantage that money brings.

The Sunlight Foundation stands for the idea that when government information is made public, no matter how arcane, it opens the door for that information to be made accessible and compelling to a broader swathe of our democracy than any one of us imagines. We hope that through Scout, and other projects like Open States and Capitol Words, we’re demonstrating a few important reasons to believe that.

Eric Mill is a software developer and international program officer for the Sunlight Foundation. He works on a number of Sunlight’s applications and data services, including Scout and the Congress app for Android.


[Editor’s Note: For topic-related VoxPopuLII posts please see, among others: Nick Holmes, Accessible Law; Matt Baca & Olin Parker, Collaborative, Open Democracy with LexPop; and John Sheridan, Legislation.gov.uk.]

Standardizing the World’s Legislative Information—One hackathon at a time

Annotation of legal texts, Cross-language legal information retrieval, digital law, Electronic government, elegislation, Legal metadata, Legal XML, Open Government Data, Semantic annotation of legal texts, Standards

Sep 17, 2012

As guest bloggers to this site, we have been asked to write about big ideas. We’ll get to those. But first, a note about hackathons.

Could legal hackathons be like this one day?

Hackathons used to be the exclusive domain of soda-and-coffee-guzzling, pizza-eating, all-night hacking, highly competitive computer programmers. The result of such a hackathon is often supposed to be a cool app (like the forerunner of Twitter) that is even cooler because it was built in the compressed schedule of the event. More recently, hackathons have been popping up in a variety of places, with some unexpected contexts and sponsors including the U.S. House of Representatives, NASA, Brooklyn Law School, New York City government, and others. These events serve as a way to prove (or build) the sponsor’s tech credentials and to cross-fertilize policy and technology expertise. There has been some handwringing and thoughtful commentary about the expansion of “civic” hackathons and what sustainable outcomes they produce.

As co-organizers, with Karen Suhaka, Greg Wilson, Charles Belle and others, of two legislation-focused, hackathon-inspired events–the California Law Hackathon, and the International Legislation Un-hackathon–we can attest to their value in bringing engineers together with lawyers and policy folks. We can give some insights into the kinds of benefits these events have had in propelling efforts on legislative data standards, and some of the advances that have taken place in the development of these standards over the last year.

Big Idea: Legislative Data Standards

And now to the big idea: to represent all the world’s legislation in a standard structured data format. That’s actually two big ideas: (1) putting legislation into a structured data format, and (2) designing that format so that it is compatible with the wide variety of laws and legislative document types worldwide.

There are reasons for doing these things: First, introducing structured data to legislation can make it possible to search and analyze the law with greater precision and efficiency. And second, having a common standard can permit more comprehensive bill-tracking and comparison between jurisdictions.

California Bill with Metadata

It also can make it possible for legislatures with small (and shrinking) budgets to benefit from some of the same bill drafting software that is being developed for much larger jurisdictions. (Full disclosure: Xcential has developed such software for more than ten years, including the drafting platform used by the State of California.)

In the age of Google, these ideas may not seem so big; in fact, they are a subset of Google’s far-reaching mission. However, legislation is a corner of the world’s information that Google has not yet addressed in a systematic way. And as regular readers of this blog know, legislation presents its own hurdles, technical and bureaucratic (not necessarily in that order), that make this both an interesting and a challenging problem. One of the challenges is that the kind of people who generally work with data (we’ll call them engineers) and the kind of people who generally work with legislation (we’ll call them lawyers and policy folks) don’t often work on data and legislation together. One of us, a lawyer and policy type, has made this point graphically (and somewhat hyperbolically) in a Quora response to a question about whether version control software could be used for legislation. That question, and a subsequent discussion generated in response to a blogpost by software engineer Abe Voelker about version control for legislation, drew in many engineers and some lawyers and policy folks.

For software engineers who consider such things, it is very attractive to think about treating legal text as if it were software code; we could automatically highlight and cross-validate key terms, run test cases, automate redlining and version control, etc. It would be easy to see what the state of the law was at any particular point in time, and to trace the series of amendments that got us into the mess we’re in today. This desire is often expressed as “What if we had a Github for legislation?” On the other hand, people who work closely with legislation–researching it, drafting it or developing information systems to deal with it–tend to see the many places where the analogy between computer code and legal code breaks down. Legal texts have been shaped over hundreds of years by technologically conservative institutions, using print-based systems.
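To make the engineer's side of the analogy concrete, here is a minimal sketch of automated redlining using Python's standard `difflib` module. The function name `redline` and the two versions of the provision are invented for this example and are not drawn from any real bill.

```python
import difflib


def redline(old_text, new_text):
    """Return unified-diff lines showing how a provision changed."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="as introduced",
        tofile="as amended",
        lineterm="",
    ))


# Invented example text, not taken from any real legislation.
introduced = (
    "Sec. 1. The agency shall publish records.\n"
    "Sec. 2. This act takes effect immediately."
)
amended = (
    "Sec. 1. The agency shall publish records within 30 days.\n"
    "Sec. 2. This act takes effect immediately."
)

# Lines prefixed "-" were removed, "+" were added, " " are unchanged.
for line in redline(introduced, amended):
    print(line)
```

A line-based diff like this only works when the full text of each version is available as clean, plain text, which is exactly the assumption that tends not to hold for legislation.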

The full transformation of law to digital information is not going to happen overnight. While most law is already accessible in electronic format (often PDF), it is not encoded in a way that lets software engineers start using their favorite text-munching tools on it. One of us, an engineer, has described this as the difference between computerization and automation. The move toward better digital tools for automating legislative drafting and research tasks will require more dialogue and working exchanges between engineers and the lawyers and policy folks.

That brings us back to hackathons.

What is a Legislative Hackathon?

Recognizing the need to bring lawyers and engineers together in order to implement our big idea(s), and appreciating the valuable bandwagon that hackathons have become, we decided to jump onboard. The first event we organized, the California Law Hackathon, was hosted just over a year ago, in September 2011, in Berkeley at the offices of Maplight, and in Denver by Karen Suhaka’s team at BillTrack50. The event focused on building web-based visualization tools to track the timeline of amendments to California legislation, and to link particular amendments, through their legislative sponsors, to particular donors or interest groups. We were joined remotely by a number of international participants, including John Sheridan, head of e-services for legislation.gov.uk, and a fellow guest contributor to this blog. As one participant noted, we learned a great deal at the event, including the limits placed on us by the existing data. Neither the legislative record nor the donations databases are detailed enough to trace influence in politics in the way we hoped. This helped spark an interest in a more in-depth exploration of legislative data formats, and in particular how more and better metadata could be added to legislation.

That led to the International Legislation Un-hackathon, held simultaneously at UC Hastings, Stanford and Denver, with participants from the University of Bologna (Ravenna campus) and around the world. So assuming you can get engineers together with lawyers and policy folks, what do you do with them? We decided that we’d need a user-friendly tool that could be used to explore and add metadata to legislation from around the world. This could highlight a developing legislative XML standard, Akoma Ntoso (more about this standard soon), and give lawyer and policy types hands-on experience with the kinds of text and analysis tools that engineers take for granted.

Hacking With A Legislative Editor

So one of us (the engineer, naturally) started building a web-based editor for legislation, while the other (the lawyer, naturally) started organizing the next hackathon. Of course, thought the lawyer, it would just be a matter of time before all governments worldwide use such editors to draft their laws and regulations in a standard data format.

Legislative Editor at legalhacks.org

Advances in Legislative Data Standards Efforts

Akoma Ntoso

Akoma Ntoso (AkN) is a strong contender to be that format. Developed under the auspices of the UN Department of Economic and Social Affairs, AkN is an XML data structure that is meant to capture high-level forms and semantic ideas that are common to a broad variety of legal texts. OASIS, the folks who brought us the DocBook standards, among others, have convened a standards committee to create an official legislative data standard based on AkN. (More disclosure: the engineer is a member of this committee.) There’s just one problem. Few governments are using AkN to draft or store their legislation.
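As a rough illustration of why structured markup matters, the sketch below parses a hand-written fragment loosely modeled on AkN with Python's standard `xml.etree.ElementTree`. The fragment is an assumption made for this example: it omits the namespace and the metadata block that real AkN documents carry, so it is not schema-valid, and the helper name `list_sections` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment loosely modeled on Akoma Ntoso.
# It omits the namespace and the metadata block, so it is NOT schema-valid.
SAMPLE = """
<akomaNtoso>
  <bill>
    <body>
      <section id="sec1">
        <num>1.</num>
        <heading>Short title</heading>
        <content><p>This Act may be cited as the Example Act.</p></content>
      </section>
      <section id="sec2">
        <num>2.</num>
        <heading>Effective date</heading>
        <content><p>This Act takes effect on January 1.</p></content>
      </section>
    </body>
  </bill>
</akomaNtoso>
"""


def list_sections(xml_text):
    """Return (id, number, heading) for each section in the fragment."""
    root = ET.fromstring(xml_text.strip())
    return [
        (sec.get("id"), sec.findtext("num"), sec.findtext("heading"))
        for sec in root.iter("section")
    ]


for sec_id, num, heading in list_sections(SAMPLE):
    print(sec_id, num, heading)
```

Once sections, numbers and headings are explicit elements rather than typographic conventions, building a table of contents or comparing documents across jurisdictions becomes a few lines of code instead of a text-parsing project.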

AkN itself is fast evolving, and with more exposure to legal data structures from different jurisdictions, the OASIS committee will be able to adapt AkN to better model those structures.

We saw the International Legislation Un-hackathon as a venue to kick off this process. It was conceived with Charles Belle of UC Hastings, as part of the Legal Hacks initiative. The event was held simultaneously at UC Hastings, at Stanford, and in Denver. Jim Harper and Francis Avila of the Cato Institute came to the Hastings event. We also had many international participants. Key among them were Professors Monica Palmirani and Fabio Vitali of the University of Bologna, the architects and primary evangelists of AkN. Over the course of the day, participants learned about AkN and, importantly, got a chance to try it out, marking up documents of their choosing with the web editor. In the process, as expected, we found bugs in the software and bugs in the standard. We found structures in U.S. legislation that didn’t fit well with the existing AkN element set. We saw places where there was confusion in applying AkN’s data structures to documents. All of this information was collected to incorporate in the development of both the editor and AkN, underscoring again the importance of getting more practical exposure for both.

University of Bologna Summer School–Ravenna

And we are working to expand the venues for this kind of practical exposure to develop the AkN standard. Every September, the University of Bologna hosts the LEX Summer School in Ravenna, Italy. For them, it’s an opportunity to introduce Akoma Ntoso to new groups of students from around Europe and around the world. For the students, it’s an opportunity to learn about the application of XML to legislation, see the success various groups are having around the world, and to meet interesting new people having a passion for legal informatics. One of us, the engineer, who was a student two years ago, was invited to return last year to present a success story, and this year is returning once more to deliver a class in how to build and use the HTML5-based editor for drafting legislation in XML. For us, this is an opportunity to expose the editor to the European legal traditions in order for us to better understand how our editor must evolve to fulfill our vision of a unified standard around the world with common, highly adaptable, tools.

Chile National Library of Congress Browser-based editor

Another step toward adoption of legislative data standards is a project by Chile’s National Library of Congress (BCN in Spanish) called the “History of the Law” (Historia de la Ley). This ambitious project aims to bring together machine learning, a legislative editor and other features to mark up Chile’s legislative record and other legislative documents. The BCN has chosen Xcential’s browser-based editor, working with the AkN standard, to conduct the mark-up and correction after documents are passed through an automated parser. As with the hackathon, but on a larger scale, we are learning from experience the modifications that are needed to AkN, to make it work with Chile’s live documents. Excitingly, each mismatch we find between AkN and actual legislation can be fed back into the OASIS committee process, to make AkN able to handle a wider variety of real-world use cases.

Other Efforts and the Future of Legislative Data Standards

We see these steps as just the beginning. European governments are also flirting with legislative standards, and Karen Suhaka’s group at BillTrack50 has converted all U.S. bills from all states into a single standard XML format, demonstrating both that the technical hurdles can be overcome and many of the practical benefits of doing so. In focusing on the projects (and hackathons) we are most closely involved with, we have certainly left out a lot of the initiatives that are advancing legislative data standards around the world. That’s what the comments are for. Let us know your experience with Akoma Ntoso as a legislative standard, and what you’re doing or interested in doing with AkN or other legislative data standards worldwide.

Grant Vergottini (the engineer) is a founder of Xcential. He is a leading authority on applications of XML data to legislation. Prior to founding Xcential, Grant was the Director of Applications at Chrystal Software, a company dedicated to XML design and reporting software. Before Chrystal, Grant led the redesign of Homestore.com, and founded Genedax Design Automation, which developed innovative team and data management applications for electronics design. Bringing data structures and automation tools to the legislative drafting process parallels the work that Grant did earlier in his career at Mentor Graphics and the Boeing Company, where he participated in the transformation from manual drafting to CAD software. Mr. Vergottini holds a Bachelor of Science in Electrical Engineering from Cleveland State University, where he graduated Summa Cum Laude.

Ari Hershowitz (the lawyer) is a consultant at Xcential, and founder of Tabulaw. Tabulaw develops software for lawyers, including a web-based legal research and writing platform. Prior to Tabulaw, Ari worked to protect wildlife and habitats from Chile to Mexico as Director of the BioGems project for Latin America at the Natural Resources Defense Council. Ari has a law degree from Georgetown University Law Center, a Masters in Computation and Neural Systems from Caltech, and a Bachelors in Molecular Biophysics & Biochemistry from Yale College.


[Editor’s Note: For topic-related VoxPopuLII posts please see: Núria Casellas, Semantic Enhancement of legal information … Are we up for the challenge?; João Lima, et al., LexML Brazil Project; and Rinke Hoekstra, The MetaLex Document Server.]

Law as an app - technology in legal education

Applications, Disruptive legal technology, Electronic commerce, Innovation in legal technology, legal education, visualization

Sep 05, 2012

The growing usage of apps meant it was only a matter of time until they would find their way into legal education. Following up on a previously published article on LaaS – Law as a Service, this post discusses different ways that apps can be included in the law degree curriculum.

1 Changing legal education through the use of apps

There are different ways in which apps can be used in legal education in order to better prepare students for the legal profession. In this post we suggest three different possibilities for the usage of apps, reflecting different pedagogical styles and learning outcomes. What each of the suggestions has in common is to bring legal education closer to the real-life work of lawyers.

We begin by identifying aspects that we perceive legal education as covering poorly, in quality or in quantity, and build our suggestions for changing legal education around them. The aspects we view as lacking are: identifying and managing risks, the interaction between different areas of law, and proactive problem-based learning. To take each of these briefly in turn:

managing risks is something that practicing lawyers and other legal service professionals must do on a daily basis. Law is not only about applying legal rules but also about weighing options, estimating possible outcomes and deciding upon which risks to accept. Legal education has not traditionally included this in the curriculum, and students have arguably very little experience of such training in their studies.
interaction between different areas of law is often hard to incorporate in legal studies, which follow a block or module structure. Each course provides students with in-depth knowledge of that particular legal area. However, the interaction between such modules is lacking, with teachers often unaware of the content of preceding or succeeding courses. For students, a problem with this module structure can be that they forget the content of a course studied at an earlier stage in their education.
problem-based learning is generally encouraged and applied in legal education. However, most problem-based learning (PBL) is reactive, asking students to evaluate the legal consequences of a scenario that has already played out. Instead of training students purely in after-the-fact solutions, in other words “clearing up the legal mess,” PBL should be made more proactive, aiming to train students in identifying and counteracting problems before they arise. This can also be viewed as an implementation of the first aspect, managing risks.

In aiming to include these aspects in legal education, we view technology as playing an important role. Perhaps ideally, the whole legal education could be re-structured in order to include such practical aspects that reflect the current legal profession; however, such change is perhaps too complex and viewed as somewhat unnecessary by those who are able to make such changes – if it ain’t broke, don’t fix it, as the saying goes. An app is not necessarily the sole possible implementation method, but it serves as an example of how these aspects can relatively easily be brought within legal education.

The first teaching approach looks at legal aspects of apps themselves, where the apps are viewed as objects within law. Here students are provided with a set problem and are encouraged to consider how different areas of law may apply to the app in question, and how the various areas impact on each other. This approach implements both PBL and the interaction of different legal fields. Proactiveness may also be included by asking students to identify legal risks of the app, and how such risks could be reduced through the use of law.

The second approach brings together technology and law, and is as such a suitable suggestion for inclusion in a legal informatics course, or as part of more general jurisprudence. Students are given the task of developing a legal service app, and thus must implement law through a technological tool. Students must first identify a need for a service within an area of law of their choosing, and then develop an app which provides the service. This approach implements both PBL and proactiveness, and can also require students to consider both legal and technical risks.

The third approach aims to add value to the legal education as a whole, by making available an app to students to be used alongside teaching, complementing the existing education. Students are provided with the opportunity to test their knowledge, and combine different areas of study through interactive learning. Depending on the design of the app, this approach has the possibility of implementing all aspects: PBL, interaction of legal fields, and proactiveness or risk management.

2 Legal aspects of apps

Legal education in many countries around the world is set up as linear blocks of different legal fields and subject areas. As law is often divided into various sub-fields–such as private law, public law, administrative law, environmental law, or information technology law–it appears only natural to discuss and teach the subjects one by one. The amount of material to be learned by the student would otherwise be overwhelming. While in some countries, exams might encompass multiple fields of law, subjects are being taught in a consecutive order.

Though the pedagogical reasons for the linearity in legal education are convincing, some improvements are still possible. One idea that we would like to discuss here is an exercise on the legal aspects of apps, which intertwines different legal fields and challenges the students to analyze one particular phenomenon from various legal angles. We are not suggesting any particular forum for this exercise; the options stretch from traditional in-class seminars to online e-learning platforms to a mixture of the two, and the exercise could be included in law school curricula as either a compulsory or an elective module.

Apps and information communication technologies, in general, do not adhere to geographical, physical or time related boundaries. They inherently challenge the traditional legal system based on bricks and mortar. In this regard they are, therefore, well-suited for legal analysis.

Another reason to use apps as the object for analysis by students is their popularity among the younger (and older) generation and therefore the close relationship students have to them to start with. As an example, one can compare it to using Facebook when discussing privacy, as opposed to showing a large company’s employee database.

In order to reflect the real-life experience of the exercise even more, the students would be allocated a certain expert area. As at law firms, one student would be an expert in intellectual property rights, another in contract law, another in privacy, international law, consumer rights issues, etc.

The students would, from the perspective of their expert area, first investigate possible legal issues with a specific gaming app, for example. They would analyze the application of the rules and norms within their field and identify potential conflicts or loopholes within these rules. Their investigation would include testing the app itself, as well as looking at possible end-user agreements and other applicable contractual agreements between the user, the app store and the developer of the app.

The next step would be to identify and discuss possible overlaps, discrepancies and conflicts between the different areas of law in relation to the app. The exercise should result in a written and/or oral report of the different legal issues involved and solutions to potential conflicts between the law and the app.

Adding another layer of real-life scenario, each group could be asked to present their findings to an imaginary client who is the producer of the app. This simulation would allow the students not only to develop a legal analysis based on correlating fields of law but also to present the analysis to non-lawyers, translating legal jargon into understandable everyday language.

The exercise–analyzing an existing app–very much fits into the idea often conveyed in legal education that law is applied after an incident occurs. In order to add a level of proactivity, students could be asked to analyze an app under production, before it is launched. This would encourage more proactive thinking by the students, asking them to foresee potential conflicts and avoid them, rather than discussing legal issues after they have arisen.

While the exercise as such might not be a revolutionary idea, we think that the increased inclusion of such exercises in legal education would contribute to better preparation of students for their life as young lawyers.

3 Law’s implementation in apps

While the previous exercise fits well within the traditional legal education by asking students to deliver a legal analysis, a topic less discussed in undergraduate legal studies is how to employ technology for delivering law. With a few exceptions, students generally focus on analyzing the law rather than implementing law in technology.

Until several years ago legal analysis was the main business for lawyers, so legal education well reflected the profession. In the last few years, however, legal services delivered via and as technology have increased and opened up a new market for lawyers and legal professionals. This change should be reflected in legal education in order to prepare students for their future.

While the idea is not to replace lawyers with apps or software, an app or another technology could either help lawyers in their working tasks or deliver law as a valuable service for consumers, citizens, companies or organizations. Examples of such apps, both for lawyers and end-users, are mentioned in a post at iinek’s blog and Slaw; shorter lists can be found on iinek’s Delicious page and the iPad4lawyers blog.

In the exercise, students would look at law from a different perspective, i.e. how legal regulations affect the individual or organization. Going away from a linear text approach, students would have to translate law into a format that users or apps can read. In other words, law would have to suit the user/app, and not the other way around. Students would, therefore, have to go beyond text and translate rules into flowcharts, diagrams, mind maps and other visual tools in order for the app to be able to follow the law’s instructions.
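
As a minimal sketch of what "translating a rule into a format an app can follow" might look like, the Python snippet below encodes a flowchart as a small decision tree. The rule itself (a right to withdraw from an online purchase within 14 days, with an exception for custom-made goods) is a purely hypothetical illustration and is not taken from any particular statute; all class and function names are likewise assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Purchase:
    """Hypothetical facts an app might collect from the user."""
    bought_online: bool
    days_since_delivery: int
    is_custom_made: bool


@dataclass
class Outcome:
    """A leaf of the flowchart: the answer shown to the user."""
    text: str


@dataclass
class Question:
    """A decision box in the flowchart with a yes branch and a no branch."""
    label: str
    test: Callable[[Purchase], bool]
    if_yes: "Node"
    if_no: "Node"


Node = Union[Question, Outcome]


def decide(node: Node, facts: Purchase) -> Tuple[str, List[str]]:
    """Walk the flowchart and return the outcome plus the path taken."""
    path = []
    while isinstance(node, Question):
        answer = node.test(facts)
        path.append(f"{node.label} -> {'yes' if answer else 'no'}")
        node = node.if_yes if answer else node.if_no
    return node.text, path


# Illustrative rule only; the 14-day window is an assumption.
WITHDRAWAL_RULE = Question(
    "Was the item bought online?",
    lambda p: p.bought_online,
    if_yes=Question(
        "Was the item custom-made?",
        lambda p: p.is_custom_made,
        if_yes=Outcome("No right of withdrawal (custom-made goods)."),
        if_no=Question(
            "Is it within 14 days of delivery?",
            lambda p: p.days_since_delivery <= 14,
            if_yes=Outcome("May withdraw."),
            if_no=Outcome("Withdrawal period expired."),
        ),
    ),
    if_no=Outcome("Rule does not apply (not an online purchase)."),
)

outcome, path = decide(WITHDRAWAL_RULE, Purchase(True, 5, False))
print(outcome)
for step in path:
    print("  ", step)
```

Recording the path alongside the outcome is a deliberate choice: it lets the app show the user which questions led to the answer, which mirrors the flowchart students would draw before any code is written.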

Implementing legal rules into technology, therefore, not only encourages students to think proactively but it also motivates them to identify solutions for the application of the law and how rules could be transformed into practice. From a pedagogical point of view the exercise would allow the students to think about different aspects of law beyond the traditional case or contract. It would also encourage a wider viewpoint of law as a tool in society.

Again, how the exercise is included in the curriculum is a matter of taste. Technical assistance is of importance, in order for students to know what aspects to take into account and what schematics developers need in order to be able to create an app. The exercise could be set up as a competition (Georgetown Law School – Iron Tech Lawyer) with an expert jury consisting of practicing lawyers and developers.

4 Legal education as an app

Talking about legal education as an app can have different meanings. While legal apps (for lawyers and individuals) and educational apps are rather common these days, legal educational apps are not yet well developed.

Legal education, as mentioned, is traditionally taught in blocks or modules, with very few references and links between them. This setup clearly has its benefits, not least logistically: planning and studying become easier for teachers and students alike, and time limitations make it hard to implement an approach that draws connections between subjects. This is where we believe that technology has the potential to play an important role. Technology is not bound to physical classrooms and attendance requirements of students or teachers. It can be accessed at a time of the student’s choosing, without placing additional demands on instructors.

A legal education app could provide the key in aiding students to make connections between their study areas; it could be made to fit alongside a law degree, assuming a student’s knowledge in sync with their level of study, by including content from both current and past courses. The app would offer an easy way to implement an interactive, problem-based learning approach. It could provide additional content, quizzes, exercises, social media functions etc. complementing the education and enabling a holistic perspective.

Although no teacher-student relationship is required here, clearly pedagogical thinking would need to play a strong role so that a worthwhile learning environment for the individual could be created. Much time and effort would need to be invested in planning, and the application itself would need to be flexible to adjust to different study plans and so forth. Another issue is, of course, who would make the app. As curricula vary from law school to law school, and jurisdiction to jurisdiction, such an app is ideally built by those who know the curriculum. Such “in-house” expertise also means that potential bias from outside factors should be avoided.

Legal apps have already been introduced to help lawyers study for qualifying exams, e.g. BarMax. (These are often, however, still very topic-specific.) Implementing the same kind of thinking at the educational level would start to prepare students for their future workplace, allowing them to be better prepared for helping clients with real-world scenarios dealing with complex and interrelating legal issues. If students begin such thinking at the beginning of their legal studies, it becomes normal, arguably allowing for better educated graduates.

This last approach is perhaps a little future-oriented (although not as much as, for example, grading by technology), and it is of course not easy to implement at the university level; academics must work together with app developers to produce a tool of real value to students. However, even a slimmed-down version of such an app can be a tool for helping students prepare for exams, test their knowledge of legal areas, or simply make sure that they have understood concepts covered in teaching. Some examples of such implementations in legal education are shown here.

5 Conclusions

There is no doubt that apps are the future for legal services. To what extent they will be included in legal education is yet to be decided. Here we have shown three differing approaches that could help in this regard. Implementation of any or all of these would bring in aspects that are currently lacking in legal education.

Discussions on technology and legal education often focus on e-learning and online teaching environments. In our opinion, traditional offline exercises and their pedagogical value should not be underestimated, with technology offering an excellent platform as an object, tool or companion during legal education and life as a lawyer.

6 Sources

Christine Kirchberger & Pam Storr, LaaS – Law as a Service, Lov&Data Nr. 4/2011
Christine Kirchberger & Pam Storr, blog post Law as a Service?, Blawblaw, 30 March 2012
Christine Kirchberger, blog post Law as an app, 25 January 2012
Jason Wilson, I Am Now an App™, Slaw, 28 September 2011
John Markoff, Armies of Expensive Lawyers, Replaced by Cheaper Software, New York Times, 4 March 2011
Caroline Strevens, Christine Welch & Roger Welch; On-line legal services and the changing legal market: preparing law undergraduates for the future; The Law Teacher, 45:3, 2011, 328–347
Marc Bousquet, Robots Are Grading Your Papers!, The Chronicle of Higher Education, 18 April 2012

Christine Kirchberger is a doctoral candidate & lecturer in legal informatics at the Swedish Law and Informatics Research Institute (IRI). Her research focuses on legal information retrieval, the concept of legal information within the framework of the doctrine of legal sources and also examines the information-seeking behaviour of lawyers. Christine blogs at iinek.wordpress.com and can be found as @iinek on Twitter.

Pam Storr is a lecturer at the Swedish Law and Informatics Research Institute (IRI), and course director for the Master Programme in Law and Information Technology at Stockholm University. Her main areas of interest are within information technology and intellectual property law. Pam is the editor for IRI’s blog, Blawblaw, and can be found as @pamstorr on Twitter.

Open data in a librarian hat: What's your Number One?

free access to law, Law.gov, Open Government Data

Aug 15, 2012

When I started to write a post about THOMAS and its place in open government about three months ago, I was feeling apologetic. I was going to make a heavy-handed, literal comparison of the opening hours of the Law Library of Congress (one of the maintainers of THOMAS, and my former employer, whose views are not at all represented here) with open government. I planned to wax sympathetic on the history of THOMAS, and how little has changed since it was first built. But that post would not have added anything new to the #opengovdata conversation, or really mentioned data at all.

Just over one month ago, #freeTHOMAS reached a fever pitch surrounding the passage of H.R. 5882, the Legislative Branch Appropriations Act for FY2013. Just before passage, H.Rpt.511 directed official conversation on open legislative data for the coming fiscal year by saying, “let’s talk.” In a section about the Government Printing Office, the House Appropriations Committee expressed concerns about authentication and open legislative data, but called for a task force “composed of staff representatives of the Library of Congress, the Congressional Research Service, the Clerk of the House, the Government Printing Office, and such other congressional offices as may be necessary,” to look into the matter.

Opengovdata was disappointed in government. The tone of the House Report suggested that government had been dismissive of opengovdata–and in all fairness, others were beginning to be dismissive of opengovdata as well.

But, a clear classification problem was emerging. Inspired by Lawrence Lessig’s Freedom To Connect speech at the AFI in late May, I had a very librarian moment on organizational hierarchies.

#OpenGov is a really big tent.

People who want a more open environment for government to communicate with the governed want data, increased transparency, plain language legislation, open court filings, access to government funded research, silly walks, etc. Accountability through transparency often dominates the conversation, thanks to well-funded non-profits and high profile projects. However, there’s more to the movement. In that spirit…

Let’s be transparent about transparency.

When the goal of an open government project is legislative transparency through freely accessible data, let’s focus on that. When the goal is something else, let’s focus on that too. We hear about government accountability through data because the voices calling for it are loud. But data can do much more than bring about a more transparent lawmaking process.

In the words of a wise man, make transparency your Number Two.

If you haven’t had the chance to watch this aforementioned Freedom To Connect talk, and you’ve got half an hour to spare, I highly recommend it. The subject is community broadband, but it’s hard not to be inspired to frame other issues smarter, with transparency ever-present in the background.

Let’s focus on Number One.

If the THOMAS data, for example, were open right now, this instant, you couldn’t watch it on TV. You couldn’t read it on your Kindle. Its mere presence would not increase transparency. Someone would have to do something with it. Number One is the thing you do with the data to reach your own goal–and that goal might not be legislative transparency.

As a public law librarian to a broad constituency, my goals are different than those of a non-profit think tank, or a law firm, or a law school, or even a non-law public library. In a climate of doing more with less, of needing to show much return for little investment, we each have to frame specific, measurable, achievable Number Ones tailored to the needs of our institutions. Without these Number Ones (goals, mission statements, benchmarks, or whatever management word your organization uses), we flounder off-mission, lose focus, and potentially lose funding.

Librarians are foot soldiers for the First Amendment–we like open information, we place a high value on the freedom to know. However, we’re among the first to be cut in tight budget situations, and we’re all too familiar with the perils of asking for something that’s overly broad, or asking for something that you can’t show narrowly tailored value for later on.

With respect to open gov data: government accountability is not unimportant to me as a voter. However, as a law librarian, I need to focus on Number Ones with more specific, smaller-scale goals than transparency, that will create measurable outcomes, allowing me to show concrete value to my institution. The big picture of how information is available, and the relationship between the government and the governed is important, but it doesn’t always get you funding, and it can’t always answer the question of the patron in front of you.

What’s your Number One?
There’s plenty of data out there. What are you doing with it? How can you manipulate raw free resources into something good for your institution? There is much to be said for information for the sake of information. I can’t imagine needing to convince most library-types of that. That said, we library-types, we information professionals, we decision makers, and perhaps we citizens need to narrow open gov to make it work for us. Data is good, but a real-time interactive civics education program based on THOMAS data for K-12 students is better. Let transparency folks fight the good fight, and don’t forget their work. But while you’ve got your librarian hat on, focus on a Number One that works for you.

Meg Lulofs is an information professional at large, blogging at librarylulu.com, editing Pimsleur’s Checklists of Basic American Legal Publications, and making mischief. She earned a J.D. from the University of Baltimore, and a M.L.I.S. from Catholic University. She welcomes feedback at meglulofs@gmail.com. You can follow her on Twitter @librarylulu, or on Facebook at facebook.com/librarylulu.

[Editor’s Note] For topic-related VoxPopuLII posts please see: Nick Holmes, Accessible Law; John Sheridan, Legislation.gov.uk; David Moore, OpenGovernment.org: Researching U.S. State Legislation.

Making a legal dictionary for an indigenous language: the Legal Maori Dictionary

Cross-language legal information retrieval, legal dictionaries

Jul 10, 2012

In 2010, an interesting observation was made about the linguistic identity of the New Zealand state. The observer was the Waitangi Tribunal of New Zealand, a permanently appointed commission of inquiry tasked with investigating claims of Crown breaches of the Treaty of Waitangi that may have caused prejudice to Māori. Of course the Treaty itself was signed by two distinct parties in 1840: the British Crown, and the representatives of Māori tribal groupings. In 1840 the linguistic, ethnic, and cultural identity of each grouping was simply not in doubt. But over the years the British Crown has devolved or morphed into the Crown in right of New Zealand, British settlers became Pākehā New Zealanders, and the Māori themselves have also changed irrevocably. So the Tribunal’s observation was interesting:

Fundamentally, there is a need for a mindset shift away from the pervasive assumption that the Crown is Pākehā, English-speaking, and distinct from Māori rather than representative of them. Increasingly, in the twenty-first century, the Crown is also Māori. If the nation is to move forward, this reality must be grasped.

In short, the Crown, in right of New Zealand, is not only Māori, but must also be Māori speaking. In view of New Zealand’s bicultural (and bilingual) legal history, this is not as merely ‘aspirational’ as might be presumed.

In early 2013, a new dictionary will be published in New Zealand. This dictionary will be a bilingual Māori-English language dictionary. Nothing unusual about that; there are quite a few Māori dictionaries about. Nor is the fact that this particular dictionary is a legal dictionary particularly strange; the world is well served with those, even in regards to New Zealand legal English. The Legal Māori Dictionary is relatively unusual, however, for combining these two characteristics. There are, as yet, not many indigenous language legal dictionaries, or indigenous legal language projects around the world. Of course, there are some fascinating indigenous legal language projects, such as the rich searchable collection of native Hawaiian legal documents available through the Ka Huli Ao Digital Archives under the auspices of the Ka Huli Ao Center for Excellence in Native Hawai`ian Law. An extensive Irish Language Legal Terminology derived from the bilingual Acts of the Irish parliament has also been made publicly available. In Australia, some exciting work has been done with identifying legal glossaries in a number of aboriginal languages including Yolngu Matha and Murrinh-Patha from the Northern Territory. Not infrequently, such glossaries and terminologies are the result of dedicated workshops, often government funded, set up in order to create a functional lexicon for use in the state legal system by speakers of the target indigenous language, as in the case of the English-Inuktitut-French Legal Glossary released in 1997 by the Nunavut Translator/Interpreter program at Nunavut Arctic College. An earlier but similar project for the Navajo language was published in 1989 by the US District Court for the District of New Mexico, and is still made publicly available by the Judicial Branch of the Navajo Nation. 
A more recent example is the extensive Sámi legal terminology that has been worked up over recent years and made available online by translators and interpreters working on the translation of state legal documents into Sámi for Sámi-speaking populations of Norway, Finland, Denmark and Sweden.

So, we at the Legal Māori Project, and our Legal Māori Dictionary, are in good, if select, company. But every legal lexicography project has a unique whakapapa (genealogy) and characteristics that somehow reflect the lived histories of the people who belong to each language.

To briefly outline our whakapapa, then: the Legal Māori Project, established in 2008 in the Law Faculty of Victoria University of Wellington, seeks to achieve two primary aims:
• A long-term goal of normalizing the use of the Māori language within the New Zealand legal system and, ultimately, the public, civic sphere of New Zealand society. Māori must claim its place as an ordinary language of the enactment of state law, of government, administration, politics and the economy.
• A shorter-term aim of providing bilingual Māori speakers with a resource that can help them effectively and feasibly choose to use Māori rather than English in that legal system. Such ease of choice is critically important for effective language revitalisation.

The Legal Māori Project has received four years of public funding for our research from New Zealand’s Ministry of Science and Innovation. Rather than create a legal terminology from scratch, however, we thought it absolutely necessary to carry out a kind of textual excavation of the rich but mainly hidden Māori-language documents of New Zealand’s bilingual and bicultural legal history. We were aware that there are several thousand pages of publicly available, printed, Māori language documents discussing, applying, translating, critiquing and interpreting Western legal concepts. These documents are available, but sequestered in public repositories such as the Alexander Turnbull Library. In the face of such a rich treasure trove of texts, we considered our best approach was to be a corpus-based one. We would build a body of digitized Māori language texts that we could analyse to identify the kinds of words and phrases that Māori speakers and writers of the past 180 years had been using in those texts. By June 2011, the texts we found and, in crucial partnership with the New Zealand Electronic Text Centre, digitized, totaled 8 million word tokens, making this the largest purpose-built and structured corpus of Māori language texts known. The pre-1910 texts of the Legal Māori Corpus are publicly available for download, with the remainder of the texts to be made available by the end of this year. The Legal Māori Corpus contains printed texts of the following kinds of historical documents, most of which are also available online in the land title system. Some documents might be more accurately described as strategic documents issued by government departments in Māori, such as Māori language versions of statements of intent.

These documents taken as a whole provide an incredible opportunity to examine the evolution of an endangered language as it wrestles with the lexicon and conceptual world of the dominant language and that language’s culture. Therefore, the collated texts from the Corpus were examined to find how various words and phrases have been used to express Western legal ideas. Over the past two years we have been identifying those words and phrases: first, to come up with a useful lexicon of possible legal Māori terms, and then, to test and validate those lexicon terms in order to choose the terms that are now to form the base of the Legal Māori Dictionary itself. With the invaluable design, by Dave Moskovitz of ThinkTank Ltd, of an open-source, easy-to-use web-based text browser and dictionary writing system called Freelex, we are now compiling our dictionary entries.
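
To illustrate the general idea of a corpus-based approach, here is a minimal Python sketch that counts word tokens across a set of texts and lists each occurrence of a candidate term in context. It is not the Project's actual tooling (the text names Freelex for that); the tokenization rule and function names are assumptions made for this example. The two sample sentences and their source codes are the usage examples quoted in the draft taonga entry in this post.

```python
import re
from collections import Counter

# Runs of letters only (no digits or underscores); this keeps macronised
# vowels such as "ā" and "ō" inside the token. A simplifying assumption.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split a text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def term_frequencies(corpus):
    """Count every word token across all texts in the corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


def concordance(corpus, term, width=3):
    """Return (source_id, context) pairs showing `term` with up to
    `width` words on either side, so a lexicographer can inspect usage."""
    hits = []
    for source_id, text in corpus.items():
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == term:
                window = tokens[max(0, i - width): i + width + 1]
                hits.append((source_id, " ".join(window)))
    return hits


# Keys are the short source codes used in the sample dictionary entry.
corpus = {
    "S241886": "Ki te kitea kua kore te tangata e utu i ngā moni reti, "
               "e whai ture ana ki te hamene i a ia ki te muru i ōna taonga.",
    "S241891": "Kua kitea te nui haere o ngā mahi o te koroni i runga i te "
               "maha o ngā taonga e utautaina ana ki tāwahi.",
}

frequencies = term_frequencies(corpus)
print(frequencies["taonga"])
for source_id, context in concordance(corpus, "taonga"):
    print(source_id, "|", context)
```

A real corpus of millions of tokens would need more than raw counts (for instance, comparison against a general-language corpus to find terms that are distinctively legal), but frequency lists and concordance lines are the usual starting point for this kind of lexicography.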

As mentioned above, our purpose has always been to create a dictionary of Māori language terms to express Western legal concepts. Customary Māori legal language had been explored in-depth in other scholarship. For example, customary Māori legal concepts have been investigated by the FRST funded work undertaken by Te Mātāhauāriki Institute based at Waikato University in developing a compendium of customary Māori legal terms: Te Mātāpunenga. Choosing to focus on the expression of Western legal ideas in Māori, however, exposed us to the considerable risk that English meanings and concepts would drive the content of our dictionary. Indeed we expected such English conceptual dominance. However, the pilot stages and subsequent corpus-based work showed that Māori customary legal vocabulary had a far stronger presence in the terms we were identifying than had been expected. In fact, many of the words in te reo Māori (the Māori language) that have been used to describe traditional Māori legal concepts are also terms within legal Māori terminology, communicating Western legal ideas. (Some examples are mana, roughly glossed as ‘authority’; tikanga, or the ‘correct way of doing things’; and rangatiratanga which can be equated to ‘chieftainship’.) The Legal Māori Project then must reflect two very important aspects of legal Māori vocabulary: customary legal meaning and Western legal meaning. A core set of customary legal terms that had acquired further Western legal senses over the past 180 years could in fact be identified within the lexicon of legal terms that were being derived from the corpus itself. In view of this insight, we decided that the idea of identifying a finite set of core customary legal terms could form part of a methodology that would enable Māori ideas and Māori legal thinking, alongside Western legal thinking, to take centre stage in our dictionary generation and formatting. 
The methodology used by the Legal Māori Project team is one that therefore pays careful attention to both the Western and customary law aspects of a significant, identifiable core of traditional Māori law terms. The team identified that if customary legal and western legal aspects of core terms are accounted for in the selection, formatting, and organisation of the dictionary entries, English glosses and English ideas are less likely to subvert Māori ideas and the Māori language basis of the dictionary as a whole. To provide a practical example of how we attempted to incorporate such prioritization in the design of the Legal Māori Dictionary, the following draft entry for taonga might be useful. It comes from the sample dictionary released in June 2010.

	taonga
	The customary usage of taonga refers to property or anything highly prized. The giving and receiving of taonga was an important part of recording and maintaining reciprocal relationships between groups. @TM Taonga
	1n <cust> valued property [K]i te kitea kua kore te tangata e utu i ngā moni reti, e whai ture ana ki te hamene i a ia ki te muru i ōna taonga[.] @S241886 2n goods Kua kitea te nui haere o ngā mahi o te koroni i runga i te maha o ngā taonga e utautaina ana ki tāwahi[.] @S241891 ☼ Usually used in the context of personal property, but sometimes also used to refer to real property or goods traded on a commercial scale.

Many typical dictionary elements have been used in this draft entry. For example, distinct senses have been identified and numbered. The grammatical function of each sense is identified, as is the primary usage (here, taonga being primarily a customary term). The entry also includes a short English gloss for each sense and some further explanation in English of how the term is used in a technical way (preceded by ☼). Finally, the entry includes a usage example for each sense and short code references for each example, which will enable the user to find the original text. The opening sentence at the top of the entry will be shaded in its final printed form, and will thereby be a new addition to the formatting of our dictionary articles. We have labeled this feature the whakamaramatanga (‘clarification’) field, where a very brief explanation is given of the all-important customary context for the term with a reference to further reading for those readers wanting to find out more about the concept. The reference is to the Te Mātāpunenga compendium (to be published at roughly the same time as the Legal Māori Dictionary). These small additions to the traditional dictionary entry must be taken in conjunction with all the work carried out by the Legal Māori Project to date. Ultimately we hope that our experience in designing and producing our outputs, including the dictionary, might assist the designers of other specialist dictionaries or lexicons of indigenous languages to pay appropriate deference to the customary concepts of those languages, where possible and practicable.
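
The entry fields described above could be modelled roughly as follows. This is a hypothetical data model written for illustration, not the schema used by Freelex or by the Project; the class and field names are assumptions, and only the taonga content is taken from the draft entry itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    number: int
    part_of_speech: str       # grammatical function, e.g. "n"
    gloss: str                # short English gloss
    example: str              # usage example drawn from the corpus
    source_code: str          # short code pointing to the original text
    customary: bool = False   # True when the sense carries the <cust> label


@dataclass
class Entry:
    headword: str
    whakamaramatanga: str     # brief explanation of the customary context
    further_reading: str      # reference for readers wanting more detail
    senses: List[Sense] = field(default_factory=list)
    usage_note: Optional[str] = None  # technical note shown after the ☼ symbol

    def render(self) -> str:
        """Lay the entry out in roughly the order of the draft shown above."""
        lines = [self.headword,
                 f"{self.whakamaramatanga} @{self.further_reading}"]
        for s in self.senses:
            label = " <cust>" if s.customary else ""
            lines.append(f"{s.number}{s.part_of_speech}{label} {s.gloss} "
                         f"{s.example} @{s.source_code}")
        if self.usage_note:
            lines.append(f"☼ {self.usage_note}")
        return "\n".join(lines)


taonga = Entry(
    headword="taonga",
    whakamaramatanga=(
        "The customary usage of taonga refers to property or anything "
        "highly prized. The giving and receiving of taonga was an important "
        "part of recording and maintaining reciprocal relationships between "
        "groups."
    ),
    further_reading="TM Taonga",
    senses=[
        Sense(1, "n", "valued property",
              "[K]i te kitea kua kore te tangata e utu i ngā moni reti, e whai "
              "ture ana ki te hamene i a ia ki te muru i ōna taonga[.]",
              "S241886", customary=True),
        Sense(2, "n", "goods",
              "Kua kitea te nui haere o ngā mahi o te koroni i runga i te maha "
              "o ngā taonga e utautaina ana ki tāwahi[.]",
              "S241891"),
    ],
    usage_note=(
        "Usually used in the context of personal property, but sometimes also "
        "used to refer to real property or goods traded on a commercial scale."
    ),
)

print(taonga.render())
```

Placing the whakamaramatanga field on the entry rather than on an individual sense reflects the design choice described in the text: the customary context frames every sense of the headword, including those that express Western legal ideas.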

And, above all, just maybe our work will help Māori speakers to choose to use their own language in precisely those domains where they are simply not expected to, or in the view of some, supposed to. And when that happens, a Māori-speaking Crown doesn’t seem so difficult after all.

Thanks to Māori.org.nz for the Māori images used here.

After some years working in the New Zealand Department of Corrections and Māori broadcasting, Māmari completed an MA (Distinction) in Classical Studies, BA (Hons), and an LLB (Hons) at Victoria University. She then spent three and a half years at New Zealand’s largest law firm, Russell McVeagh, in Wellington, working in the Māori legal team in the Corporate Advisory Group. Māmari has been with the School of Law since January 2006 and, with Assistant Professor Mary Boyce of the University of Hawai’i, runs the Legal Māori Project. Her primary research interests are law and language, Māori and the New Zealand legal system, and social security law. Māmari is married to Maynard Gilgen and has two sons, Te Rangihuia (9) and Havelund (5), and a daughter, Jessica-Lee Ngātaiotehauauru, born in November 2009.
