Venture Capital and Peer Production

This blog entry focuses on the need for more and better software to reap the benefits of the legal information treasures available. As you’ll see, this turns out to be more complex than one may think.

For commercial software developers, it is surprisingly hard to stay radically innovative, especially when they are successful. To start with, software development itself is a risky undertaking. Despite five decades of research in managing this development process, projects frequently are late, over budget, and much less impressive than originally envisioned. IBM once famously bet the company on a new computer platform, but the development of the associated operating system was so far behind schedule that it threatened IBM’s existence. Management was tempted to throw ever more human resources at the development problem, only to discover that this in itself causes further delays - leaving us with the useful term “mythical man-month”.

But the difficulty in envisioning hurdles in a complex software engineering project is not the only source of risk for innovative software developers. Successful developers may pride themselves on a large and increasing user base. Such success, however, creates its own unintended constraints.

Customers will dislike rapid change in the software they use, as they will have to relearn how to operate it, may have to expend efforts on converting data to new formats, and/or may need to adjust the preferences and customization options they utilized. This gets worse if the successful software is the platform for a thriving ecosystem of other developers and service providers. Any severe change in the underlying platform means that those living in it have to adapt their code. Each time a customer has to invest time in relearning a software product, it offers competing software providers a chance to nab a customer. This prompts software developers, especially very successful ones, to be relatively conservative in their plans for updates and upgrades. They don’t want to undermine their market success, and thus will be tempted to opt for gradual rather than radical innovation when designing the next version of their successful wares.

We have seen it over and over again: Microsoft’s Word, PowerPoint and Excel have gone through numerous iterations over the past decades, but the basic elements of the user experience have changed relatively little. Similarly, concerns for legacy code by third party developers have been a key holdback for Microsoft’s Windows product team. Don’t break something - even if it is utterly ancient and inefficient, buggy and broken - as long as it works for the customers. That’s the understandable, but frustrating, mantra.

Or think of Google: the search engine’s user interface hasn’t seen any major changes since its inception more than a decade ago. Only Apple, it seems, has been getting away with radical innovation that breaks things and forces users to relearn, to convert data, and to expend time. That is the advantage of a small but fervently loyal user base. But even Apple has recently seen the need to take a breather in radical change with Snow Leopard.

And in the legal information context, think of Westlaw and Lexis/Nexis.  Despite direct competition with one another,  when was the last time we saw a truly radical innovation coming from either of these two companies?

Radical innovation requires the will to risk alienating users. As companies grow and pay attention to shareholder expectations, the will-to-risk often wanes. With radical innovation in the marketplace, the challenge lies in the time axis. If one is very successful with a radically new product at time T, it is hard to throw that product away, and try to risk radically reinventing it, for T+1.

On a macro level, we combat this conservative tendency against radical change by providing incentives for innovative entrepreneurs to develop and market competing offerings. If enough customers are unhappy with Excel, perhaps entrepreneurs with radically new and improved concepts of how to crunch and manage numbers in a structured way will seize the opportunity and develop a new killer app that they’ll pit against Excel. That’s enormously risky, but also offers the potential of very steep rewards. Angel investors and venture capitalists thrive on providing the lubricant (in the form of financial resources) for such high risk, high reward propositions. They flourish on the improbable. What they don’t like are “small ideas.”  (It happened to me, too, when I pitched innovative ideas to VCs; they thought my ideas had a very high likelihood of success, but not enough of a lever to reap massive returns. Obviously I was dismayed, but they were right: it is what we need if we want to incentivize radical innovation.)

This also implies, however, that for venture capital to work, markets need to be large enough to offer high rewards for risky ventures. If the market is not large enough, venture capital may not be available for a sufficient number of radical innovators to keep pushing the limit. Therefore, existing providers may survive for a long time with incremental innovations. Perhaps that is why Westlaw and Lexis are still around, even though they could fight the tendency toward piecemeal development if they wanted to.

Other large corporations, realizing the bias towards incremental innovation, have repeatedly resorted to radical steps to remedy the problem. They have established skunk works, departments that are largely disconnected from the rest of the company, freeing the members to try revolutionary rather than evolutionary solutions. Sometimes companies acquire a group of radically innovative engineers from the outside, to inject some fresh thinking into internal development processes that may have become too stale.

Peer production models, almost always based on an open source foundation, are not dependent on market success. (On the drivers of peer production see Yochai Benkler’s “The Wealth of Networks.”) They are not profit driven, and thus may put less pressure on the developers to abstain from radical change. Because Firefox does not have to win in the marketplace, its developers can, at least in theory, be bolder than their commercial counterparts.

Unfortunately, open-source peer produced software may also lose its appetite for radical innovation over time  -  not because of monetary incentives, but because of the collaborative structures utilized in the design process. If a large number of volunteering bug reporters, testers, and coders with vastly differing values and preferences work on a joint project, it is likely that development will revert towards a common denominator of what needs to be done, and thus be inherently gradual and evolutionary, rather than radical. Of course, a majority of participants may at rare moments get together and agree on a revolution – much like those in what then was a British colony in 1776.  But that is the brilliant exception to a rather boring rule.

Indecisiveness that stems from too small a common ground, however, is not the only danger. On the other end of the spectrum, communities and groups with too many ties among each other cause a mental alignment, or “group think,” that equally stifles radical innovation. Northwestern University professor Brian Uzzi has written eloquently about this problem. Finding the right sweet spot between the two extremes is what’s necessary, but in the absence of an outside mechanism that balance is difficult to achieve for open source peer-producing groups.

If we would like to remedy this situation, how could we offer incentives to peer producing communities to more often give radical rather than incremental innovation a try? What could be the mechanism that takes on the role of venture capitalists and skunk works in the peer production context?

It surely isn’t telling dissenters with a radically new idea to “fork out” of a project. That’s like asking a commercial design group to leave the company and try on their own, but without providing them with enough resources or incentives. Not a good idea if we want to make radical innovation – the experimentation with revolutionary rather than incremental ideas – easier, not harder.

But what is the venture capital/skunk works equivalent in the peer-producing world?

A few thoughts come to mind, but I invite you to add your ideas, because I may not be thinking radically enough.

(1) Users: Users, from large to small, could volunteer, perhaps through a website, to dedicate some modicum of their time to advancing an open source project not by contributing to its design, but by committing to being first adopters of more radical design solutions. One may imagine a website that helps link users (including law firms) willing to dedicate some “risk” to such riskier open source peer produced projects, perhaps on a sectoral basis. (Could this be yet another mission for the LII?)

(2) Designers: Quite a number of corporations and organizations explicitly support open source peer producing projects, mostly by dedicating some of their human resources to improving the code base. These organizations could, if they wanted to improve the capability of such projects to push for more radical innovation, set up incentives for employees to select riskier projects.

(3) Tools: The very tools used to organize peer production of software code already offer many techniques for managing a diverse array of contributors. These tools could be altered to evaluate a group’s level of diversity and willingness to take risks, based on the findings of social network theory. Such an approach would at least provide the community with a sense of its potential and propensity for radical innovation, and could help group organizers in influencing group composition and group dynamics. (Yes, this is “data.gov” and the government IT dashboards applied to this context.)

These are nothing more than a few ideas.  Many more are necessary to identify the best ones to implement. But given the rise and importance of peer production, and the constraints inherent in how it is organizing itself, the conversation about how to best provide incentives for radical innovation in the legal information context - and beyond - is one we must have.

[NB:  What do you all think?  How does this apply to the world of legal information, and to specialized software applications that support it — things like point-in-time legislative systems, specialized processing tools, and so on?  Comments please…. (the ed.)]

Viktor Mayer-Schönberger is Associate Professor of Public Policy and Director of the Information + Innovation Policy Research Centre at the LKY School of Public Policy / National University of Singapore. He is also a faculty affiliate of the Belfer Center for Science and International Affairs at Harvard University. He has published many books, most recently “Delete - The Virtue of Forgetting in the Digital Age.” He is a frequent public speaker and a sought-after expert for print and broadcast media worldwide. He is also on the boards of numerous foundations, think tanks and organizations focused on studying the foundations of the new economy, and advises governments, businesses and NGOs on new economy and information society issues. In his spare time, he likes to travel, go to the movies, and learn about architecture.

VoxPopuLII is edited by Judith Pratt.

Pushing the envelope: Innovation in legal search

To take the words of Walt Whitman, when it comes to improving legal information retrieval (IR), lawyers, legal librarians and informaticians are all to some extent “Both in and out of the game, and watching and wondering at it”. The reason is that each group holds only a piece of the solution to the puzzle, and as pointed out in an earlier post, they’re talking past each other.

In addition, there appears to be a conservative contingent in each group who actively hinder the kind of cooperation that could generate real progress: lawyers do not always take up technical advances when they are made available, thus discouraging further research, legal librarians cling to indexing when all modern search technologies use free-text search, and informaticians are frequently impatient with, and misunderstand, the needs of legal professionals.

What’s holding progress back?

At root, what has held legal IR back may be the lack of cross-training of professionals in law and informatics, although I’m impressed with the open-mindedness I observe at law and artificial intelligence conferences, and there are some who are breaking out of their comfort zone and neat disciplinary boundaries to address issues in legal informatics.

I recently came back from a visit to the National Institute of Informatics in Japan where I met Ken Satoh, a logician who, late in his professional career, has just graduated from law school. This is not just hard work. I believe it takes a great deal of character for a seasoned academic to maintain students’ respect when they are his seniors in a secondary degree. But the end result is worth it: a lab with an exemplary balance of lawyers and computer scientists, experience and enthusiasm, pulling side-by-side.

Still, I occasionally get the feeling we’re all hoping for some sort of miracle to deliver us from the current predicament posed by the explosion of legal information. Legal professionals hope to be saved by technical wizardry, and informaticians like myself are waiting for data provision, methodical legal feedback on system requirements and performance, and in some cases research funding. In other words, we all want someone other than ourselves to get the ball rolling.

The need to evaluate

Take, for example, the lack of large corpora for study, which is one of the biggest stumbling blocks in informatics. Both IR and natural language processing (NLP) currently thrive on experimentation with vast amounts of data, which is used in statistical processing. More data means better statistical estimates and fewer ‘guesses’ at relevant probabilities. Even commercial legal case retrieval systems, which give the appearance of being Boolean, use statistics and have done so for around 15 years. (They are based on inference networks that simulate Boolean retrieval with weighted indexing by relaxing the rigid conditional probability estimates for the Boolean operators ‘and’, ‘or’ and ‘not’. In this way, document ranking increasingly depends on the number of query constraints met.)
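To make the parenthetical above concrete, here is a minimal sketch of "soft" Boolean operators of the kind commonly described for inference-network retrieval: AND as a product of beliefs, OR as the complement of a product of complements, NOT as a complement. The per-term belief values, document names and query are invented for illustration, and the formulas actually used by commercial systems are not given in this post.

```python
from math import prod

def soft_and(beliefs):
    """Probabilistic AND: high only if every constraint is well supported."""
    return prod(beliefs)

def soft_or(beliefs):
    """Probabilistic OR: high if at least one constraint is well supported."""
    return 1.0 - prod(1.0 - b for b in beliefs)

def soft_not(belief):
    """Probabilistic NOT: complement of the belief."""
    return 1.0 - belief

# Hypothetical beliefs (0..1) that each term is a good descriptor of each document.
docs = {
    "doc_a": {"negligence": 0.9, "landlord": 0.8, "tenant": 0.7},
    "doc_b": {"negligence": 0.9, "landlord": 0.1, "tenant": 0.7},
    "doc_c": {"negligence": 0.1, "landlord": 0.1, "tenant": 0.1},
}

def score(term_beliefs):
    # Query: negligence AND (landlord OR tenant)
    return soft_and([
        term_beliefs["negligence"],
        soft_or([term_beliefs["landlord"], term_beliefs["tenant"]]),
    ])

# Documents are ranked rather than filtered: the more constraints a document
# satisfies, and the more strongly, the higher it appears.
ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
print(ranking)
```

With beliefs restricted to exactly 0 or 1 these operators reduce to ordinary Boolean logic, which is one way to see how a statistical system can be made to look Boolean from the outside.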

The problem is that to evaluate new techniques in IR (and thus improve the field), you need not only a corpus of documents to search but also a sample of legal queries and a list of all the relevant documents in response to those queries that exist in your corpus, perhaps even with some indication of how relevant they are. This is not easy to come by. In theory a lot of case data is publicly available, but accumulating and cleaning legal text downloaded from the internet, making it amenable to search, is nothing short of torturous. Then relevance judgments must be given by legal professionals, which is difficult given that we are talking about a community of people who charge their time by the hour.

Of course, the cooperation of commercial search providers, who own both the data and training sets with relevance judgments, would make everyone’s life much easier, but for obvious commercial reasons they keep their data to themselves.

To see the productive effects of a good data set we need only look at the research boom now occurring in e-discovery (discovery of electronic evidence, or DESI). In 2006 the TREC Legal Track, including a large evaluation corpus, was established in response to the number of trials requiring e-discovery: 75% of Fortune 500 company trials, with more than 90% of company information now stored electronically. This has generated so much interest that an annual DESI workshop has been held since 2007.

Qualitative evaluation of IR performance by legal professionals is an alternative to the quantitative evaluation usually applied in informatics. The development of new ways to visualize and browse results seems particularly well suited to this approach, where we want to know whether users perceive new interfaces to be genuine improvements. Considering the history of legal IR, qualitative evaluation may be as important as traditional IR evaluation metrics of precision and recall. (Precision is the number of relevant items retrieved out of the total number of items retrieved, and recall is the number of relevant items retrieved out of the total number of relevant items in a collection). However, it should not be the sole basis for evaluation.
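As a small worked example of the two metrics just defined, the sketch below computes precision and recall for a single query. The case identifiers and relevance judgments are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: items the system returned
    relevant:  items judged relevant in the whole collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: four cases returned, five relevant cases exist in the corpus.
retrieved = ["case1", "case2", "case3", "case4"]
relevant = ["case2", "case4", "case7", "case8", "case9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Note that recall can only be computed if someone has judged the whole collection for relevance, which is exactly why the test collections discussed above are so hard to build.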

A well-known study by Blair and Maron makes this point plain. The authors showed that expert legal researchers retrieved fewer than 20% of the relevant documents when they believed they had found over 75%. In other words, even experts can be very poor judges of retrieval performance.

Context in legal retrieval

Setting this aside, where do we go from here? Dan Dabney argued at the American Association of Law Libraries (AALL) 2005 Annual Meeting that free text search decontextualizes information, and he is right. One thing to notice about current methods in open domain IR, including vector space models, probabilistic models and language models, is that the only context they take into account is proximate terms (phrases). At heart, they treat all terms as independent.
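The independence assumption is easy to demonstrate. The following sketch uses a bare-bones vector space model (raw term counts and cosine similarity, with no weighting or phrase handling, which real systems would add) on two invented sentences that describe opposite situations.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Represent a text as term counts; word order is thrown away."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

s1 = "the tenant sued the landlord"
s2 = "the landlord sued the tenant"

similarity = cosine(bag_of_words(s1), bag_of_words(s2))
print(round(similarity, 6))  # 1.0: the model cannot tell who sued whom
```

Because both sentences contain exactly the same terms, the model treats them as identical, even though a lawyer would consider the difference decisive.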

However, it’s risky to conclude what was reported from the same meeting: “Using indexes improves accuracy, eliminates false positive results, and leads to completion in ways that full-text searching simply cannot.” I would be interested to know if this continues to be a general perception amongst legal librarians despite a more recent emphasis on innovating with technologies that don’t encroach upon the sacred ground of indexing. Perhaps there’s a misconception that capitalizing on full-text search methods would necessarily replace the use of index terms. This isn’t the case; inference networks used in commercial legal IR are not applied in the open domain, and one of their advantages is that they can incorporate any number of types of evidence.

Specifically, index numbers, terms, phrases, citations, topics and any other desired information are treated as representation nodes in a directed acyclic graph (the network). This graph is used to estimate the probability of a user’s information need being met given a document.
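As a rough illustration of how different types of evidence can feed one estimate, the sketch below flattens the network to two layers: representation nodes for a single document, and one node for the user's information need that combines them with a weighted sum. The node names, belief values and weights are all hypothetical, and a weighted sum is only one of several possible combination rules; nothing here describes a specific commercial system.

```python
# Belief (0..1) that each representation node is "on" for one document.
# Terms, phrases, index numbers and citations are all just nodes.
evidence = {
    "term:negligence": 0.8,
    "phrase:duty of care": 0.6,
    "index:topic-number-match": 1.0,
    "citation:cites-leading-case": 0.0,
}

# Hypothetical weights expressing how much the query relies on each node.
weights = {
    "term:negligence": 2.0,
    "phrase:duty of care": 1.0,
    "index:topic-number-match": 3.0,
    "citation:cites-leading-case": 1.0,
}

def belief_in_need(evidence, weights):
    """Weighted-sum combination: estimated belief that the document meets the need."""
    total = sum(weights.values())
    return sum(w * evidence.get(node, 0.0) for node, w in weights.items()) / total

belief = belief_in_need(evidence, weights)
print(round(belief, 3))  # 0.743
```

The point of the sketch is that an index number is not a rival to full-text evidence here; it is simply another input, and in this example the most heavily weighted one.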

For the time being lawyers, unaware of technology under the hood, default to using inference networks in a way that is familiar, via a search interface that easily incorporates index terms and looks like a Boolean search. (Inference nets are not Boolean but they can be made to behave in the same way.) While Boolean search does tend to be more precise than other methods, the more data there is to search the less well the system performs. Further, it’s not all about precision. Recall of relevant documents is also important and this can be a weak point for Boolean retrieval. Eliminating false positives is no accolade when true positives are eliminated at the same time.

Since the current predicament is an explosion of data, arguing for indexing by contrasting it with full-text retrieval without considering how they might work together seems counterproductive.

Perhaps instead we should be looking at revamping legal reliance on a Boolean-style interface so that we can make better use of full-text search. This will be difficult. Lawyers who are charged, and charge, per search, must be able to demonstrate the value of each search to clients; they can’t afford the iterative searching that characterizes open domain browsing. Further, if the intelligence is inside the retrieval system, rather than held by legal researchers in the form of knowledge about how to put complex queries together, how are search costs justified? Although Boolean queries are no longer well-adapted, at least value is easy to demonstrate. A push towards free-text search by either legal professionals or commercial search providers will demand a rethink of billing structures.

Given our current systems, are there incremental ways we can improve results from full-text search? Query expansion is a natural consideration and incidentally overlaps with much of the technology underlying graphical means of data exploration such as word clouds and wonderwheels; the difference is that query expansion goes on behind the scenes, whereas in graphical methods the user is allowed to control the process. Query expansion helps the user find terms they hadn’t thought of, but this doesn’t help with the decontextualization problem identified by Dabney; it simply adds more independent terms or phrases.
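A minimal sketch of behind-the-scenes query expansion follows, assuming a simple co-occurrence heuristic: add the terms that most often appear in documents that already match the query. The toy corpus and the function name are invented for illustration; production systems typically use more careful term weighting and stopword handling.

```python
from collections import Counter

# Hypothetical mini-corpus of case summaries.
corpus = [
    "landlord breached lease and tenant claimed damages",
    "tenant withheld rent after landlord ignored repairs",
    "court awarded damages for breach of contract",
    "lease required landlord to complete repairs",
]

def expand_query(query_terms, corpus, k=3):
    """Append the k terms that co-occur most often with the query terms."""
    query = set(query_terms)
    counts = Counter()
    for doc in corpus:
        tokens = set(doc.split())
        if tokens & query:                 # document matches the original query
            counts.update(tokens - query)  # count its other terms once each
    # Sort by descending count, then alphabetically, so ties are deterministic.
    best = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
    return list(query_terms) + [term for term, _ in best]

expanded = expand_query(["landlord"], corpus)
print(expanded)  # ['landlord', 'lease', 'repairs', 'tenant']
```

The expanded query is still just a longer bag of independent terms, which is why expansion helps with vocabulary the user had not thought of but does nothing about the decontextualization problem.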

In order to contextualize information we can marry search using text terms and index numbers, as is already done. Even better would be to do some linguistic analysis of a query to really narrow down the situations in which we want terms to appear. In this way we might get at questions such as “What happened in a case?” or “Why did it happen?” rather than just “What is this document about?”

Language processing and IR

Use of linguistic information in IR isn’t a novel idea. In the 1980s, IR researchers started to think about incorporating NLP as an intrinsic part of retrieval. Many of the early approaches attempted to use syntactic information for improving term indexing or weighting. For example, Fagan improved performance by applying syntactic rules to extract similar phrases from queries and documents and then using them for direct matching, but it was held that this was comparable to a less complex, and therefore preferable, statistical approach to language analysis. In fact, Fagan’s work demonstrated early on what is now generally accepted: statistical methods that do not assume any knowledge of word meaning or linguistic role are surprisingly (some would say depressingly) hard to beat for retrieval performance.

Since then there have been a number of attempts to incorporate NLP in IR, but depending on the task involved, there can be a lack of highly accurate methods for automatic linguistic analysis of text that are also robust enough to handle unexpected and complex language constructions. (There are exceptions; for example, part-of-speech tagging is highly accurate.) The result is that improved retrieval performance is often offset by negative effects, resulting in a minimal positive, or even a negative, impact on overall performance. This makes NLP techniques not worth the price of additional computational overheads in time and data storage.

However, just because the task isn’t easy doesn’t mean we should give up. Researchers, including myself, are looking afresh at ways to incorporate NLP into IR. This is being encouraged by organizations such as the NII Test Collection for IR Systems Project (NTCIR), which from 2003 to 2006 amassed excellent test and training data for patent retrieval with corpora in Japanese and English and queries in five languages. Their focus has recently shifted towards NLP tasks associated with retrieval, such as patent data mining and classification. Further, their corpora enable study of cross-language retrieval issues that become important in e-discovery since only a fraction of a global corporation’s electronic information will be in English.

We stand on the threshold of what will be a period of rapid innovation in legal search driven by the integration of existing knowledge bases with emerging full-text processing technologies. Let’s explore the options.

K. Tamsin Maxwell is a PhD candidate in informatics at the University of Edinburgh, focusing on information retrieval in law. She has an MSc in cognitive science and natural language engineering, and has given guest talks at the University of Bologna, UMass Amherst, and NAIST in Japan. Her areas of interest include text processing, machine learning, information retrieval, data mining and information extraction. Her passion outside academia is Arabic dance.

VoxPopuLII is edited by Judith Pratt.

If the mountain will not come to the prophet, the prophet will go to the mountain

Within the field of legal informatics, discussions often focus on the technical and methodological questions of access to legal information. The topics can range from classification of legal documents to conceptual retrieval methods and Automatic Detection of Argumentation in Legal Cases. Researchers and businesses try to increase both precision and recall in order to improve search results for lawyers, while public administrations open up the process of legislating for the benefit of democracy and openness. Where are, however, the benefits for laypersons not familiar with retrieving legal information? Does clustering of legal documents, for example, yield a legal text any more understandable for a citizen?

To answer these questions, I would like to go back to the beginning, the purpose of law. Unfortunately for us lawyers, law is not created for us, but to serve as the oil that keeps society running smoothly. One can imagine two scenarios for applying the oil. If the motor has not been taken care of sufficiently, some extra greasy oil might be necessary to get it running again (i.e. if all amicable solutions are exhausted, some sort of dispute resolution is required); this would be the retroactive approach. The other possible application is to add enough oil during driving, so that the engine continues running smoothly without any additional boost, in other words, trying to avoid disputes; this would be the proactive line of thinking.

How can proactive law work for the citizens? The basic assumption would be that in order to avoid disputes, one has to be aware of possible legal risks and how to prevent them. In line with the position of the European Union, we can further assume that the assessment and evaluation of risks requires relevant information about the legal facts at hand. It is only possible for a citizen to reach a decision regarding, for example, social benefits or certain rights as an employee, if she or he is aware of the various legal rights and obligations as well as possible legal outcomes.

Having stipulated that legal information is the core requirement for being able to exercise one’s rights as a citizen, the next questions would include which type of information is actually necessary, who should be responsible for communicating it and how it should be provided. These questions I would like to discuss below. That is, we will talk about why, what, who and how.

Why?

Before we move on to the main theme of access to legal information, I would like to highlight a few more things about the why. As already mentioned, and as many legal philosophers have noted, law is the clockwork that makes society tick. The principle Ignorantia juris neminem excusat (Ignorance of the law is no excuse) is commonly accepted as one of the foundations of modern civilization. But how would we define ignorance in today’s world? What if a citizen has trouble finding the necessary information despite endless efforts? What if she or he, after finding the relevant information, is not able to understand it? Does this mean she or he is still ignorant?

Public access to legal information is also a question of democracy, because citizens’ insight into politics, governmental work and the lawmaking process is a necessary prerequisite for public trust in the legislative body.

“In shifting from infrastructure to integration and then to transformation, a more holistic framework of connected governance is required. Such a framework recognizes the networking presence of e-government as both an internal driver of transformation within the public sector and an external driver of societal learning and collective adaptation for the jurisdiction as a whole.” (UN e-Government Survey 2008)

In this spirit, governments should treat the management of knowledge as increasingly important. “The essence of knowledge management (KM) is to provide strategies to get the right knowledge to the right people at the right time and in the right format.” (UN e-Government Survey 2008) What, then, is the right knowledge?

What?

The term legal information is as obvious as the word law. It is both apparent and imprecise, and yet we use it rather often. Several scholars have tried to define legal information and legal knowledge, inter alia, Peter Wahlgren in 1992, Erich Schweighofer in 1999, and Robert Richards in 2009.

If we consider the term from a layperson’s perspective, one could define it as the data, the facts and figures, that are necessary to solve an issue–one that cannot be handled amicably–between two persons (either legal or physical). In order for a layperson to be able to utilize legal information she or he has to be able to access, read, understand and apply the information.

The accessing element is one of the tasks that legal information institutes fulfill so elegantly. The term “reading” is here to be understood as information that can be grasped either with one’s eyes or ears. The complexity begins when it comes to understanding and applying the information. A layperson might have difficulties understanding and applying the Act on income tax even though the law is accessible and readable.

Is this information then still legal information if we assume that the word “information” means that somebody can receive certain signs and data and use this data meaningfully in order to increase her or his knowledge? “Knowledge and information […] influence in a reciprocal way. Information modifies knowledge and knowledge guides potential use of information.” (Schweighofer)

If a layperson does not understand the information provided by official sources, she or he might turn to other information sources, for example by utilizing a Google search. In this case, the question arises how reliable the retrieved information is, however comprehensible it may be. A high ranking in Google search does not automatically indicate high quality of the information, even though this might be a common misconception, especially among laypersons not trained in source criticism. Here the importance of providing citizens with some basic and comprehensible information becomes apparent.

This comprehensible information might include more than plain text-based legislation and court decisions. Of interest for the layperson (both in business-consumer as well as government-citizen situations) can furthermore be, inter alia,

  • additional requirements according to terms and conditions or specific procedural rules in public administrations
  • possible legal outcomes and necessary facts that lead to them
  • estimated time of delivery of the product or the decision
  • credibility of the business, including the number of pending cases before the courts or complaints before the consumer protection authorities.

For a citizen it might also be very significant to know how she or he could behave differently in order to reach a desired result. Typically, citizens are only provided with the information as to how the legal situation is, but not what they could do to improve it, unless they contact a lawyer.

Commonly all these types of data already exist, if maybe not in one location. The most accessible information, technically speaking, is found in traditional legal sources, such as legislation and case-law. Again, the question here mainly concerns how to provide and utilize the existing information in a fashion understandable to the user.

“Like any other content transmitted through a communication system, primary legal sources can be rendered more or less understandable, locatable, and hence effective by structuring and presenting them differently for different audiences. And secondary sources must of course be constructed for a particular market, audience, or level of understanding.” (Tom Bruce)

Who should then be responsible for structuring, presenting and rendering it understandable, especially in the light of source criticism and trust?

Who?

Ignorantia juris neminem excusat presupposes that the legal information provided is correct and of high quality. Who can guarantee such a quality? The state, private entities, research facilities, non-governmental organizations or citizens? My answer would be that all could contribute their part of the game.

One should, however, keep in mind that user-friendliness is not the same as trustworthiness, which leads to the question of how to ensure that citizens are supplied with the right answers. In a world where even governments do not always take responsibility for the correctness of the information they provide, as in the case of online publication of law gazettes, the question remains who, or what entity, should be held liable for the accuracy of its services. But even if a public authority were to accept accountability, to what extent could that influence a legal decision that has already been reached?

The answer of who should provide a certain legal information service could also depend on who the target group of the information is.

“The legal information market is really no longer conceivable as bipolar – it can no longer be seen as a question of lawyers on the one hand versus a largely legally ignorant everyone else on the other. […] Internet-based legal information systems are used by many cases and conditions of people for many different reasons. […] Probably the most interesting group [are] non-lawyer professionals. These are people whose interest in law is vital, ongoing, and professional rather than either being casual and hobby-like or sporadic and trauma-driven. […] Such new and diverse audiences require new and diverse legal information architectures. They will want specialized collections of law of particular relevance to them. They will want those collections organized and presented in ways that reflect their profession or their situation, in ways that collections organized according to the legal abstractions and legal terms in use by lawyers do not. They are concerned with situations and fact-patterns rather than theories, doctrines, and concepts. They are, in short, a very intelligent and exciting type of lay users, and a potentially enormous audience.” (Tom Bruce)

Non-lawyer professionals probably constitute a large market for businesses that can tailor their services to a specific group and therefore render them profitable, as the services are considered of value for these professionals.

Traditional laypersons, however, typically do not represent a large market power simply because they will not always be willing to pay for services of this kind. This leaves them to the hands of other stakeholders such as public administrations, research institutes, non-governmental organizations and private initiatives. As already mentioned, conventionally the raw data is supplied by public administrations.  The question, then, is how to deliver it to the end-user.

How?

The Austrian civil code knows two concepts regarding fulfilling one’s part of the contract, Holschuld and Bringschuld. Holschuld means a debt to be collected from the debtor at his residence. Bringschuld constitutes an obligation to be performed at creditor’s habitual residence. In today’s terminology, one could compare Holschuld with pull technology and Bringschuld with push technology. In other words, should the citizens pick up the relevant legal information or should the government actively deliver it at people’s doorsteps, so to speak?


In the offline paper world, the only way to reach a citizen was to send a letter to her or his house. Obviously, information technology offers many more possibilities when it comes to communicating with citizens, either via a computer or even a mobile phone, taking privacy concerns into consideration.

Several e-government initiatives (such as the video feed from European Parliament sessions and the EU’s channel on YouTube) increase public participation in and insight into politics. While these programs are an important contribution to democracy, they typically do not help citizens with daily encounters with legal issues of employment, family, consumer affairs, taxes or housing, nor do they provide the information necessary to handle them.

In this respect, technologies enabling interactivity and re-use of public information are of greater importance, the latter also being a strategic concern of the European Union.  In particular, semantic technology offers solutions for transforming raw data into comprehensible information for citizens. Here, practical examples that utilize at least part of this technology can also be found within e-government projects as well as in private initiatives.

The next step would be to build law directly into code: intelligent agents negotiate the most advantageous terms and conditions for their owner, cars refuse to start if the driver exceeds the permitted alcohol level (ignition interlock devices), and songs do not play unless your device is authorized (iTunes).

So, from a technological point of view, anything from presenting legal information on a website to implementing law directly into the end device is possible. In practice, though, most governments are content with providing textual legal information, at best in a structured format so that it can be re-used more easily. The technical implementation of more advanced functions is often left to other market players and businesses.

There are two initiatives in this respect that are worth mentioning, one being a true private project in Sweden and the other one being provided by the Austrian government.

Lagen.nu (law now) has been around for some time now as a private initiative offering free access to Swedish legislation and case law. Recently the site was extended by adding commentaries to specific statutes, which should enable laypersons to understand certain legislation. The site includes explanations for certain terminology and particular comments are also categorized and include links to other laws and cases.

The other example, HELP, a service provided by the Austrian Government Agency, structures and presents legal information depending on the factual situation, e.g. it contains categories such as employment, housing, education, finances, family and social services. The relevant legal requirements are then explained in plain text and the responsible authority is listed and linked to.  In some cases the necessary procedure can even be initiated through the web site.

Both projects are fine examples of the possible transformation of legal information from pull to push technology. They are not quite there yet, though.

The answer

The question we are faced with now is not so much how, or which technique would be best, but rather in which situation a citizen might need certain legal information. Somebody trying to purchase a book via a web site might need information at that very moment; prompted by a warning text, a check list or an intelligent agent, the purchaser might go to another web site that has better ratings, more favorable legal terms and conditions, and no pending lawsuits. In other cases, the citizen might need certain information in a specific situation, right on the spot. For example, while filling out a form she or he might want to know what would be the most favorable choice, rather than simply the type of personal data required for the form. Depending on the situation, different approaches might be more valuable than others.

The larger issue at hand is where the information is retrieved and who is the provider of the information. In other words, trust is an important factor, particularly trust of the information provider. As previously stated, legal information is not usually provided by public bodies but instead is rerouted through various other entities, such as businesses, organizations and individual efforts. This increases the importance of source criticism even more.

In many cases citizens will use general portals such as Google or Wikipedia to search for information, rather than going directly to the source, most often because citizens are not aware of the services offered. This underlines the importance for legal information providers to co-operate with other communication channels in order to increase their visibility.

The necessary legal information is out there, it just remains to be seen if and how it reaches the citizens. Or to put it in other words: The prophet still has to come to the mountain, but in time, with the increasing use of technology, maybe the mountain will come a bit closer.


Christine Kirchberger has been a junior lecturer at the Swedish Law and Informatics Research Institute (Stockholm University) since 2001. Besides teaching law and IT, she is currently writing her PhD thesis on legal information as a tool, in which she focuses on legal information retrieval and the concept of legal information within the framework of the doctrine of legal sources, and also takes a look at the information-seeking behavior of lawyers.

VoxPopuLII is edited by Judith Pratt.


The Recipe for Better Legal Information Services

A new style of legal research

An attorney/author in Baltimore is writing an article about state bans of teachers’ religious clothing. She finds one of the tersely written statutes online. The website then does a query of its own and tells her about a useful statute she wasn’t aware of: one setting out the permitted disciplinary actions. When she views it, the site makes the connection clear by showing her where the second statute references the original. This new information makes her article’s thesis stronger.

Meanwhile, 2800 miles away in Oregon, a law student is researching the relationship between the civil and criminal state codes. Browsing a research site, he notices a pattern of civil laws making use of the criminal code, often to enact civil punishments or enable adverse actions. He then engages the website in an interactive text-based dialog, modifying his queries as he considers the previous results. He finally arrives at an interesting discovery: the offenses with the least additional civil burdens are white collar crimes.

A new kind of research system

A new field of computer-assisted legal research is emerging: one that encompasses research in both the academic and the practical “legal research” senses. The two scenarios above both took place earlier this year, enabled by the OregonLaws.org research system that I created and which typifies these new developments.

Interestingly, this kind of work is very recent; it’s distinct from previous uses of computers for researching the law and assisting with legal work. In the past, techniques drawn from computer science have been most often applied to areas such as document management, court administration, and inter-jurisdiction communication. Working to improve administrative systems’ efficiency, people have approached these problem domains through the development of common document formats and methods of data interchange.

The new trend, in contrast, looks in the opposite direction: divergently tackling new problems as opposed to convergently working towards a few focused goals. This organic type of development is occurring because programming and computer science research is vastly cheaper—and much more fun—than it has ever been in the past. Here are a couple of examples of this new trend:

“Computer Programming and the Law”

Law professor Paul Ohm recently wrote a proposal for a new “interdisciplinary research agenda” which he calls “Computer Programming and the Law.” (The law review article is itself also a functioning computer program, written in the literate programming style.) He envisions “researcher-programmers,” enabled by the steadily declining cost of computer-aided research, using computers in revolutionary ways for empirical legal scholarship. He illustrates four new methods for this kind of research: developing computer programs to “gather, create, visualize, and mine data” that can be found in diverse and far-flung sources.

“Computational Legal Studies”

Grad students Daniel Katz and Michael Bommarito (researcher-programmers, as Paul Ohm would call them) created the Computational Legal Studies Blog in March 2009. The web site is a growing collection of visualizations applied to diverse legal and policy issues. The site is part showcase for the authors’ own work and part catalog of the current work of others.

OregonLaws.org

I started the OregonLaws.org project because I wanted faster and easier access to the 2007 Oregon Revised Statutes (ORS) and other primary and secondary sources. I had a couple of very statute-heavy courses (Wills & Trusts, and Criminal Law) and I frequently needed to quickly find an ORS section. But as I got further into the development, I realized that it could become a platform for experimenting with computational analysis of legal information, similar to the work being done on the Computational Legal Studies Blog.

I developed the system using pretty much the steps that Paul Ohm discussed:

  1. Gathering data: I downloaded and cleaned up the ORS source documents, converting them from MS Word/HTML to plain text;
  2. Creating: I parsed the texts, creating a database model reflecting the taxonomy of the ORS: Volumes, Titles, Chapters, etc.;
  3. Creating: I created higher-level database entities based on insights into the documents, for example by modeling textual references between sections explicitly as reference objects that capture a relationship between a referrer and a referent; and
  4. Mining and Visualizing: Finally, I’ve begun making web-based views of these newly found objects and relationships.
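
The reference modeling in step 3 can be sketched roughly as follows. This is a minimal illustration in Python, not the actual OregonLaws.org code: the Section and Reference classes, the find_references and referrers_of helpers, the citation pattern, and the sample section texts are all assumptions invented for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for citations such as "ORS 163.005"; real statute
# text uses more varied citation forms than this sketch handles.
CITATION = re.compile(r"ORS (\d+\.\d+)")


@dataclass(frozen=True)
class Section:
    number: str  # e.g. "163.005"
    text: str


@dataclass(frozen=True)
class Reference:
    referrer: str  # number of the section that contains the citation
    referent: str  # number of the section being cited


def find_references(sections):
    """Return one Reference per citation of a known section."""
    known = {s.number for s in sections}
    refs = []
    for s in sections:
        for cited in CITATION.findall(s.text):
            # Skip citations of unknown sections and self-citations.
            if cited in known and cited != s.number:
                refs.append(Reference(referrer=s.number, referent=cited))
    return refs


def referrers_of(number, refs):
    """Sections that cite the given section (a 'cited by' view)."""
    return sorted({r.referrer for r in refs if r.referent == number})


# Invented sample data, not real statute text.
sections = [
    Section("100.010", "A teacher may not wear religious dress."),
    Section("100.020", "A violation of ORS 100.010 may result in suspension."),
    Section("200.030", "See ORS 100.010 and ORS 100.020 for definitions."),
]
refs = find_references(sections)
print(referrers_of("100.010", refs))  # ['100.020', '200.030']
```

Storing references as objects of their own, rather than leaving them buried in the text, is what makes the reverse lookup ("which sections cite this one?") a one-line query.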

The object database is the key to intelligent research

By taking the time to go through the steps listed above, powerful new features can be created. Below are some additions to the features described in the introductory scenarios:

We can search smarter. In a previous VoxPopuLII post, Julie Jones advocates dropping our usual search methods, and applying techniques like subject-based indexing (a la Factiva’s) to legal content.

This is straightforward to implement with an object model. The Oregon Legislature created the ORS with a conceptual structure similar to most states:  The actual content is found in Sections.  These are grouped into Chapters, which are in turn grouped into Titles.  I was impressed by the organization and the architecture that I was discovering—insights that are obscured by the ways statutes are traditionally presented.

search-filter.png

And so I sought out ways to make use of the legislature’s efforts whenever it made sense.  In the case of search results, the Title organization and naming were extremely useful.  Each Section returned by the search engine “knows” what Chapter and Title it belongs to. A small piece of code can then calculate what Titles are represented in the results, and how frequently. The resulting bar graph doubles as an easy way for users to specify filtering by “subject area”. The screenshot above shows a search for forest.
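
The "small piece of code" mentioned above might look something like this sketch. It assumes each search hit is a plain dictionary carrying its Title name; the function names, section numbers and Title names are made up for illustration and do not come from the real site.

```python
from collections import Counter

# Hypothetical search results for "forest": each hit knows its Title.
results = [
    {"section": "1.010", "title": "Forestry"},
    {"section": "1.020", "title": "Forestry"},
    {"section": "1.030", "title": "Forestry"},
    {"section": "2.010", "title": "Wildlife"},
    {"section": "2.020", "title": "Wildlife"},
    {"section": "3.010", "title": "Taxation"},
]


def title_facets(results):
    """Count how often each Title appears in the results, most frequent first."""
    counts = Counter(hit["title"] for hit in results)
    return counts.most_common()


def filter_by_title(results, title):
    """Keep only the hits that belong to the chosen Title."""
    return [hit for hit in results if hit["title"] == title]


# A plain-text stand-in for the bar graph.
for title, n in title_facets(results):
    print(f"{title:<10} {'#' * n} ({n})")
```

Clicking a bar in the graph would then correspond to calling filter_by_title with that bar's Title.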

The ORS’s framework of Volumes, Titles, and Chapters was essentially a tag cloud waiting to be discovered.

We can get better authentication. In another VoxPopuLII post, John Joergensen discussed the need for authentication of digital resources. One aspect of this is showing the user the chain of custody from the original source to the current presentation. His ideas about using digital signatures are excellent: a scenario of being able to verify an electronic document’s legitimacy with complete assurance.

glossary-citations.png

We can get a good start towards this goal by explicitly modeling content sources. A source is given attributes for everything we’d want to know to create a citation: the date it was last accessed, the URL at which it is available, and so on. Every content object in the database is linked to one of these source objects. Now, every time we display a document, we can create properly formatted citations to the original sources.
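
As a rough sketch of this idea, assuming a much simpler model than the real database: the Source and ContentObject classes, the cite function, the citation wording, and the example.org URL below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    name: str            # human-readable name of the original publication
    url: str             # where the original was retrieved
    last_accessed: date  # when it was last retrieved


@dataclass(frozen=True)
class ContentObject:
    heading: str
    body: str
    source: Source  # every content object links to exactly one source


def cite(item):
    """Build a citation string from the source linked to a content object."""
    s = item.source
    return (
        f"{s.name}, available at {s.url} "
        f"(last accessed {s.last_accessed.isoformat()})."
    )


# Placeholder values for illustration only.
statutes = Source(
    name="Example Revised Statutes",
    url="https://example.org/statutes/chapter-1",
    last_accessed=date(2009, 3, 5),
)
section = ContentObject("1.010 Definitions", "As used in this chapter ...", statutes)
print(cite(section))
```

Because the citation is derived from the source object rather than typed by hand, updating a source's URL or access date updates every citation that depends on it.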

The gather/create/mine/visualize and object-based approaches open up so many new possibilities that they can’t all be discussed in one short article. It sometimes seems that each new step taken enables previously unforeseen features. A few of these others are new documents created by re-sorting and aggregating content, web service APIs, and extra annotations that enhance clarity. I believe that in the end, the biggest accomplishment of projects like this will be to raise our expectations for electronic legal research services, increase their quality, and lower their cost.

Robb Shecter is a software engineer and third-year law student at Lewis & Clark Law School in Portland, Oregon. He is Managing Editor for the Animal Law Review, plays jazz bass, and has published articles in Linux Journal, Dr. Dobb’s Journal, and Java Report.



Tidying Up the Law

The recent attention given to government information on the Internet, while laudable in itself, has been largely confined to the Executive Branch. While there is a technocratic appeal to cramming the entire federal bureaucracy into one vast spreadsheet with a wave of the president’s Blackberry, one cannot help but feel that this recent push for transparency has ignored government’s central function, to pass and enforce laws.

Advertisement on data.gov

Whether seen from the legislative or judicial point of view, law is a very prose-centric domain. This is a source of frustration to the mathematicians and computer scientists who hope to analyze it. For example, while the United States Code presents a neat hierarchy at first glance, closer inspection reveals a sprawling narrative, full of quirks and inconsistencies. Even our Constitution, admired worldwide for its brevity and simplicity, has been tortured with centuries of hair-splitting over every word.

Nowhere is this more apparent than in judicial opinions. Unlike most government employees, who must adhere to rigid style manuals; or the general public, who interact with their government almost exclusively through forms; judges are free to write almost anything. They may quote Charles Dickens, or cite Shakespeare. A judicial opinion is one part newspaper report, one part rhetorical argument, and one part short story. Analyzing it mathematically is like trying to understand a painting by measuring how much of each color the artist used. Law students spend three years learning, principally, how to tease meaning out of form, fact out of fiction.

Why does a society, in which a President can be brought down by the definition of “is,” tolerate such ambiguity at the heart of its legal system? (And why, though we obsessively test our children, our athletes, and our attorneys, is our testing of judges such a farce?)

Engineers such as myself cannot tolerate ambiguity, so we feel a natural desire to bring order out of this chaos. The approach du jour may be top-down (taxonomy, classification) or bottom-up (tagging, clustering) but the impulse is the same: we want to tidy up the law. If code is law, as Larry Lessig famously declared, why not transform law into code?

Visualization of the structure of the U.S. Code

This transformation would certainly have advantages (beyond putting law firms out of business). Imagine the economic value of knowing, with mathematical certainty, exactly what the law is. If organizations could calculate legal risk as efficiently as they can now calculate financial risk (recession notwithstanding), millions of dollars in legal fees could be rerouted toward economic growth. All those bright liberal arts graduates who suffer through law school, only to land in dismal careers, could apply themselves to more useful and rewarding occupations.

And yet, despite years of effort, years in which the World Wide Web itself has submitted to computerized organization, the law remains stubbornly resistant to tidying. Why?

There are two answers, depending on what goal we have in mind. If the goal is truly to make tenets of law provable by mechanical (i.e., algorithmic) means, just as the tenets of mathematics are, we fail before we begin. Contrary to lay perception, law is not an exact science. It’s not a science at all (says a lawyer). Computers can answer scientific questions (“What is the diameter of Neptune?”) or bibliographic ones (“What articles has Tim Wu written?”) but cannot make value judgments. Law is all about value judgments, about rights and wrongs. Like many students of artificial intelligence, I believe that I will live to see computers that can make these kinds of judgments, but I do not know if I will live to see a world in which we let them.

The second answer speaks to the goal of information management, and the forms in which law is conveyed. The indexing of the World Wide Web succeeded for two reasons, form and scale. Form, in the case of the Web, means hypertext and universal identifiers. Together, they create a network of relationships among documents, a network which, critically, can be navigated by a computer without human aid. This fact, when realized at the scale of billions of pages containing trillions of hyperlinks, allows a computer to derive useful patterns from a seemingly chaotic mass of information.
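
To make "navigated by a computer without human aid" concrete, here is a small sketch of one well-known way to derive a pattern from links alone: a simplified PageRank-style iteration. The post does not name any particular algorithm, so this choice is an assumption made for illustration, as are the function name rank_pages, the damping value, and the four-document toy graph.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score documents by who links to them.

    links maps each document to the list of documents it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every document keeps a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A document passes its score on to the documents it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A document with no outgoing links spreads its score evenly.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank


# Toy graph: three documents link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = rank_pages(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

No human tells the program that "c" matters; the score falls out of the link structure, which is only possible when the documents are hypertext with stable identifiers.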

3-d visualization of hypertext documents in XanaduSpace™

Law suffers from inadequacies of both form and scale. For example, all federal case law, taken together, would comprise just a few million pages, only a fraction of which are currently available in free, electronic form. In spite of the ubiquity of technology in the nation’s courts and legislatures, the dissemination of law itself, both statutory and common, remains a paper-centric, labor-intensive enterprise. The standard legal citation system is derived from the physical layout of text in bound volumes from a single publisher. Most courts now routinely publish their decisions on the Web, but almost exclusively in PDF form, essentially a photograph of a paper document, with all semantic information (such as paragraph breaks) lost. One almost suspects a conspiracy to keep legal information out of the hands of any entity that lacks the vast human resources needed to reformat, catalog, and cross-index all this paper (in essence, to transform it into hypertext). It’s not such a far-fetched notion; if law were universally available in hypertext form, Google could put Wexis out of business in a week.

Social network of federal judges based on their clerks

But the legal establishment need not be quite so clannish with regard to Silicon Valley. For every intellectual predicting law’s imminent sublimation into the Great Global Computer, there are a hundred more keen to develop useful tools for legal professionals. The application is obvious; lawyers are drowning in information. Not only are dozens of court decisions published every day, but given the speed of modern communications, discovery for a single trial may turn up hundreds of thousands of documents. Computers are superb tools for organizing and visualizing information, and we have barely scratched the surface of what we can do in this area. Law is created as text, but who ever said we have to read it that way? Imagine, for example, animating a section of the U.S. Code to show how it changes over time, or “walking” through a 3-d map of legal doctrines as they split and merge.

Of course, all this is dependent on programmers and designers who have the time, energy, and financial support to create these tools. But it is equally dependent on the legal establishment — courts, legislatures, and attorneys — adopting information-management practices that enable this kind of analysis in the first place. Any such system has three essential parts:

  1. Machine-readable documents, e.g., hypertext
  2. Global identifiers, e.g., URIs
  3. Free and universal access
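
As one illustration of how the first two parts fit together, the sketch below gives each provision a stable identifier and emits it as a hypertext link. The URI layout, the law.example.org host, and the Provision class are purely hypothetical; no existing or proposed standard is implied.

```python
from dataclasses import dataclass
from html import escape
from urllib.parse import quote

BASE = "https://law.example.org"  # placeholder host, not a real service


@dataclass(frozen=True)
class Provision:
    jurisdiction: str  # e.g. "us"
    code: str          # e.g. "usc"
    title: int
    section: str

    def uri(self):
        """Build a stable identifier from the provision's place in the hierarchy."""
        parts = [
            self.jurisdiction,
            self.code,
            f"title-{self.title}",
            f"section-{self.section}",
        ]
        return BASE + "/" + "/".join(quote(p.lower()) for p in parts)

    def link(self, label):
        """Render a machine-readable hyperlink to this provision."""
        return f'<a href="{self.uri()}">{escape(label)}</a>'


fair_use = Provision("us", "usc", 17, "107")
print(fair_use.uri())
print(fair_use.link("17 U.S.C. § 107"))
```

The point of the design is that the identifier depends only on the law's own structure, not on page numbers or on any one publisher's volumes.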

These requirements are not technically difficult to understand, nor arduous to implement. Even a child can do it, but the establishment’s (well-meaning) attempts have failed both technically and commercially. In the meantime, clever engineers, who might tackle more interesting problems, are preoccupied with issues of access, identification, and proofreading. (I have participated in long, unfruitful discussions about reverse-engineering page numbers. Page numbers!) With the extremely limited legal corpora available in hypertext form (at present, only the U.S. Code, Supreme Court opinions, and a subset of Circuit Court opinions), we lack sufficient data for truly innovative research and applications.

This is really what we mean when we talk about “tidying” the law. We are not asking judges and lawyers to abandon their jobs to some vast, Orwellian legal calculator, but merely to work with engineers to make their profession more amenable to computerized assistance. Until that day of reconciliation, we will continue our efforts, however modest, to make the law more accessible and more comprehensible. Perhaps, along the way, we can make it just a bit tidier.

Stuart Sierra is the technical guy behind AltLaw. He says of himself, “I live in New York City. I have a degree in theatre from NYU/Tisch, and I’m a master’s student in computer science. I work for the Program on Law & Technology at Columbia Law School, where I spend my day hacking on AltLaw, a free legal research site. I’m interested in the intersection of computers and human experience, particularly artificial intelligence, the web, and user interfaces.”



Is Free Access to Law here to stay?

Recently, LexUM, SAFLII and friends commenced a global study on Free Access to Law. It poses the pertinent question: is free access to law here to stay? The goal of the project is to produce a best-practices handbook, collect open-access case studies, and publish an online library on the subject.  The ultimate aim of all these activities is to enable future free access to law projects to choose best practices and adapt to local contexts that may have more in common than it initially appears.

This April we kicked off the project in the beautiful African bush with three days of introspection, sharing (for the teenage LIIs) and learning (for the toddler LIIs).

The project seeks to study and link two central concepts – the concept of success of a free access to law project and the concept of sustainability. Ivan Mokanov, who wrote the original project proposal, puts forward a simple thesis that relates the two:

By making law freely available, a legal information institute (LII) produces outcomes that benefit its target audience, thereby creating incentives among the target audience or other stakeholders to sustain the LII’s ongoing operations and development.

Linking free delivery of legal information to core benefits such as  support for the rule of law, open and accountable government and the importance of reducing insecurity in economic life can be difficult. Defining the subtler aspects of success thus involves exploration and new methodologies.

In a broad definition, sustainability is seen as the ability to deliver services that provide sufficient value to their target audience, so that either that audience or other stakeholders acting on its behalf choose to fund the ongoing operation and evolution of that service.

The project looks at a sustainability chain:

The Sustainability Chain Cycle

The words and brilliant logic of fictional psychopath Hannibal Lecter to Clarice Starling in The Silence of the Lambs  might serve as a guide to the sustainability chain:

First principles, Clarice. Simplicity. Read Marcus Aurelius. Of each particular thing ask: what is it in itself? What is its nature? What does he do, this man you seek?

The Need
Start with a need or a problem which a LII has to address. An often seen example in my part of the world is a country completely lacking any structures for providing legal information, even commercial print publishers. Sometimes, too, legal information is available, and sometimes freely so, via official printers, government bodies or other creators of the information, but that availability does not necessarily equate to usability.

Different stages of development highlight different needs, and the sustainability chain gives equal weight to addressing each. A free access to law project is just as successful if it manages to provide up-to-date information to judges who until recently applied the law from their 1970s law school textbooks as it is if it provides a state-of-the-art point-in-time legislation service to practicing lawyers. There seems to be an agreement that, as it grows through its stages of development (as Tom Bruce kindly defined them: establishment, incubation and “going concern”), a sustainable LII closes a chapter of success and continues on to respond to a new need within its target audience.

The Environment
The context in which an LII operates is equally important in determining the success and subsequently the sustainability of a free access to law operation. An LII will thrive in an environment that provides rich data sources, and is amenable to and capable of reform and change, with a policy and legal framework favourable to the free dissemination of legal information. To bring this to the nitty-gritty level – a free access to law project needs support all the way from the political top down to the secretary and clerk of the court.

An important environmental indicator is the availability of an infrastructure to support circulation of information. LIIs operating in developing countries often face the roadblock of lack of technology or lack of knowledge of the use of technology at the source. While computers have made their way into most judges’ chambers and courtrooms, most are not connected to the Internet, or if they are, the connections are so slow as to bar convenient use. A judge once relayed that in the rural areas, courts, well-computerized using donor money, are unable to make use of the technology due to lack of electricity. It is not unheard of for a clerk or court secretary to delete judgments from a computer once they have been printed, thus leaving one single hard copy of a judgment for the court files.

The work of developing countries’ LIIs, as aptly pointed out by my colleague Kerry Anderson in a post below, often involves getting information from this:

The Gutenberg Press into this Data

This inevitably raises a couple of open questions for a “toddler” LII:

Who should foot the bill for the expensive digitization of legal material?

It has been the norm since the Middle Ages that the rulers of the land make the laws known to the subjects and citizens (I tend to include lawyers and government officials in the plebs). But, assuming that it even recognizes this obligation, as the government of South Africa does, how does a cash-strapped government of a developing country fulfill it? Donors donating directly to government to set up law reporting or print legislation? Donors via support for a LII? The interested public via support for a LII?

Would a successful strategy for a LII be to, first and foremost, address the policy and technology issues of provision of digital information?

Digitization of print materials and/or manual capturing of metadata, for example, cannot be deemed a successful strategy in the long run: it is simply uneconomical to continue to do so past a certain stage. Engaging stakeholders in education on the use of technology, or in the development of IT solutions to support workflows for delivering judgments or passing legislation, may be a way of dealing with the issues of digitizing and automating the delivery of law to the public. Standards for the preparation of legal material, such as the ones developed by the Canadian Citation Committee or the endeavours of the Africa i-Parliaments project, if adopted by all originators of legal information in a particular jurisdiction, will ease its dissemination and re-use. Campaigning for the passing of policy or legislation, such as the PSI Directive, may be another strategy for a LII to enhance the efficacy of its operations.

How does an NGO engage with government over the digitization strategy?

The Ability

The need is recognized, we have a favourable environment for operations. Now, how does a lawyer, frustrated by the lack of user-friendly legal resources, with little to no know-how, convert the circumstances to a successful and sustainable free access to law project?

An LII needs to have the ability to respond to the need for legal information on at least two levels: organizational and technological. At the organizational level, a free access to law project needs to have an organizational structure and support that respond correctly to the context in which the project operates.

What are the benefits of the different organizational structures? Are they dependent on the development stage of an LII?

Among the members of the project’s workshop group, there seemed to be an agreement on the need for any new access to law project to possess adequate technology (in general),  and to possess or have access to legal information technology and expertise. The right technological and information-standards choices for particular environments are crucial to the correct response to the audience’s need.  While this carries a positive connotation, an emerging free access to law project needs to learn also to respond to user feedback in a measured way, and sometimes to even say NO.

Availability of Financial Resources

This couple-of-hundred-thousand-dollar question is one that every LII – from babies to teenagers – asks every financial cycle. Free access to law is quite expensive. While the product can be free, as in free beer, the process of creating the value and benefit to the audience can be quite expensive.

The ability of any LII to financially plan a few years ahead is crucial to the success of the project. It is important to have an audience that has identified value in a free access to law project – be it lawyers (CanLII and to an extent AustLII), or in addition to the preceding, advertisers and aligned services (LII), or government (Kenya Law Reports), or international donors (SAFLII), or individuals and lawyers (OpenJudis). Holding on to an audience and maintaining its willingness to pay relies on maintaining and improving existing product value. It is also crucial that money is spent in the right places.

The Handbook to be produced at the end of this study will identify how successful LIIs allocate budgets according to the value produced, and how, and by whom, the creation of specific benefits is supported.

The next 6-8 months
In the next six to eight months the research team consisting of Dorsaf El Mekki (LexUM), Bobson Coulibaly (JuriBurkina, West Africa), Prashant Iyengar (Centre for Internet and Society, India), myself at SAFLII and a number of researchers from Namibia, Uganda, Zimbabwe and Kenya, will endeavour to collect data via interviews, questionnaires, case studies and web statistics on a number of indicators to reinforce the assumptions and prove the hypotheses of the study into free access to law – a study very much needed.

Mariya Badeva-Bright is the Head of Legal Informatics and Policy at the Southern African Legal Information Institute (SAFLII). She is a sessional lecturer at the School of Law, University of the Witwatersrand, South Africa, where she teaches the LLM course in Cyberlaw and the undergraduate course in Legal Information Literacy. Ms. Badeva-Bright received her Magister Iuris degree from Plovdiv University, Bulgaria, and an LL.M degree in Law and Information Technology from Stockholm University, Sweden.

VoxPopuLII is edited by Judith Pratt.

Bentham and the Privacy of the Grave

I first met Jeremy Bentham as a newly arrived philosophy student walking through the South Cloisters of University College London.  Behind the plate glass of a huge mahogany case, I looked in upon a seated life-size wax figure of a man in an 18th Century coat and knee britches, happily wearing a straw hat.  Only it was not Bentham’s wax figure; it was his embalmed corpse – his “auto-icon.”  Apparently, Bentham’s will left his executor no choice but to have his body stuffed and placed on public display.  There he has been ever since.

Bentham famously believed that publicity was the key to truth.  His ideal was a Panoptic  universe, where all in the world would believe themselves to be constantly observed, listened to, and monitored.  Thus all would become good — or at least would behave (which, for Bentham, amounted to the same thing).  Bentham felt that claims to privacy were no more real or substantial than claims to natural rights, which he despised as pernicious fictions.  Both were harmful beliefs that entrenched privilege and maintained humankind in its misery.  Publicity was the key to truth and human happiness. 

It is easy to make fun of Bentham’s ideas. But much of what Bentham meant to address in the context of his Panoptic structures we now take for granted. In Bentham’s lifetime, Parliamentary deliberations were confidential. Bentham’s arguments forced them into the sunlight. Legal decisions and statute books were accessible only to lawyers and judges. Bentham’s arguments led to codification of the law, and increasingly accessible legal rules. Bentham was far ahead of his time: the first modern information theorist. The idea that all actions of government would be presumptively available for public review did not become part of U.S. law until the Freedom of Information Act (FOIA) took effect in 1967. As we speak, it appears the English parliament is only now learning Bentham’s message about publicity.

Bentham’s contemporary William Blackstone celebrated the fact that “private vices” were beyond the jurisdiction of the state. Privacy for him was an organizing principle of civilized society. But Blackstone believed in an all-seeing God to whom we would be accountable even for our private sins and thoughts. Bentham, a thoroughgoing atheist, hated Blackstone and all he stood for. For him the logical truth remained that people who believed themselves to be monitored behaved more responsibly than those who believed themselves to be alone. So Bentham asked himself: in the absence of God, how can a secular society operate without perpetuating Panoptic structures of surveillance? When Michel Foucault argued that Bentham’s Panoptic structures had become essential to the functioning of a modern secular state, he did not claim originality for the insight.

But why do our intuitions revolt?  What can our brains say to explain this revulsion?  What is so important about privacy?  Judge Posner has pointed out that when people are given a right to privacy, they use it to conceal discrediting information about themselves from others – and consequently mislead and defraud them.  In a world increasingly characterized by exchanges of information, should we not all just abandon the attempt to maintain privacy, and embrace the Panopticon?  We are, of course, all familiar with the dark side of the Panopticon – the fictional surveillance state of George Orwell’s 1984, or the actual surveillance states in Eastern Europe in the second half of the last century.  But as Bentham knew, and his modern disciple David Brin has explained at greater length, the Orwellian nightmare state is impossible when the Panopticon works both ways – when the government itself is watched – when the surveillor knows himself to be surveilled.  Still, our intuitions rebel, but we are unable to respond to Bentham’s utilitarian logic.

So let us return again to the South Cloisters of University College and the question we began with: what could have possessed Bentham to do what he did with his last remains?  The answer seems to be compelled by the same bloodless logic the man applied in all other aspects of his life.  Bentham, the great apostle of publicity, rejected even the privacy of the grave – he remains the eternal observer, continuing his surveillance of the living from his perch among the dead.

Further reading:

OF PUBLICITY AND PRIVACY, AS APPLIED TO JUDICATURE IN GENERAL, AND TO THE COLLECTION OF THE EVIDENCE IN PARTICULAR. - Jeremy Bentham, The Works of Jeremy Bentham, vol. 6 [1843]

 Bentham’s Panopticon Letters

Peter A. Winn has served as an Assistant U.S. Attorney in the United States Department of Justice since 1994. He is also a part-time lecturer at the University of Washington Law School, where he teaches privacy law and health care fraud and abuse, and is a Senior Fellow at the University of Melbourne, where he teaches cybercrime. The views represented in this article are Mr. Winn’s personal views and not those of the United States Department of Justice.

VoxPopuLII is edited by Judith Pratt.

A Law Librarian Looks at Legal Informatics Scholarship

Recently I, like many law librarians (including Dean Richard Danner, James Donovan, and the panelists at the University of South Carolina School of Law’s colloquium on “The Law Librarian’s Role in the Scholarly Enterprise” [scroll down & click on “Part 9: Roundtable”]), began to devote more thought to disintermediation in legal information services. One way that law librarians can adapt to disintermediation is by learning more about the study of legal information systems, that is, legal informatics. When I began looking closely at legal informatics scholarship last fall, I was dismayed at not being able to locate any single resource that aggregated all of the major scholarly information resources in the field. As a result, I decided to build one; it’s called Legal Information Systems & Legal Informatics Resources. To provide current information, the site has an accompanying blog, the Legal Informatics Blog, and a Twitter feed. Building these sites has allowed me to cast a novice’s eye on the field of legal informatics.

Here is what I’ve glimpsed in the past few months:

I. Surveying the Sources

My exploration of legal informatics has focused initially on information resources. A relatively circumscribed set of scholarly journals, other article sources, preprint services, indexing & abstracting services, blogs, and listservs regularly report research results in legal informatics. A small set of subject headings will retrieve most monographs and dissertations in the field. Accordingly, aggregating access to these resources has been relatively easy, and automating discovery and delivery of many of these sources seems feasible sooner rather than later.

Conferences are trickier.   The number of conferences at which legal informatics issues are addressed is substantial, for several reasons: a large number of researchers from industry as well as academia (see, e.g., the lists of individuals compiled by Dr. Adam Wyner and the organizers of the DEON deontic logic conferences, and this list of departments & institutes), energetically engaged in applied as well as theoretical research, are producing a sizeable output; many of those researchers work in multiple fields; and the pace of technological change is accelerating the research and communication processes.  Several Websites, such as those of the International Association for Artificial Intelligence and Law (IAAIL) and the DEON deontic logic conferences, monitor these meetings, however. Access to proceedings is available from several sources, including ACM’s Portal service, the other major information science indexing services, OCLC WorldCat, and the Legal Information Systems site. As a result, access to most legal informatics conference information and proceedings can be streamlined and hopefully largely automated before too long.

Projects have proven even trickier. Much legal informatics research takes the form of grant-funded projects, of which a great number, particularly in Europe, have been undertaken during the past decade. Political integration in Europe and democratization in many regions encouraged certain governments during the past two decades to fund applied research on legal information systems. Identifying and linking to all of these legal informatics projects seems important for enabling access to legal informatics scholarship. Such a process is quite labor intensive, however, because of the great number of such projects, the lack of a comprehensive list of them, and the many languages in which project documentation is written. A long-term goal of the Legal Information Systems site is to build a database of as many of these projects as can be identified, with links to project Websites, deliverables, and publications.

Since standards and protocols, such as those respecting descriptive metadata and knowledge representation, and data sets constitute additional key resources for legal informatics research, links to many of them have been collected on the Legal Information Systems site. Because many researchers in the field focus on a particular research topic or category of legal information, aggregations of resources on major topics in the field, such as e-rulemaking, evidence, and information behavior, to which the Legal Information Systems site has dedicated pages, and argumentation, to which Dr. Adam Wyner’s blog devotes several pages, may yield efficiencies for researchers. In addition, collections of resources on applied topics such as citation standards, computer-assisted legal research (CALR) services, court technology, the Free Access to Law movement (discussed here by Ginevra Peruginelli & Enrico Francesconi of ITTIG-CNR, with links to resources here), institutional repositories, instructional technology, law practice technology, and open access may be of use to researchers and practitioners alike.

II. Detecting a Communications Gap

From a preliminary scan of the field of legal informatics I’ve learned that legal informaticists and law librarians do not appear to be communicating to any significant extent. For example, law librarians seem to play little or no role at legal informatics conferences and are rarely published in legal informatics journals. (Sarah Rhodes & Dana Neacsu’s recent paper seems an exception.) This seems particularly odd, given that law libraries are developing some of today’s most innovative digital legal information systems, such as the Chesapeake Project Legal Information Archive (a project of the Georgetown University Law Library, the Maryland State Law Library, the Virginia State Law Library, and the Legal Information Preservation Alliance), the Law Library of Congress’s Global Legal Information Network (GLIN), the Harvard Law School Library’s Digital Collections, the digital law libraries created by the Rutgers Camden and Rutgers Newark law libraries, and the USC Law Library’s English Medieval Legal Documents Wiki. Law library scholarship — although it often addresses legal informatics topics such as legal citation (as in studies that reveal information resources utilized by courts), legal information behavior (as in the work of Dean Joan Howland & Nancy Lewis, Dr. Yolanda Jones, and Judith Lihosit ), and the functioning or design of legal information systems such as computer assisted legal research (CALR) services (as in recent studies by Julie Jones, John Doyle, and Dean Mason) — rather infrequently refers to legal informatics scholarship. That is, two communities of experts respecting the same subject — legal information systems — seem for the most part to be talking past each other.

Yet information sharing between law librarians and legal informaticists would substantially benefit both groups.   Law librarians would gain valuable insights into the functioning of the legal information systems they use every day and the likely direction of the legal information industry, as may be gleaned from recent monographs collecting conference papers in the field as well as from the program of the 2009 International Conference on Artificial Intelligence and Law (ICAIL 2009).   Those works show that the primary topics of recent legal informatics scholarship include argumentation and deontic logic (as discussed, for example, in recent dissertations by Dr. Adam Wyner & Dr. Régis Riveret); agent/multi-agent systems; decision support systems; document modeling; several natural language processing issues including multi-language systems, text mining including automated classification and indexing, summarization, segmentation, and information retrieval, as, for example, discussed in proceedings of the TREC Legal Track, and notably in the context of electronic discovery; other applied research topics, particularly concerning e-rulemaking, online dispute resolution, negotiation systems, digital rights management, electronic commerce and contracts, and evidence; and the use of XML, ontologies, and the development of the Semantic Web respecting legal information.

By cooperating with law librarians, legal informaticists for their part would gain access to expert users of legal information systems, quality input respecting the contexts of legal information use (ranging from the information lifecycle to the information behavior of lawyers), and ideas for further research.

Here are some specific suggestions respecting how law librarians could make meaningful contributions to legal informatics research.   First, law librarians could continue to perform legal information behavior research, building on the important recent activity in this area. Second, law librarians who are developing innovative legal information systems could present papers on those systems at legal informatics conferences and write articles about those systems for legal informatics journals.

Third, as expert users of legal information systems and close observers of lawyers, judges, law students, and lay users of legal information, law librarians could generate legal informatics research questions based on their experience and observations. For example, law librarians could recommend research on such little-studied but important legal information systems as conflict of interest control systems and bankruptcy claims agents’ Websites, or on the application of information science and computer science concepts to legal information systems errors, such as those arising from faulty legal drafting practices and overly complex statutory and regulatory schemes.

Fourth, law librarians could provide legal informaticists with expert practitioner and policy perspectives on issues that law librarians have prioritized as a profession, such as authentication, digital preservation, metadata content and management, and user interface design.   Fifth, law librarians could furnish legal informatics researchers with input respecting system capabilities from the vantage of an “expert user,” as Dr. Stephann Makri recently did by including law librarians in his study of lawyers’ information behavior.

Sixth, law librarians engaged in developing innovative digital legal information systems could partner with legal informaticists to study those systems. Seventh, law librarians who are also lawyers could contribute their knowledge of substantive and procedural law to legal informatics research projects, particularly where not all of the legal informaticists involved have legal training.

Finally, law librarians could draw on their in-depth knowledge of legal information systems and users to partner with legal informaticists on the design of research studies.   In particular, those law librarians with training in social science research methods could encourage legal informaticists to employ those methods in their studies of legal information systems, which might benefit from increased use of multiple methodologies.

III. Bright Prospects

Greater cooperation between legal informaticists and law librarians would benefit both communities.  The Legal Information Systems site will be developed with an eye toward demonstrating and fostering that cooperation.

Robert Richards edits Legal Information Systems & Legal Informatics Resources and its accompanying blog, the Legal Informatics Blog, and Twitter feed.

VoxPopuLII is edited by Judith Pratt.

Search Result Lists are Dead To Me

Most legal publishers, both free and fee, are primarily concerned with content. Regardless of whether they are academic or corporate entities providing electronic access to monographs, the free providers of the world giving primary source access, Westlaw or Lexis (hereinafter Wexis) providing access to both primary and secondary sources, or any other legal information deliverer, content has ruled the day. The focus has remained on the information in the database. The content. The words themselves.

If trends remain stable, primary source content, at least among politically stable jurisdictions, will be a given. Everyone will have equal access to the laws, regulations, and court decisions of their country online. In the U.S., new free open source access points are emerging every day. Here, the public currently has their choice of LII, Justia, Public Library of Law, AltLaw, FindLaw, PreCYdent, and most recently, OpenJurist, to discover the law. And hopefully, that content will be official and authentic.

The issue then refocuses to secondary sources and user interfaces. These will be where the battle lines will be drawn among legal publishers. Both assist in making meaning out of primary sources, though in fundamentally different ways. Secondary sources explain, analyze, and provide commentary on the law. They can be highly academic and esoteric, or provide nuts and bolts instructions and guidance. They also include finding aids to primary sources, like annotations to statutes, indexes, headnotes, citator services, and the like. While access to government-produced primary sources is a right, access to secondary sources is not, although for lay persons and lawyers alike, primary sources alone are typically insufficient to fully understand the law. I leave the not insignificant issue of secondary sources for another day, and focus here on content access and the user interface.

“The eye is the best way for the brain to understand the world around us.”

— Quote reported identically by multiple users on Twitter from a recent talk by Dr. Ben Shneiderman at the #nycupa.

Despite the advances made in adding legal online content, equal attention has not been given to how users may optimally access that content to fulfill their information-seeking needs. We continue to use the same basic Boolean search parameters that we have used for nearly fifty years. We continue to presume that sorting through search result lists is the best way to find information. We continue to presume that research is simply a matter of finding the right keywords and plugging them into a search box. We presume wrong. Even though keyword searching is beloved by many because it provides the illusion of working, it consistently fails.

There is, in fact, another method of finding information that is inherently contextual, and that educates the user contemporaneously with the discovery process. This method is called browsing. Wexis, through their user interfaces, encourage searching over browsing because they are profit centers whose essential product is the search. It is commonly assumed that their product is the database, i.e. the content, because they negotiate access to specific databases with their customers. And while some databases are worth more than others, they charge by the number of searches, not by the number of documents retrieved, not by the amount of content extracted. (This describes the transactional costs, which are probably most frequently employed. Of course, the per search charge varies by database. Users may alternatively choose to be charged by time instead.)

Therefore, their profits are maximized by creating a search product that is not too good and not too bad. They are, in fact, rewarded for their search mediocrity. If it is too good, users will find what they need too quickly, decreasing the number of searches and amount of time spent researching, and profits will decline. If it is too bad, users will get frustrated, complain, and, perhaps eventually, try a different vendor. Though with our current two-party system, there is little real choice for legal professionals who have sophisticated legal research needs not satisfied by the open access options available. (And then there is the distasteful possibility that law firms themselves want to keep legal research costs inflated to serve as their own profit centers.)

As such, Wexis will not be optimally motivated to improve their user interfaces and enhance the information-seeking process to increase efficiency for their customers. This leaves the door wide open for others in the online legal information ecology to innovate and force needed change, create a better product themselves, and apply pressure on the Ferraris and Lamborghinis of the legal world to do the same.

“A picture is worth a thousand words. An interface is worth a thousand pictures.”

— Quote reported identically by multiple users on Twitter from a recent talk by Dr. Ben Shneiderman at the #nycupa.

The time is ripe to create a new information discovery paradigm for legal materials based on semantics. Outside the legal world, advances are being made in more contextual information discovery platforms. Instead of a user issuing keywords and a computer server spitting back results, adjusting input via trial and error ad infinitum, graphic interfaces allow the user to comprehend and adjust their conception and results visually with related parameters. These interfaces encourage an environment where research is more like a conversation between the researcher and the data, rather than dueling monologues.

Lee Rainie, Director of the Pew Internet & American Life Project, recently discussed the emerging information literacies relevant to the evolving online ecology. These literacies should inform how search engines adapt themselves to human needs. Their application in the legal world is a natural fit. Four literacies most applicable to legal research include:

Graphic Literacy. People think visually and process data better with visual representations of information. Translation: make database interfaces and search results graphic.

Navigation Literacy. People have to maneuver online information in a disorganized nonlinear text screen. This creates comprehension and memory problems. We want our lawyers and legal researchers to have good comprehension and memory when serving clients.

Skepticism Literacy. Normally referring to basic critical thinking skills, this should apply to critically assessing user interfaces, particularly in a profit-seeking environment like Wexis where the interface can affect how and what you search, as well as your wallet.

Context Literacy. People need to see connections both between and within information in a hyperlinked environment. Simply providing hyperlinks is good, but graphically visualizing the connections is better.

Some subscription databases and internet search features serve these literacies well. Many of these are in early stages and not necessarily fit for legal research, but can give an idea of possibilities. I’ll discuss a few, and consider how these might apply in the legal context.

Google has recently re-released their wonder wheel, which helps users figure out what they are looking for. This is a frequent stumbling block for novices to legal research, and even for seasoned attorneys faced with a new subject. The researcher simply doesn’t know enough to know what exactly to look for. A tool like this helps the researcher find terms and concepts that they might not have otherwise considered (of course, secondary sources are excellent for this as well). Pictured here, the small faded hub at the bottom was for my original search of “legal research.” I then clicked on the “legal research methodology” spoke which expanded above the first wheel with different spokes and further ideas.

A common problem with keyword searching is finding the right words in the correct combination that exemplify a concept and are not over- or under-inclusive. Wexis offers thesauri which can be helpful, though they require actual searching to test. Some free sites, like PreCYdent, have this feature as well. They work to greater and lesser degrees. A recent search for “Title VII sex retaliation” resulted in a suggestion to also search for Title III, which is clearly not my intended subject. And while helpful, thesauri and other word and concept suggestors are still tied to the search paradigm which we want to move away from.

Factiva is a subscription database provider supplying news and business information. It provides a graphical “discovery pane” with “intelligent indexing” that clusters results by subjects related to search terms. This allows the user to select the most relevant results to their purpose. It also features word clouds (not pictured here) with text size indicating prominence of these terms in search results. Date graphs indicate when search results were published, so the user can visually assess when a topic is most frequently covered in the news.

Subject-based indexing is an excellent contextual tool to guide the user to relevant content without searching. Legal context literacy is supported by indexes to subject-based compilations, such as statutes and regulations. It’s great to have the full text of statutes available for free online, but some kind of subject-based entry port to that collection is needed to render it maximally useful. For databases like these, given the non-natural language used by legislators and lobbyists alike in constructing laws, keyword searching is frequently an inefficient and frustrating discovery method. Currently, Westlaw is the only legal information provider that provides online subject indexing to state and federal codes (though they like to hide that fact in their interface because their product is the search, not the content).

Weighting words, graphically represented by the size of the term, is another method users can employ to improve their results with keyword searching. Factiva uses weighted word clouds to indicate the frequency of terms in search results. SearchCloud allows users to manually weight search terms to indicate their importance within the search and adjusts results accordingly. For example, a researcher may need to find documents with five different words in them, but three are essential in symbolizing the idea sought, and the other two are needed, but not as important. As pictured here, I searched for copyright legal research guides, giving most importance to the words copyright and guide, and less to the words legal and research to ensure that I retrieved guides on copyright and not just any list of research guides that might mention copyright, and that it was in fact a legal research guide and not some other document that just mentions the word guide. Results were significantly more relevant here than the same un-weighted search on Google.
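To make the idea of manually weighted search terms concrete, here is a minimal Python sketch. It is not how SearchCloud works internally (that is not described here); it simply multiplies each term's frequency by a user-supplied weight. The function names, the weights, and the two sample documents are all hypothetical.

```python
import re
from collections import Counter

def weighted_score(text, weights):
    """Sum of (term weight x term frequency) over the weighted query terms."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(weight * counts[term] for term, weight in weights.items())

def rank(documents, weights):
    """Return document titles ordered from highest to lowest weighted score."""
    scored = [(weighted_score(body, weights), title) for title, body in documents.items()]
    return [title for score, title in sorted(scored, key=lambda pair: -pair[0])]

# Hypothetical weights: "copyright" and "guide" matter most, "legal" and "research" less.
weights = {"copyright": 3.0, "guide": 3.0, "legal": 1.0, "research": 1.0}

# Hypothetical documents, invented for illustration only.
documents = {
    "Copyright research guide": "a legal research guide to copyright law and copyright cases",
    "General research guides": "legal research guides: tax research, legal ethics research, copyright",
}

ranking = rank(documents, weights)
unweighted_ranking = rank(documents, {term: 1.0 for term in weights})
```

With these invented inputs, the weighted ranking puts the copyright guide first, while giving every term equal weight favors the general list of guides, which is the failure the weighting is meant to avoid.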

Weighted words can easily be employed in legal research. For example, with case law search results and citator reports, instead of a list of cases and other documents arranged either by date, jurisdiction, or algorithmic relevancy, citator information can be graphically indicated. Cases that are cited the most would appear near the top of the list in the largest fonts. Cases cited the least would appear in a smaller font at the bottom of the list. It adds immediate meaning-making visual cues to an otherwise non-contextual list, letting the researcher know at a glance which are the most important cases.
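As a rough illustration of that display, the following Python sketch scales a font size linearly with how often each case is cited and lists the most-cited cases first. The case names, citation counts, and point sizes are invented, and a real citator would supply the counts.

```python
def font_sizes(citation_counts, min_pt=10, max_pt=28):
    """Map each case to a font size scaled linearly between min_pt and max_pt."""
    low = min(citation_counts.values())
    high = max(citation_counts.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    sizes = {
        case: round(min_pt + (count - low) / span * (max_pt - min_pt))
        for case, count in citation_counts.items()
    }
    # Most-cited cases first, so the largest fonts appear at the top of the list.
    return sorted(sizes.items(), key=lambda item: -citation_counts[item[0]])

# Hypothetical citation counts for three cases.
counts = {"Case A": 120, "Case B": 15, "Case C": 0}
display = font_sizes(counts)

for case, size in display:
    print(f"{case}: {size}pt")
```

A linear scale is only one option; where a handful of cases attract most of the citations, a logarithmic scale might keep the rarely cited cases legible.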

It would be a boon to researchers if the connection between results was made apparent graphically. KartOO attempts this with their search engine which links various web pages in results with associated terms and similar pages. Mousing over links allows the user a preliminary peek at the search result to further determine its relevancy. The benefits to lawyers for this type of graphic display of search results for cases could be enormous. To be able to tell at a glance how a body of law is interconnected would give immeasurable context and meaning to what would otherwise be a simple list, each result visually disconnected from the other.

Some type of contextual map like the wonder wheel or a concept chart like KartOO, potentially combined with weighted words, could be employed that would illustrate the interconnectedness between all the cites to the case at issue, or to search results of cases. The biggest, most precedential, most frequently inter-cited cases would live near the center of the web with large hubs; less important cases would live at the peripheries. Most cases are never cited and are jurisprudentially less significant. This should be made clear through visual cues. Westlaw just launched something similar for patents.

These are just a few examples, based on developing technology, of how the legal search paradigm might develop. The beauty of our legal corpus is its fundamental interconnectedness. The web of cites within and between documents gives semantic developers a preconstructed map of relevancy and importance so that they need only create a way to symbolize that pattern graphically.

“Semantics rule, keywords drool.”

– Quote at twitter.com/scagliarini. See also http://www.expertsystem.net/blog/?p=68.

The future of legal information discovery interfaces combines searching and browsing, text and context, graphics and metadata. Because content without meaning thwarts understanding. Laws without context do not serve democracy. We need “interactive discovery.” Which is why search result lists are dead to me.

Julie Jones, formerly a librarian at the Cornell Law School, is the “rising” Associate Director for Library Services at the University of Connecticut Law School, beginning later this month. She received her J.D. from Northwestern University School of Law, M.L.I.S. from Dominican University, and B.A. from U.C. Santa Barbara.

VoxPopuLII is edited by Judith Pratt.


Authentication of Digital Repositories

When I first started off in the field of Internet publishing of legal materials, I briefly considered the topic of authenticity, and its importance to the end user. My conclusion back then rested on the simple consideration that since I was a librarian and was acting under the aegis of a university, I had no problem. In fact, my initial vision of the future of legal publishing was that as academic libraries strove to gain ownership and control over the content they needed in electronic form, they would naturally begin to harvest and preserve electronic documents. Authentication would be a natural product of everyone having a copy from a trusted source, because the only axe we have to grind is serving accurate information. Ubiquitous ownership from trustworthy sources.

Of course, I was new to the profession then, and very idealistic. I grossly underestimated the degree to which universities, and the librarians that serve them, would be willing to place their futures in the hands of a very small number of commercial vendors (See e.g. www.westlaw.com, www.lexis.com), who keep a very tight grip on their information, gradually reducing the librarian to the role of local administrator of other people’s information.

So much for us librarians. Even without us, however, end users still need trustworthy information. We are confronted with a new set of choices. On the one hand, there is expensive access to the collection of a large vendor, who has earned the trust of their users both through long tradition and by their sheer power over the market. On the other, there are court and government-based sources, which generally make efforts to avoid holding themselves out as either official or as reliably authenticated sources, and a number of smaller enterprises, offering lower cost or free access to materials that they harvest, link to, or generate from print themselves.

For the large publishers, authentication is not a serious issue. Their well-earned reputations for reliability are not seriously questioned by the market that buys their product. And, by all accounts, their editorial staffs ensure that this continues.

So, what about everyone else? In the instance of publishing court decisions, for example, Justia.com, Cornell’s LII, etc., collect their documents from the same “unofficial” court sources as the large publishers, but the perceived trustworthiness is not necessarily the same with some user communities. This is understandable, and, to a great extent, can only be addressed through the passing of time. The law is a conservative community when it comes to its information.

Along with this, I think it is also important to realize that this lack of trust has another, deeper component to it. I see signs of it when librarians and others insist on the importance of “official” and “authentic” information, while at the same time putting complete and unquestioned trust in the unofficial and unauthenticated offerings of traditional publishers.

Of course, a great deal of this has to do with the already-mentioned reputations of these publishers. But I think there is also a sense in which there has been a comfort in the role of publishers-as-gatekeepers that makes it easy to rely on their offerings, and which is missing from information that comes freely on the Internet.

In the context of scholarly journals, this has been discussed explicitly. In that case, the role of gatekeeper is easily defined in terms of the editorial boards that vet all submissions. In the case of things like court decisions, however, the role of the gatekeeper is not so clear, but the desire to have one seems to remain. In my experience, this has resulted in discussions about the possibility of errors and purposeful alterations in free Internet law sources that often seem odd and strangely overblown. They seem that way to us publishers, that is. (See, e.g. Here and Here for examples of the American Association of Law Libraries positions on authentication of digital resources.)

So, for me, the crux of any debate about authentication comes down to this disconnect between the perceptions and needs of the professional and librarian communities, and what most Internet law publishers do to produce accurate information for the public.

As I said earlier, time will play a role in this. The truly reliable will prove themselves to be such, and will survive. The extent to which the Cornell LII is already considered an authoritative source for the U.S. Supreme Court is good evidence of this. At the same time, there is much to be gained from taking some fairly easy and reasonable measures to demonstrate the accuracy of the documents being made available.

The utility of this goes beyond just building trust. The kind of information generated in authenticating a document is also important in the context of creating and maintaining durable electronic archives. As such, we should all look to implement these practices.

The first element of an authentication system is both obvious and easy to overlook: disclosure. An explanation of how a publisher obtains the material on offer, and how that material is handled, should be documented and available to prospective users. For the public, this explanation needs to be on a page on the website. It’s a perfect candidate for inclusion on an FAQ page. (Yes, even if no one has asked. I mean really, how many people really count questions received before creating their FAQs?) For the archive, it is essential that this information also be embedded in the document metadata. A simple Dublin Core source tag is a start, but something along the lines of the TEI header’s source and revision description elements is essential here (See http://www.tei-c.org/release/doc/tei-p5-doc/html/HD.html).

An explanation of the source of a document will show the user a chain of custody leading back to an original source. The next step is to do something to make that chain of custody verifiable.

At this point, things can either stay reasonable, or can spin off toward some expensive extremes, so let’s be clear about the ground rules. We are concerned with public-domain documents that are not going to be sold (so no money transfer is involved), and where no confidential information is being passed. For these reasons, site encryption and SSL certificates are overkill. We are not concerned with the transmission of the documents, only their preparation and maintenance. The need is for document-level verification. For that, the easy and clear answer is a digital signature.

At the GPO, the PDF versions of new legislative documents are being verified with a digital signature provided by GeoTrust CA and handled by Adobe, Inc. (See here for an example.) These are wonderful, and provide a high level of reliability. For the initial provider, they make a certain amount of sense. However, I question the need for an outside provider to certify the authenticity of a document that is being provided directly from GPO. Note what the certification really amounts to: an MD5 hash that has been verified and “certified” by a private company (GeoTrust). It’s nice because anyone can click on the GPO logo and see the certificate contents. The certificate itself, however, doesn’t do anything more than that. The important thing is the MD5 sum upon which the certificate relies.

In addition, the certificate is invalid as soon as any alterations whatsoever are made to the document. Again, this makes some sense, but does not address the need for, and utility of, adding value to the original document. Added value includes format conversion to HTML, XML or other useful formats, insertion of hypertext links, addition of significant metadata, etc.

The answer to this problem is to retain the MD5 hash, while dropping the certificate. The retained MD5 hash can still be used to demonstrate a chain of custody. For example, here at Rutgers-Camden, we collect N.J. Appeals decisions provided to us by the courts. As they are downloaded from the court’s server in MS Word format, we have started generating an MD5 hash of this original file. The document is converted to HTML with embedded metadata and hypertext links, but the MD5 hash of the original is included in the metadata. It can be compared to the original Word document on the court’s server to verify that what we got was exactly what they provided.
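For readers who want to try this, the following is a minimal sketch of hashing a downloaded file and checking it against a recorded value, using Python’s standard hashlib module. The function names are illustrative assumptions, not the scripts we actually run.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file's contents.

    Reading in chunks keeps memory use flat for large files.
    Only the bytes of the file are hashed, not its name or dates.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_original(path, recorded_md5):
    """True if the file at path still hashes to the recorded digest."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

The recorded digest would be the value stored in the document metadata; the path would point to a fresh download of the court’s original file.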

The next step is to produce an additional MD5 hash of the HTML file that we generated from the original. Of course, this can’t be embedded in the file, but it is retained in the database that has a copy of all the document metadata, and can be queried anytime needed. That, combined with an explanation of the revisions performed on the document, completes the chain of custody. As embedded in our documents, the revision notes are put in as an HTML’ized variation on the TEI revision description, and look like this:

<META NAME="revisionDate" CONTENT="Wed May 6 17:05:56 EDT 2009">
<META NAME="revisionDesc" CONTENT="Downloaded from NJAOC as MS Word
document, converted to HTML with wvHtml. Citations converted to
hypertext links.">
<META NAME="orig_md5" CONTENT="8cc57f9e06513fdd14534a2d54a91737">
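The bookkeeping behind those tags can be sketched roughly as follows. This is an illustration only: the META names come from the example above, but the function names, the doc_hashes table, and its columns are hypothetical, and SQLite simply stands in for whatever database holds the document metadata.

```python
import hashlib
import html
import sqlite3


def revision_meta(revision_date, revision_desc, orig_md5):
    """Build the three META tags shown above, escaping attribute values."""
    fields = [
        ("revisionDate", revision_date),
        ("revisionDesc", revision_desc),
        ("orig_md5", orig_md5),
    ]
    return "\n".join(
        '<META NAME="{}" CONTENT="{}">'.format(name, html.escape(value, quote=True))
        for name, value in fields
    )


def record_html_hash(conn, doc_id, html_bytes):
    """Hash the finished HTML file and store the digest outside the file.

    The table and column names are assumptions made for this sketch.
    """
    html_md5 = hashlib.md5(html_bytes).hexdigest()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_hashes "
        "(doc_id TEXT PRIMARY KEY, html_md5 TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO doc_hashes (doc_id, html_md5) VALUES (?, ?)",
        (doc_id, html_md5),
    )
    conn.commit()
    return html_md5
```

The hash of the HTML file has to live in the database rather than in the file, since writing it into the file would change the very bytes being hashed.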

Another possible method for doing this sort of thing would be the strategy suggested by Thomas Bruce and the Cornell LII. Instead of generating an original and subsequent MD5 hash, one might generate a single digital signature of the document’s text stream, stripped of all formatting, tagging, and graphics. The result should be an MD5 hash that would be the same for both the original document, and the processed document, no matter what the subsequent format alterations or other legitimate value-added tagging that were done.

The attraction of a single digital signature that would identify any accurate copy of a document is obvious, and may ultimately be the way to proceed. In order for it to work, however, things like minor inconsistencies in the treatment of “insignificant” white space (See. e.g. http://www.oracle.com/technology/pub/articles/wang-whitespace.html for an explanation), and the treatment of other odd things (like macro generated dates, etc. in MS Word), would have to be carefully accounted for and consistently treated.
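To make the idea concrete, here is a minimal sketch of a text-stream hash using Python’s standard library. The normalization rule (collapse every run of white space to a single space) and the function names are assumptions of mine, not the LII proposal itself.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect character data only, skipping script and style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def normalized_text(text):
    """Collapse every run of white space to one space and trim the ends."""
    return " ".join(text.split())


def text_stream_md5(text):
    """MD5 hex digest of the normalized text, encoded as UTF-8."""
    return hashlib.md5(normalized_text(text).encode("utf-8")).hexdigest()


def html_text_stream_md5(markup):
    """Strip the tags from an HTML document, then hash its text stream."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return text_stream_md5("".join(collector.parts))
```

Under these rules a plain-text copy and a tagged HTML copy of the same decision hash to the same value, provided the HTML source has white space wherever the plain text does. Paragraph tags written with nothing between them would run two words together, which is exactly the kind of inconsistency that would have to be settled before relying on such a signature.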

Finally, I don’t think any discussion of authenticity and reliability of legal information on the Internet should leave out a point I hinted at in the beginning of this piece. In the long run, information does not, and will not, survive without widespread distribution. In this time of cheap disk space and fast Internet connections, we have the unprecedented opportunity to preserve information better than ever before, through widespread distribution. Shared and mirrored repositories among numbers of educational and other institutions would be a force for enormous good. Imagine an institution recovering from the catastrophic loss of their collections by merely re-downloading from any of hundreds of sister institutions. Such a thing is possible now.

In such an environment, with many sources and repositories easily accessible, all of which were in the business only of maintaining accurate information, reliable data would tend to become the market norm. You simply could not maintain a credible repository that contained errors, either intentional or accidental, in a world where there are many accurate repositories openly available.
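One way such cross-checking might work is sketched below: given the hash each repository reports for the same document, find the value most of them agree on and flag the holdouts. The function name and the repository labels in the usage note are hypothetical, and agreement among mirrors is only evidence of accuracy, not proof.

```python
from collections import Counter


def find_outliers(digests_by_mirror):
    """Return (consensus_digest, mirrors_that_disagree).

    digests_by_mirror maps a repository name to the hash it reports
    for one document. The consensus is simply the most common value;
    on a tie, the value seen first wins.
    """
    if not digests_by_mirror:
        return None, []
    counts = Counter(digests_by_mirror.values())
    consensus, _ = counts.most_common(1)[0]
    outliers = sorted(
        mirror
        for mirror, digest in digests_by_mirror.items()
        if digest != consensus
    )
    return consensus, outliers
```

Given reports from three repositories labeled, say, "mirror-a", "mirror-b" and "mirror-c", where only the third differs, the function returns the shared hash and a one-item list naming "mirror-c".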

Widespread distribution, along with things like the above suggestions, is the key to a reliable and durable information infrastructure. Each of us who would publish and maintain a digital repository needs to take steps to ensure that our information is verifiably authentic. And then, we need to hope that sooner or later, we will be joined by others.

There. I am still a naive optimist.


Footnote: MD5 (Message Digest 5) is a cryptographic hash function that produces a 128-bit value. It is officially defined by the IETF as RFC 1321. It is widely used to check the integrity of files, and (along with SHA1 hashes) is often used as the basis for creating digital signatures. (For example, take a look at the digital signatures that underlie the GPO’s certificate in the example above. It relies on an MD5 hash and an SHA1 hash.) Practically, an MD5 hash is this: the result of a complex calculation run against the content of a computer file, designed to generate a 32-character hexadecimal string that is, for practical purposes, unique to that content, and which would look something like this: 8cc57f9e06513fdd14534a2d54a91737. Changing the file name, the creation date, etc. will not alter this signature. However, any alteration to the actual contents of the file will cause the result of the calculation to change. Add a space, get an entirely different result. Although MD5 hashes are not impossible to duplicate (See, e.g., Wikipedia’s article on MD5 hashes), it does require significant effort. So, while the digital signature on the electronic transmission of a multi-million dollar contract would require greater security, MD5 hashes are very suitable for authentication in the context of digital libraries. One of the main reasons MD5 hashes are widely used is that they are fast and fairly easy to generate (standard Unix and Linux distributions have MD5 generator programs installed by default).
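The properties described in the footnote are easy to check. The short sketch below uses Python’s standard hashlib module; the helper name and the sample sentence are made up for the illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# The same sentence, with and without one extra space.
original = md5_hex(b"The judgment is affirmed.")
with_extra_space = md5_hex(b"The judgment is  affirmed.")

# 128 bits written in hexadecimal is 32 characters.
digest_length = len(original)

# A single added space produces a different digest.
digests_differ = original != with_extra_space
```

Because only the bytes passed in are hashed, a file’s name and dates never enter the calculation.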

John Joergensen is a reference librarian at Rutgers-Camden. Mr. Joergensen received B.A. and M.A. degrees from Fordham University, J.D. from Temple University, and M.S. (LIS) from Drexel University. Mr. Joergensen is publisher of the New Jersey Courtweb Project, publishing the decisions of the N.J. state appellate courts, Tax court, Administrative law decisions, U.S. District Court for the District of New Jersey decisions, and the N.J. Supreme Court’s Ethics opinions on the Internet.

VoxPopuLII is edited by Judith Pratt.
