When I first started off in the field of Internet publishing of legal materials, I did briefly consider the topic of authenticity and its importance to the end user. My conclusion back then rested on the simple consideration that since I was a librarian and was acting under the aegis of a university, I had no problem. In fact, my initial vision of the future of legal publishing was that as academic libraries strove to gain ownership and control over the content they needed in electronic form, they would naturally begin to harvest and preserve electronic documents. Authentication would be a natural product of everyone having a copy from a trusted source, because the only ax we have to grind is serving accurate information. Ubiquitous ownership from trustworthy sources.
Of course, I was new to the profession then, and very idealistic. I grossly underestimated the degree to which universities, and the librarians that serve them, would be willing to place their futures in the hands of a very small number of commercial vendors, who keep a very tight grip on their information. This gradually reduces the librarian to the role of local administrator of other people’s information.
So much for us librarians. Even without us, however, end users still need trustworthy information. We are confronted with a new set of choices. On the one hand, there is expensive access to the collection of a large vendor, who has earned the trust of their users through both long tradition and sheer power over the market. On the other, there are court and government-based sources, which generally make efforts to avoid holding themselves out as either official or as reliably authenticated sources, and a number of smaller enterprises, which offer lower cost or free access to materials that they harvest, link to, or generate from print themselves.
For the large publishers, authentication is not a serious issue. Their well-earned reputations for reliability are not seriously questioned by the market that buys their product. And, by all accounts, their editorial staffs ensure this continues.
So, what about everyone else? In the instance of publishing court decisions, for example, Justia, Cornell’s LII, etc., collect their documents from the same “unofficial” court sources as the large publishers, but the perceived trustworthiness is not necessarily the same with some user communities. This is understandable, and, to a great extent, can only be addressed through the passing of time. The law is a conservative community when it comes to its information.
Along with this, I think it also important to realize that this lack of trust has another, deeper component to it. I see signs of it when librarians and others insist on the importance of “official” and “authentic” information, while at the same time putting complete and unquestioned trust in the unofficial and unauthenticated offerings of traditional publishers.
Of course, a great deal of this has to do with the already mentioned reputations of these publishers. But I think there is also a sense in which there has been a comfort in the role of publishers as gatekeepers that makes it easy to rely on their offerings, and which is missing from information that comes freely on the Internet.
In the context of scholarly journals, this has been discussed explicitly. In that case, the role of gatekeeper is easily defined in terms of the editorial boards that vet all submissions. In the case of things like court decisions, however, the role of the gatekeeper is not so clear, but the desire to have one seems to remain. The result is discussions about the possibility of errors and purposeful alterations in free Internet law sources that often seem odd and strangely overblown. They seem that way to us publishers, that is.
So, for me, the crux of any debate about authentication comes down to this disconnect between the perceptions and needs of the professional and librarian communities, and what most Internet law publishers do to produce accurate information for the public.
As I said earlier, time will play a role in this. The truly reliable will prove themselves to be such, and will survive. The extent to which the Cornell LII is already considered an authoritative source for the U.S. Supreme Court is good evidence of this. At the same time, there is much to be gained from taking some fairly easy and reasonable measures to demonstrate the accuracy of the documents being made available.
The utility of this goes beyond just building trust. The kind of information generated in authenticating a document is also important in the context of creating and maintaining durable electronic archives. As such, we should all look to implement these practices.
The first element of an authentication system is both obvious and easy to overlook: disclosure. How a publisher obtains the material on offer, and how that material is handled, should be documented and available to prospective users. For the public, this explanation needs to be on a page on the website. It’s a perfect candidate for inclusion on a FAQ page. (Yes, even if no one has asked. I mean, how many people really count questions received before creating their FAQs?) For the archive, it is essential that this information also be embedded in the document metadata. A simple Dublin Core source tag is a start, but something along the lines of the TEI <sourceDesc> and <revisionDesc> tags is essential here (see http://www.tei-c.org/release/doc/tei-p5-doc/html/HD.html).
An explanation of the source of a document will show the user a chain of custody leading back to an original source. The next step is to do something to make that chain of custody verifiable.
It is at this point that things can either stay reasonable, or can spin off toward some expensive extremes, so let’s be clear about the ground rules. We are concerned with public-domain documents, which are not going to be sold (so no money transfer is involved), and where no confidential information is being passed. For these reasons, site encryption and SSL certificates are overkill. We aren’t concerned with the transmission of the documents, only their preparation and maintenance. The need is for document-level verification. For that, the easy and clear answer is a digital signature.
At the Government Printing Office (GPO), the PDF versions of new legislative documents are being verified with a digital signature provided by GeoTrust CA and handled by Adobe, Inc. These are wonderful, and provide a high level of reliability. For the initial provider, they make a certain amount of sense.
However, I question the need for an outside provider to certify the authenticity of a document that is being provided directly from GPO. Note what the certification really amounts to: an MD5 hash that has been verified and “certified” by GeoTrust. It’s nice because anyone can click on the GPO logo and see the certificate contents. The certificate itself, however, doesn’t do anything more than that. The important thing is the MD5 sum upon which the certificate relies.
In addition, the certificate is invalid as soon as any alterations whatsoever are made to the document. Again, this makes some sense, but it does not accommodate the legitimate need to add value to the original document through format conversion to HTML, XML, or other useful formats, insertion of hypertext links, addition of significant metadata, and so on.
The answer to this problem is to retain the MD5 sum, while dropping the certificate. The retained MD5 sum can still be used to demonstrate a chain of custody. For example, here at Rutgers–Camden, we collect New Jersey Appeals decisions provided to us by the courts. As they are downloaded from the court’s server in MS Word format, we have started generating an MD5 sum of this original file. The document is converted to HTML with embedded metadata and hypertext links, but the MD5 sum of the original is included in the metadata. It can be compared to the original Word document on the court’s server to verify that what we got was exactly what they provided.
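The first link in that chain can be sketched in a few lines of Python. This is only an illustration of the idea, not the scripts we actually run: the function names are hypothetical, and it assumes the original file has already been downloaded to local disk.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 sum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large documents need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_sum(path, recorded_md5):
    """Check a freshly downloaded original against a recorded MD5 sum."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

Anyone holding the processed HTML could fetch the Word file from the court again and pass it, together with the `orig_md5` value from the metadata, to `matches_recorded_sum`. A `True` result indicates that the file now on the court's server is byte-for-byte the one that was originally collected.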
The next step is to generate an additional MD5 sum of the HTML file that we generated from the original. Of course, this can’t be embedded in the file, but it is retained in the database that has a copy of all the document metadata, and can be queried anytime needed. That, combined with an explanation of the revisions performed on the document, completes the chain of custody. As embedded in our documents, the revision notes are put in as an HTML-ized variation on the TEI revision description, and look like this:
<META NAME="revisionDate" CONTENT="Wed May 6 17:05:56 EDT 2009">
<META NAME="revisionDesc" CONTENT="Downloaded from NJAOC as MS Word document, converted to HTML with wvHtml. Citations converted to hypertext links.">
<META NAME="orig_md5" CONTENT="8cc57f9e06513fdd14534a2d54a91737">
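For readers who want to try this, here is a minimal Python sketch of the second step. It is an assumption-laden illustration rather than a description of our production system: the `custody` table, its columns, and the function names are all hypothetical, and SQLite stands in for whatever database actually holds the metadata.

```python
import hashlib
import sqlite3
from html import escape


def revision_meta_tags(revision_date, revision_desc, orig_md5):
    """Build the three META lines that get embedded in the HTML file."""
    fields = [
        ("revisionDate", revision_date),
        ("revisionDesc", revision_desc),
        ("orig_md5", orig_md5),
    ]
    # Escape the values so quotes in a description cannot break the tag.
    return "\n".join(
        '<META NAME="%s" CONTENT="%s">' % (name, escape(value, quote=True))
        for name, value in fields
    )


def record_custody(conn, doc_id, orig_md5, html_bytes):
    """Store the MD5 sums of the original and of the finished HTML file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS custody ("
        "doc_id TEXT PRIMARY KEY, orig_md5 TEXT NOT NULL, html_md5 TEXT NOT NULL)"
    )
    # The HTML sum is taken over the finished file, so it lives only here.
    html_md5 = hashlib.md5(html_bytes).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO custody VALUES (?, ?, ?)",
        (doc_id, orig_md5, html_md5),
    )
    conn.commit()
    return html_md5
```

The ordering matters: the META lines are written into the HTML first, and only then is the finished file hashed and recorded, which is why the second sum has to be kept outside the document.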
Another possible method for doing this sort of thing would be the strategy suggested by Thomas Bruce and the Cornell LII. Instead of generating an original and subsequent MD5 sum, one might generate a single digital signature of the document’s text stream, stripped of all formatting, tagging, and graphics. The result should be an MD5 sum that would be the same for both the original document and the processed document, no matter what subsequent format alterations or other legitimate value-added tagging had been done.
The attraction of a single digital signature that would identify any accurate copy of a document is obvious, and may ultimately be the way to proceed. In order for it to work, however, things like minor inconsistencies in the treatment of “insignificant” whitespace (See http://www.oracle.com/technology/pub/articles/wang-whitespace.html for an explanation), and the treatment of other odd things (like macro-generated dates, etc. in MS Word), would have to be carefully accounted for and consistently treated.
Finally, I don’t think any discussion of authenticity and reliability of legal information on the Internet should leave out a point I hinted at earlier in this article. In the long run, information does not, and will not, survive without widespread distribution. In this time of cheap disk space and fast Internet connections, we have the unprecedented opportunity to preserve information through widespread distribution. Shared and mirrored repositories among numbers of educational and other institutions would be a force for enormous good. Imagine an institution recovering from the catastrophic loss of its collections by merely re-downloading from any of hundreds of sister institutions. Such a thing is possible now.
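To make the difficulty concrete, here is a rough Python sketch of a text-stream hash. It is my own illustration, not the LII's method, and it makes one blunt assumption to sidestep the whitespace problem: all whitespace is discarded before hashing. The function and class names are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only character data, discarding every tag and attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_stream_md5(text, is_html=False):
    """Hash the bare text of a document, ignoring markup and whitespace."""
    if is_html:
        collector = _TextCollector()
        collector.feed(text)
        collector.close()
        text = "".join(collector.parts)
    # Assumption: removing ALL whitespace is the consistent treatment chosen.
    normalized = re.sub(r"\s+", "", text)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

With this treatment, a plain-text copy and an HTML copy with added links produce the same sum as long as the words themselves are unchanged. The cost is that word boundaries no longer count, so "no one" and "noone" hash alike, and text inside script or style elements would be hashed along with everything else. A real scheme would have to settle those questions, as well as Word's generated fields, before its sums could be trusted.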
In such an environment, with many sources and repositories easily accessible, all of which are in the business only of maintaining accurate information, reliable data would tend to become the market norm. You simply could not maintain a credible repository that contained errors, either intentional or accidental, in a world where there are many accurate repositories openly available.
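As a thought experiment, the cross-checking that such an environment makes possible could be as simple as the following Python sketch. Everything in it is hypothetical: it assumes each repository publishes a checksum for the same document computed in the same way, and it treats a strict majority as the consensus.

```python
from collections import Counter


def find_outliers(digests_by_repository):
    """Return (consensus_digest, outlier_names) for one document.

    digests_by_repository maps a repository name to the checksum it
    reports. If no digest is held by a strict majority, the consensus
    is None and every repository is listed as unresolved.
    """
    if not digests_by_repository:
        return None, []
    counts = Counter(digests_by_repository.values())
    consensus, votes = counts.most_common(1)[0]
    if votes <= len(digests_by_repository) / 2:
        return None, sorted(digests_by_repository)
    outliers = sorted(
        name
        for name, digest in digests_by_repository.items()
        if digest != consensus
    )
    return consensus, outliers
```

A repository that shows up as an outlier is not necessarily the one in error, but it is the one that has some explaining to do, which is the market pressure described above.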
Widespread distribution, along with measures like those suggested above, is the key to a reliable and durable information infrastructure. Each of us who would publish and maintain a digital repository needs to take steps to ensure that our information is verifiably authentic. And then, we need to hope that sooner or later, we will be joined by others.
There. I am still a naive optimist.
John Joergensen is the digital services librarian at Rutgers University
School of Law – Camden. He publishes a wide range of New Jersey primary
source materials, and is digitizing the library's collection of
congressional documents. These are available at