


Authentication of Digital Repositories

When I first started off in the field of Internet publishing of legal materials, I briefly considered the topic of authenticity, and its importance to the end user. My conclusion back then rested on the simple consideration that since I was a librarian and was acting under the aegis of a university, I had no problem. In fact, my initial vision of the future of legal publishing was that as academic libraries strove to gain ownership and control over the content they needed in electronic form, they would naturally begin to harvest and preserve electronic documents. Authentication would be a natural product of everyone having a copy from a trusted source, because the only axe we have to grind is serving accurate information. Ubiquitous ownership from trustworthy sources.

Of course, I was new to the profession then, and very idealistic. I grossly underestimated the degree to which universities, and the librarians who serve them, would be willing to place their futures in the hands of a very small number of commercial vendors (see, e.g., www.westlaw.com, www.lexis.com), who keep a very tight grip on their information, gradually reducing the librarian to the role of local administrator of other people’s information.

So much for us librarians. Even without us, however, end users still need trustworthy information. We are confronted with a new set of choices. On the one hand, there is expensive access to the collection of a large vendor, which has earned the trust of its users both through long tradition and through sheer power over the market. On the other, there are court and government-based sources, which generally make efforts to avoid holding themselves out as either official or reliably authenticated sources, and a number of smaller enterprises, offering lower-cost or free access to materials that they harvest, link to, or generate from print themselves.

For the large publishers, authentication is not a serious issue. Their well-earned reputations for reliability are not seriously questioned by the market that buys their product. And, by all accounts, their editorial staffs ensure that this continues.

So, what about everyone else? In the instance of publishing court decisions, for example, Justia.com, Cornell’s LII, and others collect their documents from the same “unofficial” court sources as the large publishers, but the perceived trustworthiness is not necessarily the same with some user communities. This is understandable, and, to a great extent, can only be addressed through the passing of time. The law is a conservative community when it comes to its information.

Along with this, I think it is also important to realize that this lack of trust has another, deeper component to it. I see signs of it when librarians and others insist on the importance of “official” and “authentic” information, while at the same time putting complete and unquestioned trust in the unofficial and unauthenticated offerings of traditional publishers.

Of course, a great deal of this has to do with the already-mentioned reputations of these publishers. But I think there is also a sense in which there has been a comfort in the role of publishers-as-gatekeepers that makes it easy to rely on their offerings, and which is missing from information that comes freely on the Internet.

In the context of scholarly journals, this has been discussed explicitly. In that case, the role of gatekeeper is easily defined in terms of the editorial boards that vet all submissions. In the case of things like court decisions, however, the role of the gatekeeper is not so clear, but the desire to have one seems to remain. In my experience, this has resulted in discussions about the possibility of errors and purposeful alterations in free Internet law sources that often seem odd and strangely overblown. They seem that way to us publishers, that is. (See, e.g. Here and Here for examples of the American Association of Law Libraries positions on authentication of digital resources.)

So, for me, the crux of any debate about authentication comes down to this disconnect between the perceptions and needs of the professional and librarian communities, and what most Internet law publishers do to produce accurate information for the public.

As I said earlier, time will play a role in this. The truly reliable will prove themselves to be such, and will survive. The extent to which the Cornell LII is already considered an authoritative source for the U.S. Supreme Court is good evidence of this. At the same time, there is much to be gained from taking some fairly easy and reasonable measures to demonstrate the accuracy of the documents being made available.

The utility of this goes beyond just building trust. The kind of information generated in authenticating a document is also important in the context of creating and maintaining durable electronic archives. As such, we should all look to implement these practices.

The first element of an authentication system is both obvious and easy to overlook: disclosure. An explanation of how a publisher obtains the material on offer, and how that material is handled, should be documented and available to prospective users. For the public, this explanation needs to be on a page on the website. It’s a perfect candidate for inclusion on an FAQ page. (Yes, even if no one has asked. I mean really, how many people actually count questions received before creating their FAQs?) For the archive, it is essential that this information also be embedded in the document metadata. A simple Dublin Core source tag is a start, but something along the lines of the TEI header tags is essential here (see http://www.tei-c.org/release/doc/tei-p5-doc/html/HD.html).

An explanation of the source of a document will show the user a chain of custody leading back to an original source. The next step is to do something to make that chain of custody verifiable.

At this point, things can either stay reasonable, or can spin off toward some expensive extremes, so let’s be clear about the ground rules. We are concerned with public-domain documents that are not going to be sold (so no money transfer is involved), and where no confidential information is being passed. For these reasons, site encryption and SSL certificates are overkill. We are not concerned with the transmission of the documents, only their preparation and maintenance. The need is for document-level verification. For that, the easy and clear answer is a digital signature.

At the GPO, the PDF versions of new legislative documents are being verified with a digital signature provided by GeoTrust CA and handled by Adobe, Inc. (See here for an example.) These are wonderful, and provide a high level of reliability. For the initial provider, they make a certain amount of sense. However, I question the need for an outside provider to certify the authenticity of a document that is being provided directly from GPO. Note what the certification really amounts to: an MD5 hash that has been verified and “certified” by a private company (GeoTrust). It’s nice because anyone can click on the GPO logo and see the certificate contents. The certificate itself, however, doesn’t do anything more than that. The important thing is the MD5 sum upon which the certificate relies.

In addition, the certificate is invalid as soon as any alterations whatsoever are made to the document. Again, this makes some sense, but it does not address the need for, and utility of, adding value to the original document. Added value includes format conversion to HTML, XML or other useful formats, insertion of hypertext links, addition of significant metadata, etc.

The answer to this problem is to retain the MD5 hash, while dropping the certificate. The retained MD5 hash can still be used to demonstrate a chain of custody. For example, here at Rutgers-Camden, we collect N.J. Appeals decisions provided to us by the courts. As they are downloaded from the court’s server in MS Word format, we have started generating an MD5 hash of this original file. The document is converted to HTML with embedded metadata and hypertext links, but the MD5 hash of the original is included in the metadata. It can be compared to the original Word document on the court’s server to verify that what we got was exactly what they provided.
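The first step, fingerprinting the original file exactly as it was downloaded, can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the script we actually run; the function names md5_of_file and matches_original are made up for the example.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_original(path: str, recorded_md5: str) -> bool:
    """Check a file against the hash recorded when it was first collected."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

Anyone holding a copy of the court's Word file could run matches_original against the hash published in our metadata and confirm that the two files are byte-for-byte identical.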

The next step is to produce an additional MD5 hash of the HTML file that we generated from the original. Of course, this can’t be embedded in the file, but it is retained in the database that has a copy of all the document metadata, and can be queried anytime needed. That, combined with an explanation of the revisions performed on the document, completes the chain of custody. As embedded in our documents, the revision notes are put in as an HTML’ized variation on the TEI revision description, and look like this:

<META NAME="revisionDate" CONTENT="Wed May 6 17:05:56 EDT 2009">
<META NAME="revisionDesc" CONTENT="Downloaded from NJAOC as MS Word
document, converted to HTML with wvHtml. Citations converted to
hypertext links.">
<META NAME="orig_md5" CONTENT="8cc57f9e06513fdd14534a2d54a91737">
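For anyone who wants to try this, here is a rough sketch of how the tags above might be generated and how the hash of the finished HTML file might be kept in a database. The table name custody, its columns, and both function names are assumptions made for the illustration; they do not describe our actual database.

```python
import hashlib
import html
import sqlite3


def revision_meta_tags(revision_date: str, revision_desc: str, orig_md5: str) -> str:
    """Build the three META tags that carry the revision notes."""
    fields = [
        ("revisionDate", revision_date),
        ("revisionDesc", revision_desc),
        ("orig_md5", orig_md5),
    ]
    # Escape values so quotes or angle brackets cannot break the attribute.
    return "\n".join(
        '<META NAME="{}" CONTENT="{}">'.format(name, html.escape(value, quote=True))
        for name, value in fields
    )


def record_custody(db: sqlite3.Connection, doc_id: str,
                   orig_md5: str, html_bytes: bytes) -> str:
    """Store the original hash and the hash of the generated HTML file."""
    html_md5 = hashlib.md5(html_bytes).hexdigest()
    # Hypothetical schema: one row per document.
    db.execute(
        "CREATE TABLE IF NOT EXISTS custody "
        "(doc_id TEXT PRIMARY KEY, orig_md5 TEXT, html_md5 TEXT)"
    )
    db.execute(
        "INSERT OR REPLACE INTO custody VALUES (?, ?, ?)",
        (doc_id, orig_md5, html_md5),
    )
    db.commit()
    return html_md5
```

The hash of the HTML file has to live outside the file, because writing it into the file would change the very contents being hashed.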

Another possible method for doing this sort of thing would be the strategy suggested by Thomas Bruce and the Cornell LII. Instead of generating an original and subsequent MD5 hash, one might generate a single digital signature of the document’s text stream, stripped of all formatting, tagging, and graphics. The result should be an MD5 hash that would be the same for both the original document and the processed document, no matter what subsequent format alterations or other legitimate value-added tagging was done.

The attraction of a single digital signature that would identify any accurate copy of a document is obvious, and may ultimately be the way to proceed. In order for it to work, however, things like minor inconsistencies in the treatment of “insignificant” white space (see, e.g., http://www.oracle.com/technology/pub/articles/wang-whitespace.html for an explanation), and the treatment of other odd things (like macro-generated dates, etc. in MS Word), would have to be carefully accounted for and consistently treated.
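To make the idea concrete, here is a minimal sketch of a text-stream hash in Python. It is my own illustration, not the LII’s proposal: the rule it uses for white space (collapse every run to a single space) is just one assumed convention, and the names text_stream and text_md5 are invented for the example.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the character data, ignoring all tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_stream(markup: str) -> str:
    """Strip tagging, then collapse every run of white space to one space."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.parts).split())


def text_md5(markup: str) -> str:
    """Hash the normalized text stream rather than the file itself."""
    return hashlib.md5(text_stream(markup).encode("utf-8")).hexdigest()
```

Even this toy version shows the difficulty. Two paragraphs written as adjacent tags with no space between them would run together, and the contents of script or style elements would be counted as text, so every publisher would have to agree on exactly the same rules before the hashes could be compared.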

Finally, I don’t think any discussion of authenticity and reliability of legal information on the Internet should leave out a point I hinted at near the beginning of this piece. In the long run, information does not, and will not, survive without widespread distribution. In this time of cheap disk space and fast Internet connections, we have the unprecedented opportunity to preserve information better than ever before, through widespread distribution. Shared and mirrored repositories among numbers of educational and other institutions would be a force for enormous good. Imagine an institution recovering from the catastrophic loss of its collections by merely re-downloading from any of hundreds of sister institutions. Such a thing is possible now.

In such an environment, with many sources and repositories easily accessible, all of which were in the business only of maintaining accurate information, reliable data would tend to become the market norm. You simply could not maintain a credible repository that contained errors, either intentional or accidental, in a world where there are many accurate repositories openly available.

Widespread distribution, along with things like the above suggestions, is the key to a reliable and durable information infrastructure. Each of us who would publish and maintain a digital repository needs to take steps to ensure that our information is verifiably authentic. And then, we need to hope that sooner or later, we will be joined by others.

There. I am still a naive optimist.


Footnote: MD5 (Message Digest 5) is a cryptographic hash function that produces a 128-bit hash. It is officially defined by the IETF in RFC 1321. It is widely used to check the integrity of files, and (along with SHA1 hashes) is often used as the basis for creating digital signatures. (For example, take a look at the digital signatures that underlie the GPO’s certificate in the example above. It relies on an MD5 hash and an SHA1 hash.) Practically, an MD5 hash is this: the result of a complex calculation run against the content of a computer file, designed to generate a unique 32-character hexadecimal string, which would look something like this: 8cc57f9e06513fdd14534a2d54a91737. Changing the file name, the creation date, etc. will not alter this signature. However, any alteration to the actual contents of the file will cause the result of the calculation to change. Add a space, get an entirely different result. Although MD5 hashes are not impossible to duplicate (see, e.g., Wikipedia’s article on MD5 hashes), it does require significant effort. So, while the digital signature on the electronic transmission of a multi-million dollar contract would require greater security, MD5 hashes are very suitable for authentication in the context of digital libraries. One of the main reasons MD5 hashes are widely used is that they are fast and fairly easy to generate (standard Unix and Linux distributions have MD5 generator programs installed by default).
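A few lines of Python are enough to see these properties for yourself. The sample sentence and the helper name md5_hex are invented for the illustration; the only real library call is hashlib.md5 from the Python standard library.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of some bytes as a hexadecimal string."""
    return hashlib.md5(content).hexdigest()


# Invented sample text standing in for the contents of a file.
original = b"The judgment of the Appellate Division is affirmed."
altered = original + b" "  # the same text with one trailing space added

print(len(md5_hex(original)))                  # 32 hex characters, i.e. 128 bits
print(md5_hex(original) == md5_hex(altered))   # False: one space changes the hash
```

Because only the bytes of the content go into the calculation, the file name and dates never enter into it.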

John Joergensen is a reference librarian at Rutgers-Camden. Mr. Joergensen received B.A. and M.A. degrees from Fordham University, a J.D. from Temple University, and an M.S. (LIS) from Drexel University. Mr. Joergensen is publisher of the New Jersey Courtweb Project, publishing the decisions of the N.J. state appellate courts, Tax Court, Administrative Law decisions, U.S. District Court for the District of New Jersey decisions, and the N.J. Supreme Court’s Ethics opinions on the Internet.

VoxPopuLII is edited by Judith Pratt.


9 Responses to “Authentication of Digital Repositories”


  1. 1 frank_bennett May 15, 2009 at 5:01 am

    Ruminating on this issue of authentication together with Tom Bruce’s recent “Two really simple ideas” post, and turning the latter on its head a bit, I wonder whether no electronic copy of primary legal text should be considered authentic unless (a) it can be walked back to the original (checksum accurate) version using a publicly accessible tool, and (b) that original source text is freely and publicly available.

