{"id":119,"date":"2009-05-14T18:02:46","date_gmt":"2009-05-14T23:02:46","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/2009\/05\/14\/authentication-of-digital-repositories\/"},"modified":"2010-01-19T18:08:19","modified_gmt":"2010-01-19T23:08:19","slug":"authentication-of-digital-repositories","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2009\/05\/14\/authentication-of-digital-repositories\/","title":{"rendered":"Authentication of Digital Repositories"},"content":{"rendered":"

When I first started off in the field of Internet publishing of legal materials, I did briefly consider the topic of authenticity and its importance to the end user.\u00a0 My conclusion back then rested on the simple consideration that since I was a librarian and was acting under the aegis of a university, I had no problem.\u00a0 In fact, my initial vision of the future of legal publishing was that as academic libraries strove to gain ownership and control over the content they needed in electronic form, they would naturally begin to harvest and preserve electronic documents.\u00a0 Authentication would be a natural product of everyone having a copy from a trusted source, because the only ax we have to grind is serving accurate information.\u00a0 Ubiquitous ownership from trustworthy sources.<\/p>\n

Of course, I was new to the profession then, and very idealistic.\u00a0 I grossly underestimated the degree to which universities, and the librarians who serve them, would be willing to place their futures in the hands of a very small number of commercial vendors, who keep a very tight grip on their information. This gradually reduces the librarian to the role of local administrator of other people’s information.<\/p>\n

So much for us librarians.\u00a0 Even without us, however, end users still need trustworthy information.\u00a0 We are confronted with a new set of choices.\u00a0 On the one hand, there is expensive access to the collection of a large vendor, who has earned the trust of their users through both long tradition and sheer power over the market.\u00a0 On the other, there are court and government-based sources, which generally make efforts to avoid holding themselves out as either official or reliably authenticated sources, and a number of smaller enterprises, which offer lower cost or free access to materials that they harvest, link to, or generate from print themselves.<\/p>\n

For the large publishers, authentication is not a serious issue.\u00a0 Their well-earned reputations for reliability are not seriously questioned by the market that buys their product.\u00a0 And, by all accounts, their editorial staffs ensure this continues.<\/p>\n

So, what about everyone else?\u00a0 In the instance of publishing court decisions, for example, Justia,<\/a> Cornell’s LII<\/a>, etc.,\u00a0 collect their documents from the same \u201cunofficial\u201d court sources as the large publishers, but the perceived trustworthiness is not necessarily the same with some user communities.\u00a0 This is understandable, and, to a great extent, can only be addressed through the passing of time.\u00a0 The law is a conservative community when it comes to its information.<\/p>\n

Along with this, I think it is also important to realize that this lack of trust has another, deeper component to it.\u00a0 I see signs of it when librarians and others insist on the importance of \u201cofficial\u201d and \u201cauthentic\u201d information, while at the same time putting complete and unquestioned trust in the unofficial and unauthenticated offerings of traditional publishers.
\nOf course, a great deal of this has to do with the already mentioned reputations of these publishers.\u00a0 But I think there is also a sense in which there has been a comfort in the role of publishers as gatekeepers that makes it easy to rely on their offerings, and which is missing from information that comes freely on the Internet.<\/p>\n

In the context of scholarly journals, this has been discussed explicitly.\u00a0 In that case, the role of gatekeeper is easily defined in terms of the editorial boards that vet all submissions.\u00a0 In the case of things like court decisions, however, the role of the gatekeeper is not so clear, but the desire to have one seems to remain.\u00a0 The result is discussions about the possibility of errors and purposeful alterations in free Internet law sources that often seem odd and strangely overblown. They seem that way to us publishers, that is.<\/p>\n

So, for me, the crux of any debate about authentication comes down to this disconnect between the perceptions and needs of the professional and librarian communities, and what most Internet law publishers do to produce accurate information for the public.<\/p>\n

As I said earlier, time will play a role in this.\u00a0 The truly reliable will prove themselves to be such, and will survive.\u00a0 The extent to which the Cornell LII is already considered an authoritative source for the U.S. Supreme Court is good evidence of this.\u00a0 At the same time, there is much to be gained from taking some fairly easy and reasonable measures to demonstrate the accuracy of the documents being made available.<\/p>\n

The utility of this goes beyond just building trust.\u00a0 The kind of information generated in authenticating a document is also important in the context of creating and maintaining durable electronic archives.\u00a0 As such, we should all look to implement these practices.<\/p>\n

The first element of an authentication system is both obvious and easy to overlook: disclosure.\u00a0 An explanation of how a publisher obtains the material on offer, and how that material is handled, should be documented and available to prospective users.\u00a0 For the public, this explanation needs to be on a page on the website.\u00a0 It’s a perfect candidate for inclusion on a FAQ page. (Yes, even if no one has asked. I mean, how many people really count questions received before creating their FAQs?) For the archive, it is essential that this information also be embedded in the document metadata.\u00a0 A simple Dublin Core source tag is a start, but something along the lines of the TEI <sourceDesc> and <revisionDesc> tags is essential here (see http:\/\/www.tei-c.org\/release\/doc\/tei-p5-doc\/html\/HD.html<\/a>).<\/p>\n

An explanation of the source of a document will show the user a chain of custody leading back to an original source.\u00a0 The next step is to do something to make that chain of custody verifiable.<\/p>\n

It is at this point that things can either stay reasonable or spin off toward some expensive extremes, so let’s be clear about the ground rules.\u00a0 We are concerned with public-domain documents, which are not going to be sold (so no money transfer is involved), and where no confidential information is being passed.\u00a0 For these reasons, site encryption and SSL certificates are overkill.\u00a0 We aren’t concerned with the <i>transmission<\/i> of the documents, only their preparation and maintenance.\u00a0 The need is for document-level verification.\u00a0 For that, the easy and clear answer is a digital signature.
\nAt the
Government Printing Office<\/a> (GPO), the PDF versions of new legislative documents are being verified with a digital signature provided by GeoTrust CA<\/a> and handled by Adobe, Inc<\/a>.\u00a0 These are wonderful, and provide a high level of reliability.\u00a0 For the initial provider, they make a certain amount of sense.<\/p>\n

However, I question the need for an outside provider to certify the authenticity of a document that is being provided directly from GPO.\u00a0 Note what the certification really amounts to:\u00a0 an MD5 hash that has been verified and \u201ccertified\u201d by GeoTrust.\u00a0 It’s nice because anyone can click on the GPO logo and see the certificate contents.\u00a0 The certificate itself, however, doesn’t do anything more than that.\u00a0 The important thing is the MD5 sum upon which the certificate relies.
\nIn addition, the certificate is invalid as soon as any alterations whatsoever are made to the document.\u00a0 Again, this makes some sense, but it does not accommodate the need to add value to the original document through format conversion to HTML, XML or other useful formats, insertion of hypertext links, addition of significant metadata, etc.<\/p>\n

The answer to this problem is to retain the MD5 sum, while dropping the certificate.\u00a0 The retained MD5 sum can still be used to demonstrate a chain of custody.\u00a0 For example, here at Rutgers\u2013Camden<\/a>, we collect New Jersey Appeals <\/a>decisions provided to us by the courts.\u00a0 As they are downloaded from the court’s server in MS Word format, we have started generating an MD5 sum of this original file.\u00a0 The document is converted to HTML with embedded metadata and hypertext links, but the MD5 sum of the original is included in the metadata.\u00a0 It can be compared to the original Word document on the court’s server to verify that what we got was exactly what they provided.<\/p>\n
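As a rough illustration of the first step described above, the following Python sketch computes the MD5 sum of a downloaded file and checks it against a previously recorded value. The function names and the chunk size are assumptions made for this example, not a description of the scripts actually used at Rutgers. MD5 is used only because it is the digest discussed in this article; hashlib offers stronger digests such as SHA-256 through the same interface.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    '''Return the hex MD5 digest of the file at path, reading it in chunks.'''
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # Read fixed-size chunks so that large files are never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_md5):
    '''True if the file still matches the digest recorded at download time.'''
    return md5_of_file(path) == recorded_md5.lower()
```

Anyone holding a copy of the original Word file could run the same calculation and compare the result with the orig_md5 value published in the metadata.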

The next step is to generate an additional MD5 sum of the HTML file that we generated from the original.\u00a0 Of course, this can’t be embedded in the file, but it is retained in the database that has a copy of all the document metadata, and can be queried anytime needed.\u00a0 That, combined with an explanation of the revisions performed on the document, completes the chain of custody.\u00a0 As embedded in our documents, the revision notes are put in as an HTML-ized variation on the TEI revision description, and look like this:
\n<META NAME=\"revisionDate\" CONTENT=\"Wed May\u00a0 6 17:05:56 EDT 2009\">
\n<META NAME=\"revisionDesc\" CONTENT=\"Downloaded from NJAOC as MS Word document, converted to HTML with wvHtml. Citations converted to hypertext links.\">
\n<META NAME=\"orig_md5\" CONTENT=\"8cc57f9e06513fdd14534a2d54a91737\"><\/p>\n
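The database side of this arrangement might be sketched as follows. This is only an illustration under stated assumptions: the table name custody, its columns, and the use of SQLite are all invented for the example, and our production database is not described here. The point is simply that the digest of the generated HTML lives outside the file it describes, next to the digest of the original and the revision note.

```python
import hashlib
import sqlite3


def open_custody_db(db_path=':memory:'):
    '''Open (or create) a small SQLite database holding one row per document.'''
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS custody ('
        'doc_id TEXT PRIMARY KEY, orig_md5 TEXT, html_md5 TEXT, revision_desc TEXT)'
    )
    return conn


def record_document(conn, doc_id, original_bytes, html_bytes, revision_desc):
    '''Store digests of the original and the converted file, plus the revision note.'''
    conn.execute(
        'INSERT OR REPLACE INTO custody VALUES (?, ?, ?, ?)',
        (
            doc_id,
            hashlib.md5(original_bytes).hexdigest(),
            hashlib.md5(html_bytes).hexdigest(),
            revision_desc,
        ),
    )
    conn.commit()


def html_is_unchanged(conn, doc_id, html_bytes):
    '''True if html_bytes matches the digest recorded for doc_id.'''
    row = conn.execute(
        'SELECT html_md5 FROM custody WHERE doc_id = ?', (doc_id,)
    ).fetchone()
    return row is not None and row[0] == hashlib.md5(html_bytes).hexdigest()
```

A periodic job could call html_is_unchanged for every published file and flag any document whose current contents no longer match the recorded digest.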

Another possible method for doing this sort of thing would be the strategy suggested by Thomas Bruce and the Cornell LII.\u00a0 Instead of generating an original and subsequent MD5 sum, one might generate a single digital signature of the document’s text stream, stripped of all formatting, tagging, and graphics.\u00a0 The result should be an MD5 sum that would be the same for both the original document and the processed document, no matter what subsequent format alterations or other legitimate value-added tagging were done.<\/p>\n
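To make the idea concrete, here is a minimal Python sketch of a text-stream digest. It is not the LII’s implementation, which is not described here. It rests on two assumptions chosen for the example: markup is removed with the standard library HTML parser, and all whitespace is discarded before hashing so that differences in spacing and line breaks cannot change the result. The name text_stream_md5 is hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    '''Collect character data and ignore every tag and attribute.'''

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: the contents of script and style elements also arrive here.
        self.parts.append(data)


def text_stream_md5(markup):
    '''Return an MD5 digest of the text content of markup, ignoring whitespace.'''
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    # Assumption: dropping all whitespace is an acceptable normalization.
    text = ''.join(''.join(collector.parts).split())
    return hashlib.md5(text.encode('utf-8')).hexdigest()
```

Under these assumptions, a plain-text copy and an HTML copy with added links or emphasis produce the same digest, while any change to the wording produces a different one. A real deployment would also need to extract the text stream from the original Word file by an equivalent rule.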

The attraction of a single digital signature that would identify any accurate copy of a document is obvious, and may ultimately be the way to proceed.\u00a0 In order for it to work, however, things like minor inconsistencies in the treatment of\u00a0 \u201cinsignificant\u201d whitespace (See\u00a0 http:\/\/www.oracle.com\/technology\/pub\/articles\/wang-whitespace.html<\/a> for an explanation), and the treatment of other odd things (like macro generated dates, etc. in MS Word), would have to be carefully accounted for and consistently treated.
\nFinally, I don’t think any discussion of authenticity and reliability of legal information on the Internet should leave out a point I hinted at earlier in this article.\u00a0 In the long run, information does not, and will not, survive without widespread distribution.\u00a0 In this time of cheap disk space and fast Internet connections, we have the unprecedented opportunity to preserve information through widespread distribution.\u00a0 Shared and mirrored repositories among numbers of educational and other institutions would be a force for enormous good.\u00a0 Imagine an institution recovering from the catastrophic loss of its collections by merely re-downloading from any of hundreds of sister institutions.\u00a0 Such a thing is possible now.<\/p>\n

In such an environment, with many sources and repositories easily accessible, all of which are in the business only of maintaining accurate information, reliable data would tend to become the market norm.\u00a0 You simply could not maintain a credible repository that contained errors, either intentional or accidental, in a world where there are many accurate repositories openly available.<\/p>\n

Widespread distribution, along with things like the above suggestions, is the key to a reliable and durable information infrastructure.\u00a0 Each of us who would publish and maintain a digital repository needs to take steps to ensure that our information is verifiably authentic.\u00a0 And then, we need to hope that sooner or later, we will be joined by others.<\/p>\n

There. I am still a naive optimist.<\/p>\n

\nJohn Joergensen is the digital services librarian at Rutgers University
\nSchool of Law – Camden. \u00a0He publishes a wide range of New Jersey primary
\nsource materials, and is digitizing the library’s collection of
\ncongressional documents. \u00a0These are available at
\n
http:\/\/lawlibrary.rutgers.edu<\/a>.<\/p>\n

VoxPopuLII is edited by Judith Pratt<\/a>.\u00a0 Editor in Chief is Rob Richards<\/a>.<\/p>\n

<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"

When I first started off in the field of Internet publishing of legal materials, I did, briefly consider the topic of authenticity, and its importance to the end user.\u00a0 My conclusion back then rested on the simple consideration that since I was a librarian and was acting under the aegis of a university, I had […]<\/a><\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[220,276,277,278],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/119"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/comments?post=119"}],"version-history":[{"count":0,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/119\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/media?parent=119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/categories?post=119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/tags?post=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}