How well does current practice measure up?
To judge by the examples presented so far, current practice in legislative identifiers for US materials might best be described as “coping”, and specifically “coping in a way that was largely designed to deal with the problems of print”. Current practice presents a welter of “identifiers”, monikers, names, and titles, all believed by those who create and use them to be sufficiently rigorous to qualify as identifiers, whether they are or not. It might be useful to divide these into four categories:
- Well-understood monikers, issued in predetermined ways as part of the legislative process by known actors. Their administrative stability may well be the product of statutory requirement or of requirements embedded in House or Senate rules. Many of these will also correspond to definite stages in the legislative process. Examples would include House and Senate bill and resolution numbers.
- Monikers arising from need and possibly semi-formalized, or possibly “bent” versions of monikers created for a purpose other than the one they end up serving. Monikers of this kind are widely relied on, but nobody is really responsible for them. Some end up being embedded in retrieval systems because they’re all there is. A variety of such approaches are on display in the world of House committee prints.
- Monikers imposed after the fact in an effort to systematize things or otherwise compensate for any deficiencies of monikers issued at earlier stages of the process. Certainly internal database identifiers would fit this description; so would most official citation.
- A grab-bag of other monikers. These might be created within government (as with GPO’s SuDoc numbers), or outside government altogether (as with accession numbers or other schemes that identify historical papers held in other libraries). Here, a good model would provide a set of properties enabling others to relate their schemes to ours.
Identifiers in a Linked Data context
John Sheridan (of legislation.gov.uk) has written eloquently about the use of legislative Linked Data to support the development of “accountable systems”. The key idea is that exposing legislative data using Linked Data techniques has particular informational and economic value when that data defines real-world objects for legal purposes. If we turn our attention from statutes to regulations, that value becomes even more obvious.
Valuable features of Linked Data approaches to legislative information
Ability to reference real-world objects
“On the Semantic Web, URIs identify not just Web documents, but also real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn. We call these real-world objects or things.” -- Tim Berners-Lee
There are no unicorns in the United States Code. Nevertheless, legislative data describes and references many, many things. More, it provides fundamental definitions of how those things are seen by Federal law. It is valuable to be able to expose such definitions -- and other fundamental information -- in a way that allows it to be related to other collections of information for consumption by a global audience.
Avoiding cumbersome standards-building processes
In a particularly insightful blog post that discusses the advantages of the Linked Data methods used in building legislation.gov.uk, Jeni Tennison points out the ability that RDF and Linked Data standards have to solve a longstanding problem in government information systems: the social problem of standard-setting and coordination:
RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we really want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.
The other thing about RDF that really helps here is that it’s easy to align vocabularies if you want to, post-hoc. RDFS and OWL define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.
While Tennison’s remarks here concentrate on vocabularies, a similar point can be made about identifier schemes; it is easy to relate multiple legacy identifiers to a “gold standard”.
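That kind of post-hoc alignment can be sketched very simply. The following is a minimal illustration, in the spirit of owl:sameAs, of relating a legacy identifier to a “gold standard” one; the URIs and the tiny triple store are invented for illustration, not drawn from any real scheme.

```python
# A hypothetical legacy identifier and a hypothetical gold-standard one
# for the same bill, related by a single sameAs-style assertion.
SAME_AS = "owl:sameAs"

triples = set()

legacy = "http://example.gov/legacy/hr/112/1234"
gold = "http://example.gov/id/us/bill/112/hr1234"

# Assert once that the two names denote the same bill...
triples.add((legacy, SAME_AS, gold))

def expand(uri):
    """Return all names known to denote the same thing as `uri`."""
    names = {uri}
    for s, p, o in triples:
        if p == SAME_AS and (s in names or o in names):
            names.update({s, o})
    return names

# ...and a lookup against either identifier finds both.
print(sorted(expand(legacy)))
```

The point, as Tennison notes for vocabularies, is that the legacy scheme need not change at all: the relationship is asserted alongside it, after the fact.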
Layering and API-building
Well-designed, URI-based identifier schemes create APIs for the underlying data. At the moment, the leading example for legislative information is the scheme used by legislation.gov.uk, described in summary at http://data.gov.uk/blog/legislationgovuk-api and in detail in a collection of developer documentation linked from that page. Because a URI is resolvable, functioning as a sort of retrieval hook, it is also the basis of a well-organized scheme for accessing different facets of the underlying information. legislation.gov.uk uses a three-layer system to distinguish the abstract identity of a piece of legislation from its current online expression as a document and from a variety of format-specific representations.
That is an inspiring approach, but we would want to extend it to encompass point-in-time as well as point-in-process identification (such as being able to retrieve all of the codified fragments of a piece of legislation as codified, using its original bill number, popular name, or what-have-you). At the moment, legislation.gov.uk does this only via search, but the recently announced Dutch statutory collection at http://doc.metalex.eu/ does support some point-in-time features. It is worth pointing out that the American system presents greater challenges than either of these, because of our more chaotic legislative drafting practices, the complexity of the legislative process itself, and our approach to amendment and codification.
Identifier challenges arising from Linked Data (and Web exposure generally)
The idea that we would publish legislative information using Linked Data approaches has obvious granularity implications (see above), but there are others that may prove more difficult. Here we discuss three: uniqueness over wider scope, resolvability, and the practical needs of “identifier manufacturing”:
Uniqueness over wider scope
Many of the identifiers developed in the closed silo of the world of legal citation could be reused as URIs in a Linked Data context, exposing them to use and reuse in environments outside the world where legal citation developed. In the open world, identifiers need to carry their context with them, rather than have that context assumed or dependent on bespoke processes for resolution or access. For the most part, citation of judicial opinions survives wide exposure fairly well. Other identifiers used for government documents do not cope as well. Above, we mentioned bill numbers as being limited in chronological scope; other identifiers (particularly those that rely heavily on document titles or dates as the sole means of distinction from other documents in the same corpus) may not fare well either.
The differences between URNs (Uniform Resource Names) and URLs (Uniform Resource Locators, the URIs based on the HTTP protocol) are significant. Wikipedia notes that URNs are similar to personal names, and URLs to street addresses; the first rely on resolution services to function. In many cases, URNs can provide the basis for URLs, with resolution built into the HTTP address, but in the world we are now working in, URNs alone must be seen as insufficient for creating linked open data.
In reality, they have different goals. HTTP URIs provide resolvability -- that is, the ability to actually find your way to an information resource, or to information about a real-world thing that is not on the web. As Jeni Tennison remarks in her blog, they do that at the expense of creating a certain amount of ambiguity. Well-designed URN schemes, on the other hand, can be unambiguous in what they name, particularly if they are designed to be part of a global document identification scheme from the beginning, as they are in the emerging URN:Lex specification.
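One common way of combining the two virtues is to embed an unambiguous URN-style name inside a resolvable HTTP URI. The sketch below assumes a hypothetical resolver service, and the urn:lex components shown are simplified stand-ins rather than the full URN:Lex grammar.

```python
# Map a URN-style name onto an HTTP URL by prefixing it with the
# address of a (hypothetical) resolver service. The URN stays
# unambiguous; the URL makes it actionable on the web.
RESOLVER_BASE = "http://resolver.example.gov"

def urn_to_url(urn: str) -> str:
    """Embed a urn:lex-style name in a resolvable HTTP URL."""
    if not urn.startswith("urn:lex:"):
        raise ValueError("not a urn:lex name")
    return f"{RESOLVER_BASE}/{urn}"

print(urn_to_url("urn:lex:us:congress:bill:2011;hr.1234"))
```

The resolver, not the name, carries the burden of knowing where the current authoritative copy lives, which is exactly the division of labor the personal-name/street-address analogy suggests.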
For our purposes, we probably want to think primarily in terms of URIs, but (as with legacy identifier schemes) there will be advantages to creating sensible linkages between our system, which emphasizes reliability, and others that emphasize a lack of ambiguity and coordination with other datasets.
Things not on the Web
Legislation is created by real people and it acts on real things. It is incredibly valuable to be able to relate legislative documents to those things. The challenge lies, as it always has, in eliminating ambiguity about which object we are talking about. A newer and more subtle need is to distinguish references to the real-world object itself from references to representations of the object on the web. The problems of distinguishing one John Smith from another are already well understood in the library community. URIs present a new set of challenges. For instance, we need to distinguish a URI that refers to John Smith, the off-web object that is the person himself, from a URI that refers to the Wikipedia entry that is (possibly one of many) on-web representations of John Smith. This presents a variety of technical challenges that are still being resolved.
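One widely used convention for keeping the two apart is the “hash URI” pattern: the URI for the thing carries a fragment, and stripping the fragment yields the URI of the web document that describes it. The example URIs below are invented for illustration.

```python
# A minimal sketch of the hash URI convention: the person and the
# page about the person get distinct but systematically related URIs.
from urllib.parse import urldefrag

thing_uri = "http://example.org/people/john-smith#person"  # the person

# Fragments are stripped before an HTTP request is made, so
# dereferencing the thing-URI necessarily retrieves the document.
doc_uri, fragment = urldefrag(thing_uri)

print(doc_uri)    # the web document about John Smith
print(fragment)   # the marker distinguishing the person from the page
```

The convention does not solve every problem in this area, but it makes the distinction machine-checkable: any URI with a fragment can be assumed to name a thing, and its defragmented form the describing document.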
Practical manufacturing and assignment of Web-oriented identifiers
Thinking about the highly granular approach needed to make legislative data usefully recombinant -- as suggested in the section on fragmentation and recombination above -- quickly leads to practical questions about where all those granular identifiers will come from. The problem becomes more acute when we begin to think about retrofitting such schemes to large bodies of legacy information. For these among other reasons, the ability to manufacture and assign high-quality identifiers by automated means has become the Philosopher’s Stone of digital legal publishers. It is not that easy to do.
The reasons are many, and some arise from design goals that may not be shared by everyone, or from misperceptions about the data. For example, it’s reasonable to assume that a sequence of accession numbers represents a chronological sequence of some kind, but as we’ve already seen, that’s not always the case. Legacy practices complicate this. For example, it would be interesting to see how the sequence of Supreme Court cases for which we have an exact chronological record (via file datestamping associated with electronic transmission) corresponds to their sequence as officially published in printed volumes. It may well be that sequence in print has been dictated as much by page-layout considerations as by chronology. It might well be that two organizations assigning sequential identifiers to the same corpus retrospectively would come up with a different sequence.
Those are the problems we encounter in an identifier scheme that is, theoretically, content-independent. Content-dependent schemes can be even more challenging. Automatic creation of identifiers typically rests on the automated extraction of one or more document features that can be concatenated to make a unique identifier of wide scope. There are some document collections where that may be difficult or impossible, either because there is no combination of extractable document features that will result in a unique identifier, or because legacy practices have somehow obliterated necessary information, or because it is not easy to extract the relevant features by automated means. We imagine that retroconversion of House Committee prints would present exactly this challenge.
At the same time, it is worth remembering that the technologies available for extracting document features are improving dramatically, suggesting that a layered, incremental approach might be rewarded in the future. While the idea of “graceful degradation” seems at first blush to be less applicable to identifiers than to other forms of metadata, it is possible to think about the problem a little differently in the context of corpus retroconversion. That is a complicated discussion, but it seems possible that the use of provisional, accession-based identifiers within a system of properties and relationships designed to accommodate incomplete knowledge about the document might yield good results.
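That layered idea can be sketched concretely: build a feature-based identifier when extraction yields everything needed, and issue a provisional accession-based one when it does not. The feature names and identifier shapes below are hypothetical, chosen only to illustrate the fallback pattern.

```python
# Manufacture an identifier from extracted document features when all
# required features are present; otherwise fall back to a provisional
# accession number that can be related to a feature-based identifier
# later, once extraction technology improves.
import itertools

_accession = itertools.count(1)

def make_identifier(features: dict) -> str:
    required = ("corpus", "congress", "number")
    if all(features.get(k) for k in required):
        return "us/{corpus}/{congress}/{number}".format(**features)
    # Incomplete metadata: degrade gracefully rather than guess.
    return f"us/provisional/{next(_accession):06d}"

print(make_identifier({"corpus": "crpt", "congress": 112, "number": 45}))
print(make_identifier({"corpus": "cprt"}))  # missing features
```

The provisional identifiers are not throwaways: held within a system of properties and relationships, each can later be asserted equivalent to a proper feature-based identifier without disturbing anything already linked to it.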
A final note on economics
Identifiers have special value in an information domain where authority is as important as it is for legal information. In the event of disputes, parties need to be able to definitively identify a dispositive, authoritative version of a statute, regulation, or other legal document. There is, then, a temptation toward a soft monopoly in identifiers: the idea that there should be a definitive, authoritative copy somewhere leads to the idea of a definitive, authoritative identifier administered by a single organization. Very often, challenges of scale and scope have dictated that that be a commercial publisher. Such a scheme was followed for many years in the citation of judicial opinions, resulting in an effective monopoly for one publisher. That is proving remarkably difficult and expensive to undo, even though it has had serious cost implications and other detrimental effects on the legal profession and for the public. Care is needed to ensure that the soft, natural monopoly that arises from the creation of authoritative documents by authoritative sources does not harden into real impediments to the free flow of public information, as it did in the case of judicial opinions.
What we recommend
This is not a complete set of general recommendations -- really more a series of guideposts or suggestions, to be altered and tempered by institutional realities:
- At the most fundamental level, everything should have an identifier. It should be available for use by the public. For example, Congressional committee reports appear not to have any identifiers, but it would be reasonable to assume that some system is in use in the background, at least for their publication by GPO.
- Many legacy identifier systems will need to be extended or modified to create a gold standard system, probably issued by a third party and not by the document creators themselves. That is especially the case because there is nobody in a position to compel good practice by document creators over the long term. Such a gold standard will need to be:
- Unambiguous. For example, existing bill and resolution numbers would need to be extended by, e.g., a date of introduction.
- Designed to resist tampering. When things are numbered and labelled, there is a temptation to alter numbers and labels to serve short-term interests. The reservation of “important” bill numbers under House procedural rules is an example; another (from the executive branch) is the long-standing practice of manipulating RIN numbers to color assessments of agency activity.
- Clear as to the separation of titling, dating, and identification functions. Presidential documents provide a good example of something currently needing improvement in this respect.
- Take advantage of carefully designed relationships among identifiers to allow the retention of well-understood legacy monikers for foreground use, while making use of a well-structured “gold standard” from the beginning. Those relationships should enable automated linkage that will allow retrieval across multiple, related identifier systems.
- Where possible, retain useful semantics in identifiers as a way of increasing access and reducing errors. It is possible that different audiences will require different semantics, making this unlikely to happen in the background, but it should be possible to retain this functionality in the foreground.
- Maintain granularity at the level of common citation and cross-referencing practice, but with a distinction between identifiers and labels. Identifiers should be assigned at the whole-document level, with the notion of “whole document” determined on a corpus-by-corpus basis. Labels may be assigned to subdocuments (e.g., a section of a bill) for purposes of navigation and retrieval. This is similar in function and purpose to the distinction between HREF and NAME attributes in HTML anchor tags.
- Use a layered approach. In our view, it is important not to hold future systems hostage to what is practicable in legacy document collections. In general, it will be much harder to implement good practices over documents that were not “born digital”. That is not a good reason to water down our prospective approach, but it is a good reason to design systems that degrade gracefully when it becomes difficult or impossible to deal with older collections. That is particularly true at a time when the technologies for extracting metadata from legacy documents are improving dramatically, suggesting that a layered, incremental approach might produce great gains in the future.
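The “unambiguous” recommendation above can be made concrete with a small sketch. A House bill number alone recurs every two years, but combined with its Congress and date of introduction it names exactly one measure. The identifier shape and the dates below are hypothetical, chosen only for illustration.

```python
# Extend a legacy bill moniker (chamber + number) with the Congress
# and date of introduction to produce an unambiguous identifier.
from datetime import date

def gold_standard_bill_id(chamber: str, number: int, congress: int,
                          introduced: date) -> str:
    return f"us/bill/{congress}/{chamber.lower()}{number}/{introduced.isoformat()}"

# "H.R. 1" by itself is ambiguous across Congresses; the extended
# identifiers are distinct. (Dates here are illustrative.)
print(gold_standard_bill_id("hr", 1, 112, date(2011, 1, 20)))
print(gold_standard_bill_id("hr", 1, 111, date(2009, 1, 26)))
```

Because the legacy moniker survives intact inside the extended form, it can still serve as the foreground handle that people actually use, with the fuller identifier doing the disambiguating work in the background.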