{"id":33,"date":"2012-06-11T13:53:54","date_gmt":"2012-06-11T18:53:54","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/metasausage\/?p=33"},"modified":"2012-06-12T07:06:48","modified_gmt":"2012-06-12T12:06:48","slug":"identifiers-part-3","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/metasausage\/2012\/06\/11\/identifiers-part-3\/","title":{"rendered":"Identifiers, part 3"},"content":{"rendered":"
[This is part 3 of a three-part post on identifiers. Here are parts 1<\/a> and 2<\/a><\/em>]<\/p>\n To judge by the examples presented so far, current practice in legislative identifiers for US materials might best be described as \u201ccoping\u201d, and specifically \u201ccoping in a way that was largely designed to deal with the problems of print\u201d. Current practice presents a welter of “identifiers”, monikers, names, and titles, all believed by those who create and use them to be sufficiently rigorous to qualify as identifiers whether they are or not. \u00a0It might be useful to divide these into four categories:<\/span><\/span><\/p>\n John Sheridan (of legislation.gov.uk) has written eloquently about the use of legislative Linked Data<\/a> to support the development of \u201caccountable systems\u201d.\u00a0The key idea is that exposing legislative data using Linked Data techniques has particular informational and economic value when that data <\/span>defines real-world objects<\/span> for legal purposes. \u00a0If we turn our attention from statutes to regulations, that value becomes even more obvious.<\/span><\/p>\n There are no unicorns in the United States Code. Nevertheless, legislative data describes and references many, many things. \u00a0More, it provides fundamental definitions of how those things are seen by Federal law. 
\u00a0It is valuable to be able to expose such definitions — and other fundamental information — in a way that allows it to be related to other collections of information for consumption by a global audience.<\/span><\/p>\n In a particularly insightful blog post that discusses the advantages of the Linked Data methods<\/a> used in building legislation.gov.uk, Jeni Tennison points out the ability that RDF and Linked Data standards have to solve a longstanding problem in government information systems: the social problem of standard-setting and coordination:<\/span><\/p>\n RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we really want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.<\/span><\/p>\n The other thing about RDF that really helps here is that it\u2019s easy to align vocabularies if you want to, post-hoc.<\/span>RDFS<\/span><\/a> and<\/span>OWL<\/span><\/a> define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. 
It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.<\/span><\/p>\n While Tennison\u2019s remarks here concentrate on vocabularies, a similar point can be made about identifier schemes; it is easy to relate multiple legacy identifiers to a \u201cgold standard\u201d.<\/span><\/p>\n Well-designed, URI-based identifier schemes create APIs for the underlying data. \u00a0At the moment, the leading example for legislative information is the scheme used by legislation.gov.uk, described in summary at <\/span>http:\/\/data.gov.uk\/blog\/legislationgovuk-api<\/span><\/a> \u00a0and in detail in a collection of developer documentation linked from that page. \u00a0Because a URI is resolvable, functioning as a sort of retrieval hook, it is also the basis of a well-organized scheme for accessing different facets of the underlying information. \u00a0<\/span>legislation.gov.uk<\/span><\/a> \u00a0uses a three-layer system to distinguish the abstract identity of a piece of legislation from its current online expression as a document and from a variety of format-specific representations. \u00a0<\/span><\/p>\n That is an inspiring approach, but we would want to extend it to encompass point-in-time as well as point-in-process identification (such as being able to retrieve all of the codified fragments of a piece of legislation as codified, using its original bill number, popular name, or what-have-you). \u00a0At the moment, <\/span>legislation.gov.uk<\/span><\/a> does this only via search, but the recently announced Dutch statutory collection at <\/span>http:\/\/doc.metalex.eu\/<\/span><\/a> does support some point-in-time features. 
\u00a0\u00a0It is worth pointing out that the American system presents greater challenges than either of these, \u00a0because of our more chaotic legislative drafting practices, the complexity of the legislative process itself, and our approach to amendment and codification.<\/span><\/p>\n The idea that we would publish legislative information using Linked Data approaches has obvious granularity implications (see above), but there are others that may prove more difficult. \u00a0Here we discuss three: \u00a0uniqueness over wider scope, resolvability, and the practical needs of \u201cidentifier manufacturing\u201d:<\/span><\/p>\n Many of the identifiers developed in the closed silo of the world of legal citation could be reused as URIs in a linked data context, exposing them to use and reuse in environments outside the world where legal citation has developed. \u00a0In the open world, identifiers need to carry their context with them, rather than have that context assumed or dependent on bespoke processes for resolution or access. \u00a0\u00a0For the most part, citation of judicial opinions survives wide exposure in fair style. \u00a0Other identifiers used for government documents do not cope as well. \u00a0\u00a0Above, we mentioned bill numbers as being limited in chronological scope; other identifiers (particularly those that rely heavily on document titles or dates as the sole means of distinction from other documents in the same corpus) may not fare well either.<\/span><\/p>\n In reality, they have different goals. \u00a0URIs provide resolvability: that is, the ability to actually find your way to an information resource, \u00a0or to information about a real-world thing that is not on the web. \u00a0As Jeni Tennison remarks in her blog, they do that at the expense of creating a certain amount of ambiguity. 
\u00a0Well-designed URN schemes, on the other hand, can be unambiguous in what they name, particularly if they are designed to be part of a global document identification scheme from the beginning, as they are in the emerging URN:Lex specification<\/a> . \u00a0\u00a0<\/span><\/p>\n For our purposes, we probably want to think primarily in terms of URIs, but (as with legacy identifier schemes) there will be advantages to creating sensible linkages between our system, which emphasizes reliability, and others that emphasize a lack of ambiguity and coordination with other datasets. \u00a0<\/span><\/p>\n \u00a0<\/strong><\/p>\n Legislation is created by real people and it acts on real things. \u00a0It is incredibly valuable to be able to relate legislative documents to those things. \u00a0The challenge lies, as it always has, \u00a0in eliminating ambiguity about which object we are talking about. \u00a0A newer and more subtle need is the need to distinguish references to the real-world object itself from references to representations of the object on the web. \u00a0The problems of distinguishing one John Smith from another are already well understood in the library community. \u00a0URIs present a new set of challenges. \u00a0For instance, we might want to think about how we are to correctly interpret a URI that might refer to John Smith, the off-web object that is the person himself, and a URI that refers to the Wikipedia entry that is (possibly one of many) on-web representations of John Smith. \u00a0This presents a variety of technical challenges that are still being resolved<\/a>.\u00a0<\/span><\/p>\n Thinking about the highly-granular approach needed to make legislative data usefully recombinant — as suggested in the section on fragmentation and recombination above — quickly leads to practical questions about where all those granular identifiers will come from. 
The problem becomes more acute when we begin to think about retrofitting such schemes to large bodies of legacy information. \u00a0For these among other reasons, the ability to manufacture and assign high-quality identifiers by automated means has become the Philosopher\u2019s Stone of digital legal publishers. \u00a0It is not that easy to do. \u00a0<\/span><\/p>\n The reasons are many, and some arise from design goals that may not be shared by everyone, or from misperceptions about the data. \u00a0For example, it\u2019s reasonable to assume that a sequence of accession numbers represents a chronological sequence of some kind, but as we\u2019ve already seen, that\u2019s not always the case. \u00a0Legacy practices complicate this. \u00a0For example, it would be interesting to see how the sequence of Supreme Court cases for which we have an exact chronological record (via file datestamping associated with electronic transmission) corresponds to their sequence as officially published in printed volumes. \u00a0It may well be that sequence in print has been dictated as much by page-layout considerations as by chronology. \u00a0It might well be that two organizations assigning sequential identifiers to the same corpus retrospectively would come up with a different sequence.<\/span><\/p>\n Those are the problems we encounter in an identifier scheme that is, theoretically, content-independent. \u00a0Content-dependent schemes can be even more challenging. \u00a0Automatic creation of identifiers typically rests on the automated extraction of one or more document features that can be concatenated to make a unique identifier of wide scope. 
\u00a0There are some document collections where that may be difficult or impossible, either because there is no combination of extractable document features that will result in a unique identifier, or because legacy practices have somehow obliterated necessary information, or because it is not easy to extract the relevant features by automated means. \u00a0We imagine that retroconversion of House Committee prints would present exactly this challenge. \u00a0<\/span><\/p>\n At the same time, it is worth remembering that the technologies available for extracting document features are improving dramatically, suggesting that a layered, incremental approach might be rewarded in the future. \u00a0While the idea of \u201cgraceful degradation\u201d seems at first blush to be less applicable to identifiers than to other forms of metadata, it is possible to think about the problem a little differently in the context of corpus retroconversion. \u00a0That is a complicated discussion, but it seems possible that the use of provisional, accession-based identifiers within a system of properties and relationships designed to accommodate incomplete knowledge about the document might yield good results.<\/span><\/p>\n Identifiers have special value in an information domain where authority is as important as it is for legal information. \u00a0In the event of disputes, parties need to be able to definitively identify a dispositive, authoritative version of a statute, regulation, or other legal document. \u00a0There is, then, a temptation toward a soft monopoly in identifiers: the idea that there should be a definitive, authoritative copy somewhere leads to the idea of a definitive, authoritative identifier administered by a single organization. Very often, challenges of scale and scope have dictated that that be a commercial publisher. \u00a0Such a scheme was followed for many years in the citation of judicial opinions, resulting in an effective monopoly for one publisher. 
\u00a0That is proving remarkably difficult and expensive to undo, even though it has had serious cost implications and other detrimental effects on the legal profession and for the public. \u00a0Care is needed to ensure that the soft, natural monopoly that arises from the creation of authoritative documents by authoritative sources does not harden into real impediments to the free flow of public information, as it did in the case of judicial opinions.<\/span><\/p>\n This is not a complete set of general recommendations — really more a series of guideposts or suggestions, to be altered and tempered by institutional realities:<\/span><\/p>\n We conclude, as always, with a musical selection<\/a> or two<\/a>. \u00a0Next time, some stuff about people and organizations as we find them in the legislative world.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":" [This is part 3 of a three-part post on identifiers. Here are parts 1 and 2] How well does current practice measure up? To judge by the examples presented so far, current practice in legislative identifiers for US materials might best be described as \u201ccoping\u201d, and specifically \u201ccoping in a way that was largely designed 
[…]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/33"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/comments?post=33"}],"version-history":[{"count":8,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/33\/revisions"}],"predecessor-version":[{"id":43,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/33\/revisions\/43"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/media?parent=33"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/categories?post=33"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/tags?post=33"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}How well does current practice measure up?<\/span><\/strong><\/h2>\n
\n
Identifiers in a Linked Data context<\/span><\/h2>\n
Valuable features of Linked Data approaches to legislative information<\/span><\/strong><\/h3>\n
Ability to reference real-world objects<\/span><\/h4>\n
\u201c<\/span>On the Semantic Web, URIs identify not just Web documents, but also real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn. We call these real-world objects or things.\u201d — Tim Berners-Lee<\/a><\/span><\/h4>\n
Avoiding cumbersome standards-building processes<\/span><\/h4>\n
Layering and API-building<\/span><\/strong><\/h4>\n
Identifier challenges arising from Linked Data (and Web exposure generally)<\/span><\/strong><\/h3>\n
Uniqueness over wider scope<\/span><\/h4>\n
Resolvability<\/span><\/h4>\n
The differences between URNs (Uniform Resource Names) and URLs (Uniform Resource Locators, the URIs based on the HTTP protocol) are significant. \u00a0Wikipedia notes that URNs are similar to personal names and URLs to street addresses; the former rely on resolution services to function. \u00a0In many cases, URNs can provide the basis for URLs, with resolution built into the http address, but in the world we\u2019re now working in, URNs must be seen as insufficient for creating linked open data.<\/span><\/strong><\/h4>\n
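To make the point about URNs providing the basis for URLs a little more concrete, here is a minimal sketch of one way "resolution built into the http address" might look. It is only an illustration under stated assumptions: the resolver base address, the function names, and the sample URN (which uses the reserved "urn:example" namespace) are all hypothetical, and nothing here reflects how URN:Lex or legislation.gov.uk actually build their addresses.

```python
from urllib.parse import quote, unquote

# Hypothetical resolver address, assumed purely for illustration.
RESOLVER_BASE = "https://resolver.example.org/"


def urn_to_url(urn: str, base: str = RESOLVER_BASE) -> str:
    """Embed a URN in an HTTP URL so that the name can be dereferenced."""
    if not urn.lower().startswith("urn:"):
        raise ValueError(f"not a URN: {urn!r}")
    # Leave ':' and ';' readable; percent-encode anything unsafe in a URL path.
    return base + quote(urn, safe=":;")


def url_to_urn(url: str, base: str = RESOLVER_BASE) -> str:
    """Recover the original URN from a URL produced by urn_to_url."""
    if not url.startswith(base):
        raise ValueError(f"URL does not belong to resolver {base!r}")
    return unquote(url[len(base):])


# Illustrative name in the reserved 'urn:example' namespace; not a real identifier.
sample_urn = "urn:example:statute:2010;148"
sample_url = urn_to_url(sample_urn)
```

The design choice worth noticing is that the name survives intact inside the address, so a publisher could keep an unambiguous URN as the stable name while the HTTP wrapper supplies resolvability.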
Things not on the Web<\/span><\/h4>\n
Practical manufacturing and assignment of Web-oriented identifiers<\/span><\/h4>\n
A final note on economics<\/span><\/h2>\n
What we recommend<\/span><\/h2>\n
\n
\n