{"id":20,"date":"2012-05-15T10:29:55","date_gmt":"2012-05-15T15:29:55","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/metasausage\/?p=20"},"modified":"2012-06-12T06:58:50","modified_gmt":"2012-06-12T11:58:50","slug":"identifiers-part-2","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/metasausage\/2012\/05\/15\/identifiers-part-2\/","title":{"rendered":"Identifiers, Part 2"},"content":{"rendered":"
<\/a>[ Part 2 in a 3-part series. Last time we talked about some general characteristics of identifiers for legislation<\/a>, and some sources of confusion in legacy systems. This time: some design problems having to do with granularity and use, and the fact that identifiers are situated in legal and bureaucratic process. ]<\/em><\/span><\/p>\n Identifier granularity<\/span><\/p>\n How small a thing should we try to identify? It\u2019s difficult to make general prescriptions about that, for needs vary from corpus to corpus. \u00a0For the most part, we assume that identifier granularity should follow common citation or cross-referencing practice; that is, the smallest thing we identify or label should be the smallest thing that standard citation practice would allow a user to navigate to. \u00a0That will vary from collection to collection, and from context to context. For example, it\u2019s quite common for citations to the US Code to refer to objects at the subsection level, sometimes right down to the paragraph level. \u00a0On the other hand, references to the Code in the Parallel Table of Authorities and Rules generally refer to a full section. \u00a0Similarly, although cross-references within the Code of Federal Regulations can be very granular, external references typically target the Part level. In any corpus, amendments can be expressed in ways that are very granular indeed.<\/span><\/strong><\/strong><\/p>\n Our citation and cross-referencing practices have evolved in the context of print, and we may be able to do things that better reflect the dynamic nature of legislative text. \u00a0The move from print to digital overturns background assumptions about practicality. \u00a0For example, print typically makes different assumptions about identifier stability than you would find, say, in an online legislative drafting system. 
\u00a0Good examples of this are found in citation practice for the Code of Federal Regulations, which typically cites material at the Part level because (one imagines) changes in numbering and naming of sections are so frequent as to render identifiers tied to such fine divisions unstable — at least in print, where the shelf life of such fine-grained identifiers is shorter than the shelf life of the edition by an order of magnitude. In a digital environment, it is possible to manage identifiers more closely, permitting graceful failure of those that are no longer valid, and providing automated navigation to things that have moved. We look at some of the possibilities and implications in sections on granularity, fragmentation, and recombination below. \u00a0All of those capabilities carry costs, and over-design is a real possibility.<\/span><\/p>\n Thinking about granularity leads to ideas about the linkages between metadata and the target object itself. \u00a0Often metadata applies to chunks of documents rather than whole documents. \u00a0Cross-referencing in statutes and legislation is usually done at the subdocument level, for instance, and subject-matter classification of a bill containing multiple unrelated provisions would be better if the subject classifications could be separately tied to specific provisions within the bill. That need becomes particularly acute when something important, but unrelated to the main purpose of the bill, has been \u201csnuck in\u201d to a much larger piece of legislation. \u00a0A stunning example of such a Frankenstein\u2019s monster appears at \u00a0111 Pub. L. 226<\/a> . 
It is described in its preamble as modernizing the air-traffic control system, but its first major Title heading describes it as an \u201cEducation Jobs Fund\u201d, \u00a0and its second major Title contemplates highly technical apparatus for providing fiscal relief to state governments.<\/span><\/strong><\/strong><\/p>\n We are aware that sometimes we are thinking in terms that are not currently supported by the markup of documents in existing XML-creating systems. \u00a0However, we think it makes sense to design identifier systems that are more capable than some document collections will currently support via markup, in the expectation that \u00a0markup in those collections will evolve to the same granularity as current cross-referencing and citation practice, and that point-in-time systems supporting the full lifecycle of legislative drafting, passage, and codification will become the norm. \u00a0Right now, \u00a0divisions of statutory and regulatory text below the section level (\u201csubsection containers\u201d) are among the most prominent examples of \u201cmissing markup\u201d; they are provided for in the legislative XML DTDs at (e.g.) xml.house.gov<\/a>, but do not survive into the FDsys versions from GPO.<\/span><\/p>\n Most often, we imagine that the flow of document processing leads from markup to metadata, since as a practical matter a lot of metadata is generated simply by extracting text features that have been tagged with some XML or HTML element. \u00a0Sometimes the flow is in the other direction; we may want to embed metadata in the documents for various purposes. \u00a0Use of microformats, microdata, and other such schemes can be handy for various applications; the use of research-management software like Zotero<\/a>, or the embedding of information about document authenticity comes to mind. 
\u00a0These are not part of a legislative data model per se, but represent use cases worth thinking about.<\/span><\/p>\n At the other end of the spectrum, identifier systems that are heavily burdened with semantics have problems with uniqueness, length, persistence, language, and other issues arising from the inherent ambiguity of labels and other home-brewed identifier components. \u00a0It is worth remembering, too, that one person\u2019s helpful semantics are another\u2019s mumbo-jumbo; just walk up to someone at random and ask them the dates of the 75th Congress if you need proof of that. Useful systems find a middle ground between extremes of incomprehensible rigor and mindlessly verbose recitation of loosely constructed labels. <\/span><\/p>\n It\u2019s worth noting in passing that it can be very difficult to prevent the unwanted exposure of \u201cback-end\u201d identifiers to end users. \u00a0For example, URIs constructed from back-end components often find their way into the browser bars of authors researching online, who then paste them into documents that would be better served by more brain-compatible, human-digestible versions. <\/span><\/p>\nMetadata, markup, and embedding<\/span><\/h2>\n
Stresses and strains<\/span><\/h2>\n
Next, we turn to things that affect the design of real-world identifier systems, perhaps rendering them less \u201cpure\u201d in information-design terms than we might like.<\/span>
\n<\/strong><\/strong><\/span><\/span><\/h2>\nSemantics versus purity<\/span><\/h3>\n
Some systems enforce notions of identifier purity (often defined as some combination of uniqueness, orderliness, and ease of collation and sorting) by rigorously stripping all semantics from identifiers. \u00a0That approach can function reasonably well in back-end systems, but it greatly reduces the usefulness of the identifiers to humans (because understanding what the identifier identifies requires a database lookup), and it introduces extra possibilities for error in application because (among other reasons) human transcription errors are hard to catch when the identifiers are meaningless strings of letters and numbers. \u00a0On the other hand, \u201cpure\u201d opaque identifiers counter a tendency to assume that one knows what a semantically laden identifier means, when in fact one might not. \u00a0And sometimes opaque identifiers can be used to provide stability in situations where labels change frequently but the labelled objects do not. \u00a0<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n