{"id":20,"date":"2012-05-15T10:29:55","date_gmt":"2012-05-15T15:29:55","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/metasausage\/?p=20"},"modified":"2012-06-12T06:58:50","modified_gmt":"2012-06-12T11:58:50","slug":"identifiers-part-2","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/metasausage\/2012\/05\/15\/identifiers-part-2\/","title":{"rendered":"Identifiers, Part 2"},"content":{"rendered":"

[ Part 2 in a 3 part series. Last time we talked about some general characteristics of identifiers for legislation<\/a>, and some sources of confusion in legacy systems. This time: some design problems having to do with granularity and use, and the fact that identifiers are situated in legal and bureaucratic process. ]<\/em><\/span><\/p>\n

Identifier granularity<\/span><\/p>\n

How small a thing should we try to identify? It\u2019s difficult to make general prescriptions about that, for needs vary from corpus to corpus. \u00a0For the most part, we assume that identifier granularity should follow common citation or cross-referencing practice — that is, the smallest thing we identify or label should be the smallest thing that standard citation practice would allow a user to navigate to. \u00a0That will vary from collection to collection, and from context to context. For example, it\u2019s quite common for citation to the US Code to refer to objects at the subsection level, sometimes right down to the paragraph level. \u00a0On the other hand, references to the Code in the Parallel Table of Authorities and Rules generally refer to a full section. \u00a0Similarly, although cross-references within the Code of Federal Regulations can be very granular, external references typically target the Part level. In any corpus, amendments can be expressed in ways that are very granular indeed.<\/span><\/strong><\/strong><\/p>\n

Our citation and cross-referencing practices have evolved in the context of print, and we may be able to do things that better reflect the dynamic nature of legislative text. \u00a0The move from print to digital overturns background assumptions about practicality. \u00a0For example, print typically makes different assumptions about identifier stability than you would find, say, in an online legislative drafting system. \u00a0Good examples of this are found in citation practice for the Code of Federal Regulations, which typically cites material at the Part level because (one imagines) changes in numbering and naming of sections are so frequent as to render identifiers tied to such fine divisions unstable — at least in print, where the shelf life of such fine-grained identifiers is shorter than the shelf life of the edition by an order of magnitude. In a digital environment, it is possible to manage identifiers more closely, permitting graceful failure of those that are no longer valid, and providing automated navigation to things that have moved. We look at some of the possibilities and implications in sections on granularity, fragmentation, and recombination below. \u00a0All of those capabilities carry costs, and over-design is a real possibility.<\/span><\/p>\n

Metadata, markup, and embedding<\/span><\/h2>\n

Thinking about granularity leads to ideas about the linkages between metadata and the target object itself. \u00a0Often metadata applies to chunks of documents rather than whole documents. \u00a0Cross-referencing in statutes and legislation is usually done at the subdocument level, for instance, and subject-matter classification of a bill containing multiple unrelated provisions would be better if the subject classifications could be separately tied to specific provisions within the bill. That need becomes particularly acute when something important, but unrelated to the main purpose of the bill, has been \u201csnuck in\u201d to a much larger piece of legislation. \u00a0A stunning example of such a Frankenstein\u2019s monster appears at 111 Pub. L. 226<\/a>. It is described in its preamble as modernizing the air-traffic control system, but its first major Title heading describes it as an \u201cEducation Jobs Fund\u201d, and its second major Title contemplates highly technical apparatus for providing fiscal relief to state governments.<\/span><\/strong><\/strong><\/p>\n

We are aware that sometimes we are thinking in terms that are not currently supported by the markup of documents in existing XML-creating systems. \u00a0However, we think it makes sense to design identifier systems that are more capable than some document collections will currently support via markup, in the expectation that markup in those collections will evolve to the same granularity as current cross-referencing and citation practice, and that point-in-time systems supporting the full lifecycle of legislative drafting, passage, and codification will become the norm. \u00a0Right now, divisions of statutory and regulatory text below the section level (\u201csubsection containers\u201d) are among the most prominent examples of \u201cmissing markup\u201d; they are provided for in the legislative XML DTDs at (e.g.) xml.house.gov<\/a>, but do not survive into the FDsys versions from GPO.<\/span><\/p>\n

Most often, we imagine that the flow of document processing leads from markup to metadata, since as a practical matter a lot of metadata is generated simply by extracting text features that have been tagged with some XML or HTML element. \u00a0Sometimes the flow is in the other direction; we may want to embed metadata in the documents for various purposes. \u00a0Use of microformats, microdata, and other such schemes can be handy for various applications; the use of research-management software like Zotero<\/a>, or the embedding of information about document authenticity comes to mind. \u00a0These are not part of a legislative data model per se, but represent use cases worth thinking about.<\/span><\/p>\n

Stresses and strains<\/span><\/h2>\n

Next, we turn to things that affect the design of real-world identifier systems, perhaps rendering them less \u201cpure\u201d in information-design terms than we might like.<\/span>
\n<\/strong><\/strong><\/span><\/span><\/h2>\n

Semantics versus purity<\/span><\/h3>\n

Some systems enforce notions of identifier purity (often defined as some combination of uniqueness, orderliness, and ease of collation and sorting) by rigorously stripping all semantics from identifiers. \u00a0That approach can function reasonably well in back-end systems, but it greatly reduces the usefulness of the identifiers to humans (because understanding what the identifier identifies requires a database lookup), and it introduces extra possibilities for error in application because (among other reasons) errors caused by human transcription are hard to catch when the identifiers are meaningless strings of letters and numbers. \u00a0On the other hand, \u201cpure\u201d opaque identifiers counter a tendency to assume that one knows what a semantically laden identifier means, when in fact one might not. \u00a0And sometimes opaque identifiers can be used to provide stability in situations where labels change frequently but the labelled objects do not. \u00a0<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n

At the other end of the spectrum, identifier systems that are heavily burdened with semantics have problems with uniqueness, length, persistence, language, and other issues arising from inherent ambiguity of labels and other home-brewed identifier components. \u00a0It is worth remembering, too, that one person\u2019s helpful semantics are another\u2019s mumbo-jumbo; just walk up to someone at random and ask them the dates of the 75th Congress if you need proof of that. Useful systems find a middle ground between extremes of incomprehensible rigor and mindlessly verbose recitation of loosely-constructed labels. <\/span><\/p>\n

It\u2019s worth noting in passing that it can be very difficult to prevent the unwanted exposure of \u201cback-end\u201d identifiers to end users. \u00a0For example, URIs constructed from back-end components often find their way into the browser bars of authors researching online, who then paste them into documents that would be better served by more brain-compatible, human-digestible versions. <\/span><\/p>\n

\n\n\n\n\n<\/colgroup>\n\n\n\n\n\n\n\n\n
Moniker type<\/span><\/td>\nIdentifier<\/span><\/td>\nNotes<\/span><\/td>\n<\/tr>\n
Citation<\/span><\/td>\n18 USC 47<\/span><\/td>\nStandard citation ignores all but Title and section number; intermediate aggregations not needed, and confusing.<\/span><\/td>\n<\/tr>\n
Popular name<\/span><\/td>\nWild Horse Annie Act<\/span><\/td>\nDescriptive and often used in popular accounts, the press, agency guidance on related rules, etc., but hard to find in the codified version.<\/span><\/td>\n<\/tr>\n
LII URI, \u201cpresentable\u201d version<\/span><\/td>\nhttp:\/\/www.law.cornell.edu\/uscode\/18\/47.html<\/span><\/td>\nBased on title and section number<\/span><\/td>\n<\/tr>\n
LII URI, \u201cformal\u201d version<\/span><\/td>\nhttp:\/\/www.law.cornell.edu\/uscode\/18\/usc_sec_18_00000047----000-.html<\/span><\/a><\/td>\nAlso title and section based, but padded and normalized to allow proper collation; \u201csupersection\u201d aggregations above the section level are similarly disambiguated.<\/span><\/td>\n<\/tr>\n
USGPO URI, GPOAccess<\/span><\/td>\nhttp:\/\/frwebgate.access.gpo.gov\/cgi-bin\/getdoc.cgi?dbname=browse_usc&docid=Cite:+18USC47<\/span><\/td>\nParameterized search returning 1 result.<\/span><\/td>\n<\/tr>\n
FindLaw URI<\/span><\/td>\nhttp:\/\/codes.lp.findlaw.com\/uscode\/18\/I\/3\/47<\/span><\/td>\nSeemingly mysterious, because it interjects subtitle and part numbering, which is not used in citation. \u00a0Note that this hierarchy would also vary from Title to Title of the Code; not all have Subtitles, for example.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n

The table above shows some \u201cmonikers in the wild\u201d: various real-world approaches to the problem of identifying a particular section of the US Code. \u00a0The \u201cformal\u201d LII identifier, highlighted in yellow, shows just how elaborate an identifier needs to be if it is to accommodate all the variation that is present in US Code section numbering<\/a> (there is, for example, a 12 USC 1749bbb-10c), while still supporting collation<\/span>. \u00a0The FindLaw URI demonstrates the fragility of hierarchical schemes; the intermediate path components would vary enormously from Title to Title, and occasionally lead to some confusion about structure, as intermediate levels of aggregation are called different things in different Titles. It is hard to tell, for example, whether FindLaw interpolates “missing” levels into the URIs in order to maintain an identical scheme across Titles with dissimilar “supersection” structure.<\/span>
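The padding idea behind the formal identifier can be sketched in a few lines of Python. This is a minimal illustration and not LII's actual algorithm: the function name padded_key and the eight-digit width are assumptions made for the example (the width simply mirrors the 00000047 visible in the table). It zero-pads every run of digits so that plain string sorting agrees with numeric order, even for section numbers such as 1749bbb-10c.

```python
import re

def padded_key(section, width=8):
    # Hypothetical helper: zero-pad each run of digits so that plain
    # string comparison sorts section numbers in numeric order.
    return re.sub('[0-9]+', lambda m: m.group(0).zfill(width), section.lower())

# Illustrative section numbers; only 47 and 1749bbb-10c come from the text.
sections = ['47', '1749bbb-10c', '5', '1749a', '1749bbb-2', '1749']

naive = sorted(sections)                      # plain string sort
collated = sorted(sections, key=padded_key)   # sort on the padded key

print(naive)
print(collated)
```

A plain string sort places 5 after 47 and 1749bbb-10c before 1749bbb-2; sorting on the padded key avoids both problems, at the cost of a longer and less readable identifier.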
\n<\/strong><\/strong><\/span><\/span><\/h2>\n

Administrative zones of control and procedural rules<\/span><\/h3>\n

Every identifier implies a zone of administrative control: \u00a0somebody has to assign it, somebody has to ensure its uniqueness, and somebody or something has to resolve it to an actual document location, physical or electronic. \u00a0Though it has taken years, the community has recognized that qualities of persistence and uniqueness are primarily created by administrative rather than technical apparatus<\/a>. \u00a0That becomes a much more critical factor when dealing with government documents, which may be surrounded by legal restrictions on who may assign identifiers and when, and in some cases what the actual formats must be. \u00a0A legislative document may have its roots in ideas and policies formed well outside government, and pass through numerous internal zones of control as it makes its way through the legislature. It may emerge at the other end via a somewhat mysterious intellectual process in which it is blown to bits and the fragments reassigned to a coherent, but altogether different, intellectual structure with its own system of identifiers (we call this \u2018codification\u2019). \u00a0There may be internal or external requirements that, at various points in the process, \u00a0cause the document to be expressed in a variety of publications and formats each carrying its own system of citations and identifiers.<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n

The legacy process, then, is an accretive one in which an object acquires multiple monikers from multiple sources, each with its own requirements and rules. \u00a0Sometimes those requirements and rules are shaped by concerns that are outside, and perhaps at odds with, sound information-organization practice. \u00a0<\/span><\/p>\n

For example, the House and Senate each have their own rules of procedure, in which bill numbering is specified. \u00a0Bill numbers are usually accession numbers that reset with each new Congress, but the rules of procedure create exceptions. \u00a0Under the rules of the House for \u00a0the 106th Congress, the first ten bill numbers were reserved for use by the Speaker of the House for a specified time period. During the 107th and 108th Congresses (at least), the time period was extended to the full first session. \u00a0We surmise that this may have represented an attempt to reserve \u201cimportant\u201d bill numbers for things important to the majority party\u2019s legislative agenda. \u00a0Needless to say, this rendered any relationship between bill numbers and chronology or order of introduction questionable, at least in a limited number of cases. The important point is that identifier usage will be hostage to political considerations for as long as it is controlled by rules of procedure; that situation is not likely to change. \u00a0<\/span><\/p>\n

But there are also virtues to the legacy process, primarily because close association with long-standing institutional practices lends long-term stability to identifier schemes. \u00a0Bill numbers have institutional advocates, are well understood, and are unlikely to change very much in deployment or format. They provide good service within their intended scope, however much they may lose when taken outside it.<\/span><\/p>\n

That being said, a \u201cgold standard\u201d system of identifiers, specified and assigned by a relatively independent body, is needed at the core. \u00a0That gold standard can then be extended via known, stable relationships with existing identifier systems, and designed for extensible use by others outside the immediate legislative community.<\/span><\/p>\n

Status, tracing, versioning and parallel activity<\/span><\/h3>\n

It is useful to distinguish between <\/span>tracing<\/span> the evolution of a bill or other legislative document and <\/span>recording the status<\/span> of that document. \u00a0<\/span>Status<\/span> usually records a strong association between some version of the document and a particular, well-known stage or event in the process by which it is created, revised, or made binding. \u00a0That presents two problems. \u00a0There is a <\/span>granularity<\/span> problem, in that some legislative events that cause alteration of the document are so trivial that to distinguish all of them would be to create an unnecessarily fine-grained, burdensome, and unworkable system. There is a <\/span>stability<\/span> problem in that legislative processes change over time, sometimes in ways that are very significant, as when (in 1975) the House rules changed to allow bills to be considered by multiple committees, and sometimes in ways that are not, as when House procedural rules are revised in trivial, short-lived ways at the beginning of each new Congress. \u00a0Optimally, bill status would be a property related to a small vocabulary of documented legislative milestones or events that remains very stable over time. \u00a0Detailed tracing of the evolution of a bill would be enabled through a series of relationships among documents that would (for instance) identify predecessor and successor drafts as well as other inter-document relationships. \u00a0These properties would exist as part of the data model without need for additional semantics in the identifiers. Such a scheme might readily be extended to accommodate the existence of multiple, parallel drafts, as sometimes happens during committee process.<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n

In this way, the model would answer questions about the \u201cversion\u201d of a given document by making assertions either about its \u201cstatus\u201d — that is, whether it is tied to some well-known milestone in legislative process — or by some combination of properties that are chained back to such a milestone. \u00a0For example, a document might be described as a \u201ccommittee draft from Committee X that is a direct revision of the document submitted to the committee, dated on such-and-such a date\u201d. \u00a0The exact \u201cversion\u201d of the document is given by a chain of relationships tied back to a draft that can be definitively associated with a stable milestone in the legislative process.<\/span><\/p>\n

It\u2019s worth noting that while it would certainly be possible to identify versions using \u201cversion numbers\u201d built out by extending the accession number of the root document with various semantically-derived text strings, it\u2019s not necessary to do so. \u00a0The identifiers could, in fact, be anything at all. \u00a0All that is needed is for them to be linked to well-known \u201cmilestone\u201d documents (e.g., the version reported out of committee) by a chain of relationships (for example, \u201cisSuccessorVersionOf\u201d) that leads back to the milestone. \u00a0This may be particularly important when the document-to-document relationship extends across boundaries between zones of administrative control, or outside government altogether.<\/span><\/p>\n
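A minimal Python sketch of that idea follows. Everything in it is hypothetical: the opaque identifiers, the milestone label, and the function name chain_to_milestone are invented for illustration, and the dictionary merely stands in for whatever store would hold isSuccessorVersionOf assertions. The point is only that a version can be pinned down by walking relationships back to a milestone, with no semantics in the identifiers themselves.

```python
def chain_to_milestone(doc_id, successor_of, milestones):
    # Follow isSuccessorVersionOf links from doc_id until a milestone
    # document is reached. Returns the list of identifiers visited, or
    # None if the chain breaks or loops before reaching a milestone.
    chain = [doc_id]
    seen = {doc_id}
    while chain[-1] not in milestones:
        previous = successor_of.get(chain[-1])
        if previous is None or previous in seen:
            return None
        chain.append(previous)
        seen.add(previous)
    return chain

# Hypothetical data: each key isSuccessorVersionOf its value.
successor_of = {'q7x': 'm2k', 'm2k': 'a9f'}
# Hypothetical milestone vocabulary, keyed by document identifier.
milestones = {'a9f': 'reported out of committee'}

chain = chain_to_milestone('q7x', successor_of, milestones)
print(chain, milestones[chain[-1]])
```

Because the identifiers are opaque, the same walk works unchanged when a draft originates in a different zone of administrative control; only the relationship assertions need to be shared.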

Granularity<\/span><\/h3>\n

To a great extent, the things that are being ‘identified’ by identifiers are discrete documents, traditionally rendered as discrete print works. There are, however, significant exceptions that should be accommodated. In addition, changes in the nature and structure of documents that may be issued in the future should be anticipated as well.<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n

The issue of \u201cgranularity\u201d arises from the need to identify parts of a discrete document. For example, although a congressional hearing is published as a single document (sometimes in multi-volume form), it may be useful to make specific references to the testimony of individual witnesses. Even more significant would be mapping the relationships between the U.S. Code and the public laws from which it is derived. In these cases, the identifiers available should be finer-grained than the documents being identified. So, although a Public Law or slip law can be completely identified and described by a given set of identifiers, it is valuable to have additional identifiers available for sub-parts of these documents, so that their relationships to sections of the U.S. Code can be mapped adequately.<\/span><\/p>\n

Of course, admitting such identifiers can be a slippery slope. The set of things that <\/span>could<\/em><\/span> be identified in legislative documents is fairly unbounded, and any identifiers will arguably be useful to <\/span>someone<\/span>. An attempt to label all possible things, however, is madness, and should be avoided. The result would be large numbers of unused or seldom-used identifiers, which would over-complicate entities and the overall structure of the identifier system.<\/span><\/p>\n

Fragmentation and recombination<\/span><\/h3>\n

Identifiers are used in ways that go well beyond slapping a unique label on a relatively static document. \u00a0They help us keep track of resources that can, in the electronic environment, be highly mobile. \u00a0Legislation is often fragmented and re-combined into new expressions, some official and some not. \u00a0For many legal purposes, it is important for the fragments to be recognized as <\/span>authentic<\/span>, that is, carrying the same weight of authority as the work from which they were originally taken. \u00a0Current practice accommodates this through the use of a variety of officially-published finding aids, including significant ones associated with the US Code: \u00a0the Table of Popular Names, the \u201cShort Title\u201d notes, and Table III of the printed edition of the US Code, which is essentially a codification map. Elsewhere<\/a>, we’ve referred to such a work as a \u201cpont\u201d, that is, something that bridges two isolated legal information resources. \u00a0Encoding \u00a0of ponts in engineered ways that facilitate use in retrieval systems is a particularly crucial function that should be supported by the identifier model.<\/span>\u00a0<\/span>
\n<\/strong><\/strong><\/span><\/span><\/h2>\n

Codification<\/span><\/h4>\n

Codification presents challenges, the more so because it can erect substantial barriers for inexperienced researchers. \u00a0Citizens often seek legislation by popular name (\u201cWild Horse Annie Act\u201d). They don\u2019t get far. \u00a0The problem is usually (though not always) more difficult than simply uncovering an association between the popular name of the act they\u2019re seeking and some coherent chunk of the United States Code, or a fragment within a document that carries a Public Law number. \u00a0Often, the original legislation has been codified in ways that scatter fragments over multiple Titles of the US Code.<\/span><\/strong><\/strong><\/span><\/span><\/h2>\n

That is so because even a coherent piece of legislation — and many are not — \u00a0typically addresses a bundle of issue-related concerns, or the needs of a particular constituency. \u00a0A \u201cfarm bill\u201d might contain provisions related to tax, land use, regulation of commodities, water rights, and so on. \u00a0All of those belong in very different places under the system of topics used by the US Code. \u00a0Thus, legislation is fragmented and recombined during the process of codification. \u00a0While this results in much more coherent intellectual organization of statutes over the long term, it makes it difficult for users to exchange the tokens they have — usually the popular name of an Act, or some other moniker assigned by the press <\/span>(\u201cObamacare\u201d) <\/span><\/strong><\/strong><\/span>— for access to what they are seeking.<\/span><\/strong><\/strong><\/span><\/p>\n

Table III of the United States Code<\/a> provides a map from provisions of Public Laws to their eventual destination within the US Code, as the Code existed at the time of classification. \u00a0That is potentially very useful to a present-day audience, provided that the relationships expressed in it can be traced forward through time; changes to the Code from the time of classification forward \u00a0would need to be accounted for. \u00a0That would rest on two things: \u00a0an identifier system capable of tracking the fragments of the original Act as they are codified, and a series of relationships that account for both the process of codification and the processes by which the Code itself subsequently evolves.<\/span>
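The two ingredients named above can be sketched in Python under heavy assumptions. The provision label, the Code locations, and the function name current_location are all placeholders invented for this example; none of them is real Table III data. The sketch keeps a classification map (where a provision landed at the time of classification) separate from a chronological list of later moves within the Code, and composes the two to find where the provision lives now.

```python
def current_location(provision, classification, moves):
    # classification: provision -> Code location at time of classification.
    # moves: chronological list of (old_location, new_location) pairs;
    #        a new_location of None means the text was repealed or removed.
    location = classification.get(provision)
    for old, new in moves:
        if location is None:
            break
        if location == old:
            location = new
    return location

# Placeholder data, not drawn from any real Public Law or Title.
classification = {'PL-X sec. 3': 'Title A sec. 10'}
moves = [
    ('Title A sec. 10', 'Title A sec. 12'),   # renumbered
    ('Title A sec. 12', 'Title B sec. 40'),   # transferred to another Title
]

print(current_location('PL-X sec. 3', classification, moves))
```

Keeping the classification map and the move history as separate relationships means neither has to be rewritten when the other changes, which is the property the paragraph above asks of the identifier model.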
\n<\/strong><\/strong><\/span><\/span><\/h2>\n

Fragmentary re-use<\/span><\/h4>\n

Codification is really a special case of something we might call \u201cfragmentary re-use\u201d — an application in which a document snippet, or other excerpt from an object, is reused outside its parent. \u00a0Next time, we’ll discuss the problems of identifier exposure in a Linked Data context, noting that identifiers must carry their own context. \u00a0A noteworthy example of this is the legislative fragment that needs to carry some link back to its provenance, and specifically its legal status or authority. \u00a0Minimally, this would be an identifier resolvable to a data resource describing the provenance of the fragment. \u00a0Such an approach might fit well into a \u201clayered\u201d URI scheme such as that used by legislation.gov.uk. <\/span><\/strong><\/span><\/span><\/h2>\n

[ <\/span>Fragmented and recombined as we are, we’ll stop here with a song about codification<\/a>, a granular, highly-recombinant and NSFW musical selection<\/a>, and a third that queries the object itself (at 2:45<\/a>) and makes heavy use of visual and audio recombination. Next time: some problems with current practice, identifier manufacturing, and what happens when we think about Linked Data, as we surely should<\/a><\/em><\/span> ]<\/span><\/span><\/p>\n


\n<\/strong><\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"

[ Part 2 in a 3 part series. Last time we talked about some general characteristics of identifiers for legislation, and some sources of confusion in legacy systems. This time: some design problems having to do with granularity and use, and the fact that identifiers are situated in legal and bureaucratic process. ] Identifier granularity […]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[6753,6752],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/20"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/comments?post=20"}],"version-history":[{"count":12,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/20\/revisions"}],"predecessor-version":[{"id":41,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/posts\/20\/revisions\/41"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/media?parent=20"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/categories?post=20"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/metasausage\/wp-json\/wp\/v2\/tags?post=20"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}