{"id":7,"date":"2012-05-07T07:52:35","date_gmt":"2012-05-07T12:52:35","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/metasausage\/?p=7"},"modified":"2012-05-18T04:09:10","modified_gmt":"2012-05-18T09:09:10","slug":"identifiers-part-1","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/metasausage\/2012\/05\/07\/identifiers-part-1\/","title":{"rendered":"Identifiers, Part 1"},"content":{"rendered":"

[NB: Making MetaSausage is a new blog on legislative metadata and legislative systems. \u00a0It’s a place to talk geek about legislation. \u00a0We make no promises, but we think posts will appear every couple of weeks. \u00a0Comments encouraged. ]<\/em><\/p>\n

\"The<\/a>The law-creating process described in <\/span>How Our Laws Are Made<\/a><\/span> (HOLAM), and other civics texts like it, is a lot like the Mississippi River: \u00a0formed out of a zillion small tributaries, many of them nameless, joined into a stream that passes through a number of jurisdictions and has lots of side passages, loops and eddies, eventually breaking up again into a series of tiny streams passing through a delta. \u00a0\u00a0There is a central part of the process \u00a0— the mainstream — that is fairly well mapped, with placenames and milestones that are pretty well understood. \u00a0There are hundreds of smaller streams and brooks at either end of the process that are not well understood or named at all, and a few places in the middle where the main stream branches unpredictably. \u00a0It is a complicated map<\/a>, and it describes a territory where many people, places and things are named \u00a0— but \u00a0many are not, and some are named in ways that are ambiguous, confusing, or conflicting.<\/span><\/strong><\/strong><\/em><\/p>\n

This post is about <\/span>identifiers, <\/span>and particularly <\/span>document identifiers<\/span> : snippets of text that uniquely identify \u00a0documents that are either generated by the legislative process or are found in its vicinity. \u00a0That idea is simple enough. But well-thought-out, carefully constructed identifiers are an important foundation of any data model — and are surprisingly difficult to design. \u00a0Legislative data models have (at least) two purposes: \u00a0first, they are a kind of specification that precisely describes data encountered in and around the legislative process, the precise relationships among the data items and elements, and (significantly) relationships between the data and the real-world people, groups, and processes that create and manipulate the data. \u00a0Second, they are a device to enable communication among system-builders, stakeholders, and users about what is to be collected, what is to be expressed or retrieved, and so on. \u00a0\u00a0Before any of that can be built in a way that is both precise and communicative, we must be sure of what exactly we are talking about. \u00a0Identifiers should answer that question — <\/span>what the hell are we talking about?<\/span><\/em> — unambiguously. \u00a0\u00a0Or at least we would like them to. \u00a0Often, our legacy identifier systems don\u2019t do that very well. \u00a0As we shall see, \u00a0many existing identifier schemes are burdened with competing constraints and conflicting expectations, with less-than-ideal results.<\/span><\/p>\n

What do identifiers do?<\/span><\/h2>\n

In print, identifiers have worked differently than we really want them to in an electronic environment. \u00a0The conventions of printed books — use of pagination, difficulty of recall once issued, relative stability of editions, and most of all the assumption that identifiers will be interpreted by human readers with some knowledge of their context and purpose — result in identifiers that are less rigorous than what we need in a world of granular data consumed and processed by machines. \u00a0Some illustrations are found below. In reality our legacy “identifiers” are often less-rigorous monikers serving multiple functions, and in a digital environment we must unpack them into separate items with separate functions. \u00a0Here are some of the functions:<\/span><\/strong><\/p>\n

a)<\/span> Unique naming.<\/span> \u00a0The diverse monikers that document creators and administrators use in current practice are supposed to provide unique names for documents. \u00a0Sometimes they do; often they don\u2019t. \u00a0Usually that is because a moniker that is unique within a particular scope loses uniqueness in some wider, unanticipated arena. That is especially likely to happen when a collection of objects is moved from its original, intended scope onto the open Web, but you can find examples closer to home. \u00a0A Congressional bill number is a good example: it is unique only within the Congress during which it was assigned. \u00a0There might be an \u201cH.R. 1234\u201d in each of several Congresses; \u201c108 H.R. 1234\u201d is made unique by the addition of the number of the Congress during which it was introduced. \u00a0Of course, human error is often at fault, as when, for one year in the mid-1990s, there were two very different section 512s in Title 17 of the US Code. \u00a0\u00a0<\/span><\/p>\n
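
One way to picture the scoping problem is to treat the number of the Congress as part of the key rather than as optional context. The Python sketch below is only an illustration: the BillId class and its field names are assumptions made for this example, not part of any official identifier scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillId:
    # Hypothetical composite key: all three fields are needed for uniqueness.
    congress: int    # e.g. 108
    bill_type: str   # e.g. 'H.R.'
    number: int      # e.g. 1234

    def local_label(self) -> str:
        # The moniker as it is used inside a single Congress.
        return f'{self.bill_type} {self.number}'

    def scoped_label(self) -> str:
        # The same moniker widened with the scope that makes it unique.
        return f'{self.congress} {self.bill_type} {self.number}'


a = BillId(108, 'H.R.', 1234)
b = BillId(109, 'H.R.', 1234)

# The local labels collide across Congresses; the scoped labels do not.
print(a.local_label() == b.local_label())    # True
print(a.scoped_label() == b.scoped_label())  # False
```

Making the scope a required field means that a bare bill number can never be stored as if it were a complete identifier.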

b) <\/span>Navigational reference<\/span>. \u00a0Identifiers often serve as search terms or convenient handles for taking the reader to another document, or for retrieving it (we discuss retrieval in the next section). \u00a0Standard caselaw citation practice is a special case of this, created specifically for printed books. \u00a0In that legacy context, unique identification and citation functions are often run together badly, usually because numbered pages are not sufficiently granular to uniquely identify individual items. \u00a0\u00a0For example, two briefly-reported judicial opinions might well appear on the same page of a print reporter, and thus carry an identical citation. \u00a0The citation is then a perfectly good tool for navigating to each case within a series of printed volumes, but is not a unique name or identifier for either of them. \u00a0\u00a0A look at <\/span>http:\/\/bulk.resource.org\/courts.gov\/c\/F3\/173\/<\/span><\/a> will show that numerous cases, each quite short, originally appeared on page 421 of Volume 173 of West\u2019s Federal Reporter, 3rd Series. \u00a0\u00a0A sample is here: <\/span>http:\/\/liicr.nl\/rimZJe<\/span><\/a> . \u00a0Any of the cases listed might be cited as 173 F.3d 421.<\/span><\/p>\n
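
The gap between a citation and a unique name shows up as soon as citations are used as lookup keys. In the sketch below, the case names, the internal ids, and the second citation are invented placeholders; only the citation 173 F.3d 421 comes from the text above.

```python
from collections import defaultdict

# Hypothetical records: (internal id, case name, citation).
CASES = [
    ('case-0001', 'Placeholder v. Example', '173 F.3d 421'),
    ('case-0002', 'Sample v. Illustration', '173 F.3d 421'),
    ('case-0003', 'Another v. Matter', '173 F.3d 500'),
]

by_citation = defaultdict(list)
by_id = {}
for case_id, name, citation in CASES:
    by_citation[citation].append(case_id)  # navigational key, may repeat
    by_id[case_id] = name                  # unique name, one per case


def is_unique_name(citation):
    # A citation names exactly one case only if nothing else shares its page.
    return len(by_citation[citation]) == 1
```

Here the citation still gets a reader to the right page, but only the separate internal id picks out a single opinion.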

c) <\/span>Retrieval hook\/container label<\/span>. \u00a0\u00a0Here, we distinguish use of a citation as a retrieval hook from its use as a navigational device. As we make our way around the Web, that distinction is usually blurred. Following a link to its destination puts a chunk of text in front of our eyes, and so it\u2019s hard to remember that the link might refer to the contents of a container for which it also provides a label, rather than to a simple destination milestone. \u00a0<\/span><\/p>\n

To make the distinction clear<\/a>, it\u2019s useful to think about incorporation-by-reference or other forms of embedding. \u00a0Suppose that we wish to present the current text of a subsection of a statute inside some other online document (a citizen\u2019s guide to Social Security benefits, for example). \u00a0We would likely do that via machine retrieval of the particular statutory subsection based on its identifier, but our goal would be to summon up a chunk of text, not navigate to a particular destination. \u00a0Put another way, our current practice conflates the use of citation as a means of identifying a <\/span>point, milestone, or destination<\/span> in a document (a navigational reference) with a means of identifying a <\/span>labelled subdocument<\/span> that can be referenced or retrieved for other purposes (a container label).<\/span><\/p>\n

As an example, the THOMAS <\/a>pages for individual bills and resolutions aggregate a great deal of information from the Congressional Record (CR). \u00a0The Bill Summary \u2018Actions\u2019 link to two representations of a CR page. \u00a0One is a textual representation that begins with the desired text but sometimes runs past it into information about unrelated issues. \u00a0The other is a PDF representation that shows the whole page, where the desired text may start towards the end, along with subsequent pages if the relevant section extends past the initial page. <\/span><\/p>\n

For a specific example of this, the Lilly Ledbetter Fair Pay Act of 2009<\/a> has a list of major actions on THOMAS, one of which is a \u201cmotion to proceed to consideration of measure withdrawn in Senate\u201d on Jan. 13, 2009. \u00a0The link for information on that motion points to CR S349: a specific page of the Congressional Record. Invoking that link leads to this display:<\/span><\/p>\n


\nThe THOMAS page lists the four items on the particular Congressional Record page, the last of which is the item sought. \u00a0Invoking that item retrieves a default page with the specific text of the motion. \u00a0A link at the head of that text leads to the PDF version of the page, where the Lilly Ledbetter motion appears at the bottom.<\/span><\/strong><\/p>\n
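
The difference between navigating to a destination and summoning up a labelled subdocument can be kept visible in code by giving each use its own function over the same identifier. This is a sketch under stated assumptions: the identifier syntax, the base URL, and the stored text are all made up for illustration and do not describe any real service.

```python
# Hypothetical store of statutory subsections keyed by identifier.
SUBSECTIONS = {
    'example-title/s100/a': 'Placeholder text of subsection (a).',
    'example-title/s100/b': 'Placeholder text of subsection (b).',
}

BASE_URL = 'https://example.org/code/'  # assumed, not a real service


def navigate(identifier):
    # Navigational use: return a destination for a reader to visit.
    document, _, fragment = identifier.rpartition('/')
    return f'{BASE_URL}{document}#{fragment}'


def retrieve(identifier):
    # Container use: return the labelled chunk itself, ready for embedding.
    return SUBSECTIONS[identifier]


# Embedding a subsection in another document needs the text, not a link.
guide = 'Current rule: ' + retrieve('example-title/s100/a')
```

A page-level citation such as the CR S349 link supports only the first function well; the second needs an identifier whose boundaries match the item being retrieved.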

d) <\/span>Thread tag\/associative marker<\/span>. \u00a0\u00a0Some monikers group related documents into threads — aggregations whose internal arrangement is implicitly chronological. \u00a0An insurance company claim number is, in exactly this way, a dual-purpose tool. \u00a0On the one hand, it refers uniquely to a document (a claim form) that you submit after your fender-bender. \u00a0On the other, the insurance company tells you that you must \u201cuse this claim number in all correspondence\u201d \u00a0— that is, \u00a0use it to prospectively tag related documents. \u00a0That creates a labelled group of documents. If we then sort the group chronologically, it becomes a kind of narrative thread. \u00a0<\/span><\/p>\n

In this way, the moniker implies a relationship between the documents without explicitly naming or describing it, as well as being pressed into service as the identifier for one or more documents in the cluster. Regulatory docket numbers function in this manner. That is intentional, because dockets are meant to be gathering places for documents. What is confusing, and important to remember, is that a moniker that uniquely identifies a <\/span>process<\/span> (a regulatory rulemaking) has been bent to identify a <\/span>collection of items<\/span> associated with that process, and neither the association, the collection of items, nor any particular document has been uniquely identified.<\/span><\/p>\n

Another conceptually-related but distinct example of this is the use of \u201ccaptive search\u201d URIs to meet a user\u2019s need to dynamically assemble a set of related documents. For instance, one can retrieve all the environmental law decisions of the Supreme Court at this link:<\/span><\/p>\n

http:\/\/www.law.cornell.edu\/supct\/search\/index.html?query=environment+or+environmental%20or%20EPA<\/span><\/a><\/p>\n

Such URIs embed search terms (\u201cenvironment\u201d, \u201cenvironmental\u201d, \u201cEPA\u201d) and, when used in links, retrieve the set of documents found by searching on those terms. \u00a0Typically, they are used to deal with instability or growth in the underlying corpus of things being searched. They are \u201cautomatically\u201d kept up to date as the collection changes, inasmuch as they just provoke a search of the changed collection that presents results based on the current collection contents. <\/span><\/p>\n

In that way, they are a great help to site designers. Problems can arise, however, if the user imagines that the URI somehow identifies the exact set of items retrieved for any time period other than the moment of retrieval. Precisely because the method is dynamic, the user may or may not retrieve the same document set at a later invocation. \u00a0\u00a0As a low-cost, low-effort alternative to semantic tagging, however, the approach is irresistible. \u00a0<\/span><\/p>\n
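
A short sketch may help show why a captive search URI identifies a query rather than a set of documents. The base address and the query parameter name are taken from the link above; the fingerprint helper and the document ids are hypothetical, introduced only to make the point that the same URI can stand in front of different result sets.

```python
import hashlib
from urllib.parse import urlencode

SEARCH_BASE = 'http://www.law.cornell.edu/supct/search/index.html'


def captive_search_uri(terms):
    # The URI encodes the query only, never the results it will return.
    return SEARCH_BASE + '?' + urlencode({'query': ' or '.join(terms)})


def result_set_fingerprint(document_ids):
    # Hypothetical helper: a stable digest of one retrieved set of ids.
    joined = ','.join(sorted(document_ids))
    return hashlib.sha256(joined.encode('utf-8')).hexdigest()[:12]


uri = captive_search_uri(['environment', 'environmental', 'EPA'])

# Invented result sets for two invocations of the same URI.
monday = ['doc-1', 'doc-2']
friday = ['doc-1', 'doc-2', 'doc-3']
same_set = result_set_fingerprint(monday) == result_set_fingerprint(friday)
```

Under these assumptions the URI is identical on both days while the fingerprints differ, so anything that must refer to the exact set retrieved needs an identifier of its own.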

Some newer systems, \u00a0such as VIAF<\/a>, do allow the ad-hoc construction of URIs for dynamically assembled sets of objects that are then fixed as a permanent group identified by the newly-minted URI. Assuming that an appropriate search could be designed, one might thus construct URIs for any useful group of items found in an authority file, for example a list of all subcommittees of the House Armed Services Committee that have existed up to the present:<\/span><\/p>\n

http:\/\/viaf.org\/viaf\/search?query=local.names+all+%22house%20armed%20services%20committee%20subcommittee%22+and+local.sources+any+%22lc%22<\/span><\/a><\/p>\n

e) <\/span>Process milestone<\/span>. \u00a0The grant of a moniker by an official body can be an acknowledgement that official notice must now be taken, or that some process has begun, ended, or reached some other important stage. \u00a0That is obviously the case with bills, where a single piece of legislation may receive a number of identifiers as it makes its way through the process, culminating in a Public Law number at the time of signing. The existence of such a PL number can be taken as evidence that the bill has been passed into law.<\/span><\/p>\n

f) <\/span>Proxy for provenance<\/span>. \u00a0Again because monikers are often assigned by officials or organizations with special standing, they become proxies for provenance. \u00a0The existence of a bill number is evidence that the Clerk of the House has seen something and acted in a particular way with respect to it; it is valuable evidence in any attempt to establish authority.<\/span><\/p>\n

g) <\/span>Popular names, professional terms of art, and other vernacular uses.<\/span> \u00a0Monikers notably find their way into popular and professional use, some in ways that are quite persistent. \u00a0News media frequently refer to legislation by a popular name created by Congress based on the names of sponsors (the \u201cTaft-Hartley Act\u201d) or by the press itself (\u201cObamacare\u201d). \u00a0They can be politicized (\u201cdeath tax\u201d), or serve as a kind of marketing tool (\u201cUSA-PATRIOT Act\u201d). Some labels and identifiers become very closely associated with the things they label, becoming terms of art in their own right. \u00a0Thus, it is common to refer to a \u201c501(c)(3) nonprofit\u201d or a \u201cSubchapter K\u201d \u00a0partnership. \u00a0Vernacular labels have particular importance for citizens, who often use them as input to search systems. \u00a0At this writing, developers at the Sunlight Foundation have just started an initiative to collect such labels<\/a> through crowdsourcing.<\/span><\/p>\n

We’ll break off here with a musical selection<\/a> or two<\/a>.<\/p>\n

[Next time: identifier granularity and some other characteristics; stresses and strains on identifier design<\/a>.]<\/em><\/p>\n

Some other reading you might find useful:<\/p>\n