
[ Special thanks to Mohammad AL-Asswad, Stevan Gostojic, Sara Frug, Rob Sukol, and of course to Ralph Seep, who contributed many ideas to this.]

This may be the geekiest blog post ever on any subject to do with American legislation, and I would not blame you if you fell asleep before reaching the end. I nearly did.  But references to runs of sections are actually elusive creatures that can be important both to legal research and data modeling, and I suspect that there is more than a little folklore attached to them.  Indeed, we’d developed some of our own, and part of this post is about how we learned better.

A few days ago, we released a feature — based on data from Cato’s Deepbills Project — that permits us to link pending legislation in Congress to the US Code sections it discusses.  It makes a kind of early-warning system for people who want to know what Congress might change sometime soon.  While we were working on it, we were reminded that legislative references to the US Code very often take the form of runs of sections such as “54 USC Secs. 123-456” or the even more open-ended “54 USC 123 et seq”.

Such references can be far from clear, both as to what they contain and as to what they actually mean.  Sometimes, it isn’t clear whether they are meant to be considered as a sort of totality or blob, or (instead) as a list of sections that should be individually associated with whatever the reference to the run is asserting.  We could imagine that data modelers, programmers, and even law librarians might be as confused as we were.  Hence this post.

What’s inside a run?

Long ago, we realized that you cannot calculate the contents of a run of sections using its endpoints and an algorithm — you must query an inventory of all the sections in the Code. Runs of sections are not translatable to ordered lists of sections with integer numbers. If you doubt this, spend a little time clicking around in 12 USC Chapter 13, where you’ll find a chamber of numbering horrors that includes Section 1749bbb-10c and other such delights. And if you were to enumerate all the sections in the run 12 USC 1748-49, you’d come up with more than 2.  A lot more than 2, in fact.  There are easily dozens, maybe a hundred.  Our solution has been to build a database of section numbers encountered while processing each Title of the Code, and when we need to determine which sections fall between the endpoints of a run, we query the database.
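To make the approach concrete, here is a minimal sketch in Python. It assumes an ordered, in-memory inventory with invented helper names and toy data; our real implementation queries the database just described rather than a list.

    # Expand a run of sections against an inventory: the ordered list of
    # section numbers as they appear in a Title. Document order matters,
    # because numbers like "1749bbb-10c" defeat any arithmetic on endpoints.
    def expand_run(inventory: list[str], start: str, end: str) -> list[str]:
        """Return every section between two endpoints, inclusive."""
        i, j = inventory.index(start), inventory.index(end)
        return inventory[i : j + 1]

    # A toy slice of the Title 12 inventory around 12 USC 1748-49:
    title12 = ["1748", "1748a", "1748b", "1748h-1", "1749", "1749a"]
    print(expand_run(title12, "1748", "1749"))
    # -> ['1748', '1748a', '1748b', '1748h-1', '1749']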

“Et seq.” references, which have a lower bound and leave the upper bound to float, are obviously harder to handle – so much so that a separate section describing them appears below.

Blob, or list with unique members?

Another question is whether a run of sections is meant to be a list of sections, each of which is being referenced as an individual, or whether it is meant to be a blob to which the reference applies generally.  A good example of the “blob” approach is found in Table III of the US Code, where Stat. L. pages are mapped to section runs in the Code.  When I talk about a “blob”, I mean that provisions found somewhere in the run of Stat. L. pages map to provisions found somewhere in the run of US Code sections listed in a sort of general way — each blob is a container that has in it the same ideas or provisions, but not in any particular order and not in any way that can be easily mapped to named units within each blob.  It is not the case that stuff found on the first Stat. L. page is found in the first US Code section listed, and so on — the Table is only claiming that the stuff discussed in this run over here is also discussed in that run over there.  Of course, it could also be that the list is meant to be enumerated, and that the reference is meant to apply equally to all individual sections.  I have a suspicion that that is rare (for reasons that may become clear later), but it does happen.

For data modelers, this raises some important questions.  For example, when we encounter a run “in the wild” — say, when we’re extracting stuff from a legislative document — should we note it as a run, with its own URI, or should we enumerate it out somehow and assign some property to each section individually?  Often, there is no way to tell, and our conclusion is that it is safer to create a URI for the run as a blob, and let applications decide whether it is safe, dangerous, or even necessary to assign properties to the individual items within each blob.
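Sketched with rdflib, that conclusion looks something like the following; the namespace and property names are invented for illustration, not our production URI scheme.

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/usc/")  # invented namespace
    g = Graph()

    run = EX["t54/s123-456"]                   # one URI for the run-as-blob
    g.add((run, RDF.type, EX.SectionRun))
    g.add((run, EX.firstSection, EX["t54/s123"]))
    g.add((run, EX.lastSection, EX["t54/s456"]))

    # An application that decides enumeration is safe (or necessary) can
    # then add per-member triples, using an inventory query as above:
    for s in ["123", "123a", "124"]:
        g.add((run, EX.hasMember, EX["t54/s" + s]))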

Ghost stories

One theory — fairly difficult to verify, but plausible based on observation — is that most section-runs are ghosts.  They had a life, once, as a named chunk of an Act or of some other document, but they lost that name when they were codified, or moved into positive law, or otherwise assimilated into the Code.  And in fact it is much easier to dereference something that refers to a precise run of sections than it is to figure out where to find a chunk of stuff named in an Act that has long since been absorbed by the Code.  There may be value in knowing what the original name was — in fact, as a practical matter, we’re wrestling with the question of how far that should be pursued. But we do know that such “ghost stories” are important when you try to unravel the meaning of an et-seq reference.

The joy of et seq.

An “et seq” reference is a reference to a run of sections that states a lower bound but no upper bound. There are practical reasons for this that have to do with updating (see below for an example).  Many, if not most, et-seqs have a lower bound that corresponds to the start of a named unit of the Code. For example, the section named as the lower bound of an et-seq is often the first section in a US Code Chapter, in Titles where the Chapter is the first level that aggregates sections.  In such a setting it is reasonable to assume that the implied upper bound of the et-seq is the last section of the Chapter.
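In code, the naive heuristic reads something like the sketch below (the data structures and names are invented); as the next section explains, the assumption it encodes fails more often than we once believed.

    # Naive et-seq heuristic: if the lower bound opens a Chapter, assume the
    # run extends to the end of that Chapter. "chapters" maps a chapter id
    # to its ordered list of sections; the data here is invented.
    def et_seq_upper_bound(chapters: dict[str, list[str]], lower: str) -> str | None:
        for sections in chapters.values():
            if sections and sections[0] == lower:
                return sections[-1]
        return None  # the lower bound does not open a chapter; no safe inference

    chapters = {"ch13": ["1701", "1702", "1703"], "ch14": ["1751", "1752"]}
    print(et_seq_upper_bound(chapters, "1701"))  # -> '1703'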

For many years, we assumed that was all there was to it.  But while we were working with the Cato Deepbills data, we found a few things that made us question our assumptions. So we asked our friends at the Office of the Law Revision Counsel, the guys who do all the codification, and we got this very helpful reply from Ralph Seep, the Law Revision Counsel (lightly edited here):

The most important principle which informs the use of “et seq.”, as well as many other features in the Code, is that when possible, the Code should somehow be a reflection of statutory structure. Elements that are grouped together in some logical way in a statute should ideally be grouped together in the same way in the Code. As a result, when these elements are referred to as a group (such as a reference to an act or a title of an act) that reference will be relatively easy to carry over into the Code.

In general, the preference is to take such a logical grouping and classify it as a discrete unit in the Code, such as classifying an act containing multiple titles to a new chapter containing multiple subchapters. Therefore, when there is a reference to the act or to one of its titles, the Code can easily refer to that chapter or to the relevant subchapter. Since every section in that unit is being referred to, naturally the reference would be to the first section in the unit and all the ones following it.

It is highly doubtful that “et seq.” would be used at the sub- or supra-section levels not so much because Code practice disfavors it, but simply because statutes tend not to group elements together in that way. Although theoretically possible, of course, it is somewhat unlikely that subsections (d) to the end of the section form some logical unit that would be referred to, thus necessitating a reference to “section 101(d) et seq.”

By the same token, if a run of act sections goes together closely enough to constitute a logical grouping, chances are it would comprise a discrete unit of that act rather than just a random run of sections. So, just as it would be rather unlikely for there to be a reference just to the last 4 sections of an act title (and potentially any future sections later added at the end), there would be no Code equivalent to referring to those last 4 sections using “et seq.”

Here is where the background discussion becomes especially important. In current practice, the preference is to take an act or discrete unit of an act and classify it as a similarly discrete unit of the Code. However, there are numerous instances in the past where this did not happen. For example, prior to its editorial reclassification, the Central Intelligence Agency Act of 1949 was classified to sections 403a to 403w of Title 50, which constituted a run of sections neither starting nor ending subchapter I of chapter 15 of that title. We consistently referred to that act as classified to “50 U.S.C. 403a et seq.”, which means that “et seq.” was being used when the first section was not the first section of the unit and the last section was not the last section of the unit. There are other such instances when an act is classified to a run of sections not constituting a discrete unit of the Code (Pub. L. 92-300, 16 U.S.C. 558a et seq.; act Mar. 2, 1887, ch. 314, 7 U.S.C. 361a et seq.). Such acts have sometimes been referred to by the range of the first to last section (such as 7 U.S.C. 361a to 361i), but such practice has been abandoned in favor of using “et seq.” so that if new sections are added to the end of the act later, the references do not need to be updated.

And there you have it.  For the researcher, the upshot is that all references to runs of sections need to be looked at carefully: first, to determine exactly what their extent is; second, to see what falls within that extent; and third, to decide whether the run is meant to be considered as a totality, or as a list of things to which some assertion or quality applies.

Speaking formally

Data modelers who are working with models that contemplate section-runs may be interested in the collections ontology discussed in this paper and documented further at https://code.google.com/p/collections-ontology/.  Unfortunately, it does not seem to contain a class for an ordered list of unique objects, so we extended it to contain a UniqueList class (we think we might equally well have done an OrderedSet class). As mentioned above, we’re initially modeling all runs as collection objects and leaving it to particular applications to decide whether they should be enumerated as individuals or treated as blobs.
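For the curious, the extension can be expressed in a few triples. In this rdflib sketch we treat http://purl.org/co/ as the collections ontology namespace and mint our own extension namespace; both should be read as assumptions.

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDF, RDFS

    CO = Namespace("http://purl.org/co/")           # collections ontology (assumed)
    EXT = Namespace("http://example.org/co-ext#")   # invented extension namespace

    g = Graph()
    # UniqueList: an ordered list whose members are pairwise distinct,
    # modeled as a subclass of co:List. An OrderedSet subclass of co:Set
    # would capture the same idea from the other direction.
    g.add((EXT.UniqueList, RDF.type, OWL.Class))
    g.add((EXT.UniqueList, RDFS.subClassOf, CO.List))
    print(g.serialize(format="turtle"))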

As is traditional, we end with a musical selection.

 

[Editor’s note: this post was co-authored by Tom Bruce, John Joergensen, Diane Hillmann, and Jon Phipps. References to the “model” here refer to the LII data model for legislative information that is described and published elsewhere. ]

This post lays out some design criteria for metadata that apply to compilations of enacted legislation, and to the tools commonly used to conduct research with them.  Large corpora discussed here include Public Laws, the Statutes at Large, and the United States Code.  This “post-passage” category also takes in signing statements, and — perhaps a surprise to some — a variety of finding aids.  Finding aids receive particular attention because

  • they are critically important to researchers and to the public.
  • they are largely either paper-based, or electronic transcriptions of paper-based aids. They provide an interesting illustration of a major design question: whether legacy data models should simply be re-cast in new technology, or rethought completely.  Our conclusion is that legacy models (especially those designed for consumption by humans) typically embody reductive design decisions that should be rethought.
  • they illustrate particular problems with identifiers. In particular, confusion between volume/page-number citations as identifiers for a whole entity, versus their use as references to a particular page milestone, is a problem. So is alignment with labels or containers that identify granular, structural units like sections or provisions, because such units can occur multiple times within a single page.

Let’s begin with a discussion of signing statements, which might be considered the “first stop” after legislation is passed.

Signing statements

Overarching issues

Existing metadata

Signing statements have been used by many presidents over the years as a way to record their position on new legislation. For most of our history, their use has been rare and noncontroversial. However, during the George W. Bush administration they were used to declare legal positions on the constitutionality of sections of laws being signed.    

Since they had never previously been controversial, there was little interest in collecting or indexing these documents in any systematic manner. With the change in their use, that has changed, and there is now a need to locate these documents easily and quickly, particularly within the context of the legislation to which they are linked.

Currently, Presidential signing statements are collected as part of the Weekly Compilation of Presidential Documents and Daily Compilation of Presidential Documents. These are collected and issued by the White House press secretary, and published by the Office of the Federal Register. As they are not technically required by law to be published, they do not appear in the Federal Register or in Title 3 of the Code of Federal Regulations.

Although they appear in the daily and weekly compilations, they are not marked or categorized in any particular manner. In FD/SYS, the included MODS files carry a subject topic, “bill signings”, marking the document as related to that category of event.  “Bill Signings” is also included in the MODS <category1> tag that exists in presidential documents. That designation, however, is applied to remarks as well as to formal signing statements. In addition, it is unclear whether it has been used with any consistency.  The MODS files for signing statements include no information designating the document as a signing statement, but only as a “PRESDOCU”. The MODS files do, however, have references to the public law to which they refer. They will also have a publication date that will match the date on which the president signed the subject law.

In order to make signing statements findable, the links to relevant legislation already represented in the GPO MODS files should be built into the model, along with the publication date information and a designation of the president issuing the statement.  In addition, however, the categorization of a signing statement as a signing statement needs to be added in the same fashion in which we have categorized other documents, and implemented with consistency. If the study of signing statements continues as an important area of user inquiry, they will need to be identifiable.
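A hedged sketch of what that extraction might look like; the element paths below are assumptions drawn from the description above, not a verified map of GPO’s MODS layout.

    import xml.etree.ElementTree as ET

    NS = {"m": "http://www.loc.gov/mods/v3"}

    def describe(mods_path: str) -> dict:
        root = ET.parse(mods_path).getroot()
        topics = [t.text for t in root.iterfind(".//m:subject/m:topic", NS) if t.text]
        date = root.find(".//m:originInfo/m:dateIssued", NS)
        # Public-law reference: the element holding it is assumed here.
        pub_law = root.find(".//m:relatedItem/m:titleInfo/m:title", NS)
        return {
            "is_bill_signing": "bill signings" in [t.lower() for t in topics],
            "date_issued": date.text if date is not None else None,
            "public_law": pub_law.text if pub_law is not None else None,
        }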

Finally, as with all such documents, there is always a desire to assist the researcher and the public by including evaluation aids.  It is tempting, for example, to indicate whether a statement includes a challenge to the constitutionality or enforceability of a law.  We believe, however, that it would be a mistake to build this into the model.  If interpretive aids of this kind are themselves properly linked to their related legislation, they will be easily found.

We have singled out signing statements because they appeared prominently among use cases we have collected, and in other conversations about the “post-passage” corpora.  In reality, many other presidential documents relate closely to legislative materials before and after passage.  We will consider them in later sections of this document as we encounter them in finding aids.

Forms of enacted Federal legislation

Enacted Federal legislation is published by many groups in many formats, including (among versions published by the legislative branch) Public Laws, the Statutes at Large, and the United States Code.   Privately published editions of the US Code are also common (and indeed prevalent), either in electronic or printed form, and it is likely that their use exceeds that of the officially published versions.

Overarching issues

How do post-passage materials relate to existing systems such as THOMAS, congress.gov, or GovTrack?

First, as to necessity: research needs have no respect for administrative boundaries or data stovepipes. Many researchers will wish to trace the history of a law from the introduction of a bill through to its final resting place in the US Code.  As to means, our model incorporates a series of properties that describe the codification of particular legislative measures (or provisions); they might be applied at the whole-document or subdocument level. That essentially replicates what is found in Tables I, II and III as we describe them below.  This area of the model might, however, require extension in light of more detailed information about the codification process itself.  We are aware, for example, that current finding aids and the data in them make it far easier to find out what happened to a particular provision in a bill (forward tracing) than it is to find out where a particular provision in the US Code came from (reverse tracing), and that the finding aids do not support all common use cases with certainty.

Updating

Virtually every document we have encountered in our survey of legislative corpora becomes “frozen” at some point, either by being finalized, or by being captured as a series of sequential snapshots.  That is not the case with the US Code, which is continually revised as new legislation is passed.  This creates a series of updating problems that involve not only modeling the current state of the Code, but also:

  • tracking new codification decisions;

  • tracking changes in the state of material that has been changed, moved, or repealed;

  • revising and archiving metadata that has been changed or rendered irrelevant by changes in the underlying material;

and so on.

It seems likely to us that there are both engineering and policy decisions involved here. Any legislative data model needs to have hooks that allow connection to more detailed models, maintained by others, that track codification decisions. Most use cases that look at statutes and ask, “what happened to that statute?” or “where did this come from?” will need those features.  The policy question simply involves deciding whether and how to connect to data developed by others (for example, if it were desirable to trace legislation from congress.gov into the US Code).  As to engineering, it may be simpler in the short run to model the finding aids that currently assist users in coping with the print-based stovepipes involved.  That has drawbacks that we describe in some detail later on, but it has the advantage of being relatively simple to do at the level of functionality that the print-based aids currently provide.

Whatever approach is taken, maintenance will be an issue; most automated approaches will require the direct acceptance of data originated by others.  The Office of the Law Revision Counsel is building a system to track not only codified legislative text but to record the decisions taken.  Linking to such a system would extend, at low cost, the capabilities of existing systems in very useful ways, but it is not clear whether OLRC will expose any of this tracking metadata for public use.

Identifiers and identifier granularity

Bills become Public Laws.  Often, they are then chopped into small bits and sprayed over the US Code.   Even the most coherent bill — and many fall far short of that mark — is a bundle of provisions that are related by common concern with a public policy issue  (eg. an “antitrust law”) or by their relationship to a particular constituency (eg. a “farm bill”).   The individual provisions might most properly relate to very different portions of the US Code;  a farm bill could contain provisions related to income tax, land use, environmental regulation, and so on; many will amend existing provisions in the Code. Mapping and recording of the codification decisions involved is thus a major concern for modelers.

The extreme granularity of the changes involved can be seen (eg.) in the Note to 26 USC 1, which contains literally hundreds of entries like the following:

2004—Subsec. (f)(8). Pub. L. 108–311, §§ 101(c), 105, temporarily amended par. (8) generally, substituting provisions relating to elimination of marriage penalty in 15-percent bracket for provisions relating to phaseout of marriage penalty in 15-percent bracket. See Effective and Termination Dates of 2004 Amendments note below.

For our purposes here it is the mapping of the Public Law subsection to a named paragraph in the codified statute that is interesting. It proclaims the need for identifiers at a very fine-grained level.  The XML standard used by the House and Senate for legislation contains mechanisms for markup and identification down to the so-called “subitem” level, which is the lowest level of named container in bills and resolutions (the text in our example is actually at the “subsection” level of the Act).  It seems to us unlikely that the mapping runs consistently between particular levels of the substructure (that is, it seems unlikely that sublevel X in the Public Law always maps to something at sublevel Y of the US Code).  Sanity checking, then, will be difficult.

US Code identifiers

Identifiers within the US Code provide some interestingly dysfunctional examples.  They can usefully be thought of as having three basic types:  “section” identifiers, which (sensibly) identify sections, “subsection” identifiers, which apply to named chunks within a section,  and “supersection” identifiers, which identify aggregations of materials above the section level but below the level of the Title:  subtitles, parts, subparts, chapters, and subchapters.

Official citation takes no notice of supersection identifiers, but many topical references in other materials do. Chapters should get particular attention, because they are often containers for the codified version of an entire Act. Supersection identifiers are confusing and problematic when considered across the entire Code,  because identical levels are labelled differently from Title to Title.  For example, in most, the “Part” level occurs above “Chapter” in the hierarchy, and in some, that order is reversed.  It should also be noted that practically any supersection — no matter how many other levels may exist beneath it in the hierarchy — can have a section as its direct descendant.  There are also “anonymous” supersections that are implied by the existence of table-of-contents subheadings that have no official name; these appear in various places in the Code.

To our way of thinking, this suggests that the use of opaque identifiers for the intermediate supersections is the best approach for unique identification. Path-based accessors that use level-labels such as “subtitle” and “section” are obviously useful, too,  however confusing they might seem when accessors from different titles with different labelling hierarchies are compared side by side.

As to section identifiers, the main problem is that years of accumulated insertions have resulted in an identifier system that appears far from rational.  For example, “1749bbb-10c” is a valid section number in Title 12.  It may nevertheless make sense to use citation as the basis for identifier construction rather than making the identifiers fully opaque.  As to subsection labeling, it is pretty consistent throughout the Code, and can be thought of as an extension to the system of section identifiers.
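A sketch of citation-based identifier construction; the regular expression is our own guess at the shape of USC section numbers (digits, optional letter runs, optional dashed suffixes), chosen so that it happily accepts “1749bbb-10c”.

    import re

    # Digits, then optional letters, then any number of dashed suffixes.
    SECTION = re.compile(r"^\d+[a-z]*(?:-\d+[a-z]*)*$")

    def section_id(title: int, section: str) -> str:
        if not SECTION.match(section):
            raise ValueError(f"does not look like a USC section number: {section}")
        return f"usc/title{title}/section{section}"

    print(section_id(12, "1749bbb-10c"))  # -> usc/title12/section1749bbb-10c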

Public Laws, Statutes at Large,  and the United States Code

Existing metadata

Traditional library approaches to these complex sets of materials have been very simple: they’ve been cataloged as ‘serials’ (open ended, continuing publications), with very little detail. That allows libraries to represent the materials in their catalogs, and to provide a bibliographic record that acts as a hook for check-in data, and is used to track receipt and inventory of individual physical volumes. In the law library context, where few users access these basic resources through a catalog, this approach has been sufficient, efficient and low-maintenance.

However, as this information ‘goes digital’, that strategy breaks down in some predictable ways, many of which we’ve documented elsewhere in this project’s papers; the biggest is that much of the time we would like more detailed information about smaller granules than the “serial” approach contemplates. As we make a fuller transition to digital access of this information, these limited approaches no longer provide even minimal access to this critical material.

Finding aids

There are a good many finding aids that can be used to trace Federal legislation through the codification process, and to follow authority relationships between legislative- and executive-branch materials, such as presidential documents and the Code of Federal Regulations.  All were originally designed for distribution in tabular form, at first on paper, and more recently on Web pages.  In the new environment we imagine, the approach they represent is problematic. It may nevertheless be worthwhile to model the finding aids themselves for use in the short term, as better implementations require significant analysis and administrative coordination.

General problems

Deficiencies of print representations

A look at the Parallel Table of Authorities [PTOA] shows where the problems are likely to be found.   Like all other tabular finding aids that originate in print, it was designed for consumption by human experts capable of fairly sophisticated interpretation of its contents.  It embeds a series of reductive design decisions that trade conciseness against the need for some “unpacking” by the reader.  Conciseness is a virtue in print, but it is at best unnecessary and at worst confusing when the data is to be consumed and processed by machines.  A couple of examples will illustrate:

  • Some PTOA entries map ranges of US Code sections against ranges of CFR Parts, in what appears to be a many-to-many relationship.  It is unlikely that every pair that we could generate by simple combinatorial expansion represents a valid authority relationship. Indeed, as we shall see, the various finding aids differ considerably in the meaning they assign to a “range” of sections and  in the treatment that they intend for them.

  • The table simply states that there is a relationship between each of the two cells in every row of the table, without saying what it is.  The name of the table would lead the reader to believe that the relationship is one of authorization, but in fact other language around the table suggests that there are as many as four different types of relationship possible.  These are not explicitly identified.

To model the finding aid, in this case, would be to perpetuate a less-than-accurate representation of the data.  As a practical matter of software project planning and management, it might be worth doing so anyway, in order to more quickly provide users with a semi-automated, electronic version of something familiar and useful. But that is not the best we could do.  Most of the finding aids associated with Federal statutes have similar re-modeling issues, and should be reconceived for the Semantic Web environment in order to achieve better results.

Identifier granularity and alignment

Most of the finding aids make use of granular references; in the case of Public Laws, these are often at the section level or below, and in the case of the US Code they are often to named subsections.  The granularity of references may or may not be reflected in the granularity of the structural XML markup of any particular edition of those resources.

The Statutes at Large use a page-based citation system that creates two interesting modeling issues.  First, on its own, a page-based citation is not a unique identifier for a statute in Stat. L., because more than one may appear on one page.   Second, it was not ever thus.  Stat. L. has used three different numbering schemes at various times, each containing ambiguities.  These would be extraordinarily difficult to resolve under any circumstances, and particularly so given the demands of codification we describe later in the section on the Table III finding aid. Taking these two things together, it seems that there is no way to accurately create a pinpoint link between a provision of an Act in its Public Law format and a specific location in the Statutes at Large; the finest resolution possible is at page granularity.

It would thus seem that the most sensible approach would be to use a somewhat “loose and floppy” relationship like “isPublishedAt” to describe the relationship involved, since the information available from the Table does not really support pinpoint accuracy.   That is unfortunate, in that there are important use cases that need such links.  For example, statutes are frequently described in judicial opinions using citations that refer only to the Statutes at Large, sometimes because the case in question predates the US Code and no other reference can exist, and sometimes because the writer has omitted other citation.  It is effectively impossible to construct a pinpoint link if the cite contains a subsection reference; one has to cite to the nearest page, relying on the reader to find the relevant statute on the page somewhere.  It would be equally difficult to trace through a Stat.L. citation to the relevant provision of the US Code in situations where the USC citation has been omitted.
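In triples, the loose relationship amounts to one deliberately weak assertion; the URI shapes here are invented for illustration.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")  # invented
    g = Graph()
    # We assert only that the provision is published somewhere on the page:
    g.add((EX["pl/111-2/sec1"], EX.isPublishedAt, EX["statl/vol123/page5"]))
    # No claim that section 1 begins on page 5, or appears there alone --
    # only that page 5 is where a reader should start looking.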

In short, identifiers in this part of the legislative jungle have two problems: first, they sometimes do not exist at a sufficiently granular resolution in the relevant XML versions, and second, granular identifiers do not resolve or map well to materials whose citation has traditionally been based on print volume and page numbers.

Identifiers for Presidential documents: general characteristics

Some of the finding aids we describe below provide mappings between Presidential documents and the codified statutes in the US Code.  Identifiers for Presidential documents are assigned by the Office of the Federal Register, and are typically accession numbers.  It is worth noting that OFR provides a number of finding aids and subject-matter descriptions of Presidential documents, though these are beyond our scope here.

As to GPO, it appears at first blush that the MODS metadata for the US Code as found in FD/SYS does not reflect associations with Executive Orders, although they are vaguely modeled in the MODS files associated with the Executive Orders themselves.  There would be some virtue in being able to find information in both directions.  That is especially true in situations where the state of the law cannot be fully understood without referring to both the Code and related Executive Orders simultaneously.  For example, 4 USC 1, in its most current version, claims that there are 48 stars on the flag of the United States; it is only possible to find out where the other two came from by referencing the Executive Orders that accompanied statehood for Alaska and Hawai’i.

The Table of Popular Names (TOPN)

For the general public, the TOPN is probably the single most useful finding aid for Federal legislation. That is because it bridges the gap between popular accounts of legislation — for example, in the news media — and the codified collections of laws that are in effect.  Where, exactly, do we find the Lilly Ledbetter Fair Pay Act in the modern statute book?  The answer to that question isn’t obvious.

Broadly — very broadly — there are two ways in which an Act may be codified.  First, it could be moved into the Code wholesale, typically as a new Chapter containing numbered sections that reflect the section divisions in the Act.  Second, it could be disassembled into a bag of provisions and scattered all over the Code, with each section placed in a region of the Code dictated by its subject matter.   In such cases, the notes to the Code section that describes the “Short Title” of the Act generally contain a roadmap of what has been done with the rest of it.   That also happens when the Act contains language that consists entirely of instructions for amending existing statutes already codified.

For example, the TOPN entry for the Lilly Ledbetter Fair Pay Act looks like this:

Lilly Ledbetter Fair Pay Act of 2009

Pub. L. 111-2, Jan. 29, 2009, 123 Stat. 5

Short title, see 42 U.S.C. 2000a note

It maps the identifier for the Public Law version of the Act to the Statutes at Large, with a page reference to the Stat. page on which the Act begins.  It also maps to the “Short Title” section of the USC, whose note contains information about what has been done with the Act.

Short Title of 2009 Amendment

Pub. L. 111–2, § 1, Jan. 29, 2009, 123 Stat. 5, provided that: “This Act [amending sections 2000e–5 and 2000e–16 of this title and sections 626, 633a, and 794a of Title 29, Labor, and enacting provisions set out as notes under section 2000e–5 of this title] may be cited as the ‘Lilly Ledbetter Fair Pay Act of 2009’.”

This entry makes an important point about codified legislation.  While it is natural to believe that codification consists of taking something that contains entirely new legislative language, breaking it into pieces, and plugging the pieces into the Code (or substituting them for old ones), that is not exactly what happens much of the time.  Any Act could be, and often is, a laundry list of directives to amend existing codified statutes in some way or other.   In such cases, the text of the Act is not incorporated into the Code itself, but into the Notes, in a manner similar to the example just given.  That is a subtle difference, but an important one, as we shall see in the discussion of Table III below.  It introduces an extra layer of mapping into the process, in a way that is partially obscured by the fact that inclusion is in the Notes rather than in the text of the Code.  One result of this is that, in general, it is easier to look at a current provision and find out where it came from than it is to look at an historical provision and find out what happened to it.

From a data modeler’s perspective, the TOPN is useful but not strictly necessary; an equivalent finding aid can be constructed by aggregating data from other tables, or by simply referring to the short titles and popular names given in the text of the Act itself. The relationships modeled by TOPN aggregate information:

  • from the Acts or bills themselves (House and Senate identifiers for bills, and the name of the Act as it’s found in either the bill or (better) in the Public Law version);

  • from Table III, which describes where the Public Law is codified; and

  • from Table I, which models an extra “change of address” that is applied in cases where codified legislation has been reorganized for passage into positive law.

US Code Table I

Table I describes the treatment of individual sections in Titles that have been revised for enactment as positive law.  The Table is a straightforward mapping of “old” section numbers in a Title to “new” section numbers that apply after the Title was made into positive law.  As such, Table I entries also have a temporal dimension — the mappings need only be applied when tracing a citation to the Code as it existed before the date of positive law enactment to a location in the Code after that date.

A relational-database expert obsessed with normalization would say that Table I is, then, really two tables — one that maps old sections to new sections within a Title, and a second, implied table that says whether or not each of the 51 Titles has been enacted into positive law, and if so, when.  The researcher wanting to trace a particular reference would follow this heuristic:

  • Does my reference fall within a positive-law Title?

  • If so, does my reference precede the date of enactment into positive law?

  • If so, what is the number of the “new” section?

Thus, the model will need to reflect properties of the Title itself (“enactedAsPositiveLaw”) and of the mapping relationship of old to new (“hasPositiveLawSection”).
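A sketch of that heuristic in code. The two structures mirror the “two tables” described above; all of the data shown is illustrative.

    from datetime import date

    ENACTED = {46: date(2006, 10, 6)}         # Title -> positive-law enactment date
    TABLE_I = {46: {"1295b": "51103"}}        # Title -> old section -> new section (invented)

    def trace(title: int, section: str, cited_as_of: date) -> str:
        """Apply Table I only to citations that predate positive-law enactment."""
        enacted = ENACTED.get(title)
        if enacted is None or cited_as_of >= enacted:
            return section                              # no remapping needed
        return TABLE_I[title].get(section, section)     # old -> new

    print(trace(46, "1295b", date(1999, 1, 1)))  # -> '51103'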

US Code Table II

The United States Code was preceded by an earlier attempt at regularized organization, the Revised Statutes of 1878.  Citations to the Revised Statutes are to sequentially-numbered Sections, with “Rev.Stat.” as the series indicator.  Table II provides a map between Rev. Stat. cites and sections of the US Code, along with a number of status indicators; the two most important (and common) of these indicate that a statute has been repealed, or that Table I needs to be applied because the classification shown was done prior to positive-law enactment of the Title.

Unlike other finding aids we describe, where the meaning of mappings between ranges and lists of things can be both combinatorial and ambiguous, Table II appears straightforward. A list or range of items in the Rev. Stat. columns can be mapped one-to-one to the corresponding list or range in the USC column.  The first element in the list or range in Rev. Stat. maps to the first element in the list in USC, the second to the second, and so on.  Simple reciprocal relationships should obtain.
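Because the correspondence is positional and reciprocal, the mapping is just a zip; the entries below are invented for illustration.

    def map_table_ii(rev_stat: list[str], usc: list[str]) -> dict[str, str]:
        """Pair the i-th Rev. Stat. item with the i-th USC item."""
        if len(rev_stat) != len(usc):
            raise ValueError("Table II lists should pair off exactly")
        return dict(zip(rev_stat, usc))

    pairs = map_table_ii(["5412", "5413", "5414"], ["12:51", "12:52", "12:53"])
    reverse = {v: k for k, v in pairs.items()}  # the reciprocal relationship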

That is particularly important in light of the relationship between Table II and Table III.  In Table III, references for all statutes passed before 1874 are to the Revised Statutes, and not to the US Code.  So, for those statutes, in order to determine where they may still exist as part of the US Code, reference needs to be made first to Table III, to obtain the R.S. section where the statute was first encoded, and then to Table II, to determine where that R.S. section was re-encoded in the US Code.  Without the straightforward, one-to-one relationship between the R.S. and the US Code expressed in Table II, the connection between pre-1874 statutes and current US Code sections would not be possible.

US Code Table III

Table III, which maps individual provisions within Public Laws to pages in the Statutes at Large and to sections of the US Code, exhibits a number of interesting problems.  Consider how one such mapping, the entry for PL 110-108, appears in the LRC’s online tool.

In this case, we’re mapping the individual provisions of PL 110-108 (readable at http://www.gpo.gov/fdsys/pkg/PLAW-110publ108/html/PLAW-110publ108.htm) to a range of pages in the Statutes at Large and to sections in the US Code (and their notes). The GPO version helpfully contains markers for the Stat. L. page breaks.  Some noteworthy observations:

  • The Public Law needs section-level identifiers. Notes sections within the USC need their own identifiers, as do pages within the Statutes at Large.

  • Since the Stat. L. citation for the Act always goes to the first page of the Act as it appears in Stat.L., there is ambiguity between

    • 121 Stat. 1024, the citation/identifier indicating the whole Act for purposes of external citation, and

    • 121 Stat. 1024, the single-page reference that describes where Section 1 of the Act can be found (and for that matter, supposedly, some of Sections 2-6 as well)

  • For some time periods, chapter numbers would disambiguate individual laws where more than one statute appears on a single page, although as we have seen, chapter numbers have uniqueness problems of their own. Chapter numbers play no role in this example, as they were not used after 1957.

  • The Act is classified to the notes in the relevant USC sections.

    • In the case of section 1 of the Act, the notes simply state the name of the Act.

    • In the case of section 151, the entire text of the legislation appears in the notes for the Act.  It would appear that it is done this way because the legislation’s provisions amount to a series of instructions for amending existing statutes, and thus can’t be codified per se.  Rather, they are a description of what should be done to change things that have been codified already.

  • GPO’s MODS file is evidently created by machine extraction of USC citations, because it incorrectly identifies the Act as modifying 26 USC 4251. It’s possible, though,  that the presence of a USC section in the MODS file might simply mean “found at the scene of the crime by our parser” rather than “changed by the Act”.  The relationship is unclear, and may be impossible to express clearly in XML.

  • GPO’s MODS file for the Act treats the mapping implied by the second line of the example pretty loosely, describing small collections of US Code and Stat.L. pages associated with the Act, but not describing any particular relationship between the items in each collection or between collections.  This is, again, a place where XML falls short of what is possible in an RDF-based, machine-readable model.

The second line of the table entry is the most interesting.  At first glance, it appears to describe a many-to-many relationship between a range of sections in the Act and a range of pages in the Statutes at Large.  But it seems improbable that such a relationship would actually describe anything useful, and a quick side-by-side look at the Act and the Statute shows that such an interpretation is incorrect.  The actual arrangement of page breaks in Stat. L. would indicate that the mapping should be otherwise:

  • Section 2 appears in its entirety on 121 Stat 1024.

  • Section 3 spans the break between 1024 and 1025.

  • Section 4 spans 1025 and 1026.

  • Sections 5 and 6 appear in their entirety on 1026.

Why is that?  The simplest explanation is that the entries in the table — numbers separated by a dash — do not represent lists of individual sections. Instead, they represent clusters of sections that are related to each other as clusters.  They seem to be saying, “somewhere in this clump of legislative language, you’ll find things that relate to things in this other clump of legislative language, and the clumps span multiple sections or provisions, possibly ordered differently in each document”.

Looking at the text itself — which is a series of detailed, interrelated amending instructions — shows that indeed it would be a horrible (and likely very confusing) task to pick the provisions apart into a fully granular mapping, leaving “cluster-to-cluster” mapping as the only viable strategy for describing the relationship between the two texts.
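In a model, that means two distinct link types rather than one; the structures and names below are ours, for illustration only.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProvisionLink:
        """This provision maps to that page: a claim safe to enumerate."""
        pl_section: str
        stat_page: str

    @dataclass(frozen=True)
    class ClusterLink:
        """This clump relates to that clump, somewhere inside each."""
        pl_sections: tuple[str, ...]
        stat_pages: tuple[str, ...]

    # Section 1 is a clean provision-to-page case; sections 2-6 are not.
    links = [
        ProvisionLink("sec1", "121 Stat. 1024"),
        ClusterLink(("sec2", "sec3", "sec4", "sec5", "sec6"),
                    ("121 Stat. 1024", "121 Stat. 1025", "121 Stat. 1026")),
    ]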

A detailed model of Table III would then require:

  • clarifying the distinction between a page reference to the first page of an Act as it appears in Stat.L. and the citation of the statute as a whole.

  • describing each section or subsection (granule) within the Public Law as one that is either

    • new statutory language, or

    • a set of instructions for amending existing language

  • describing each target in the USC as either

    • an actual statute, or

    • notes to the statute. It is worth remarking that, in any of the finding aids, the fact that something has been classified to the notes provides a clue as to what that thing is and what the nature of the classified relationship might be. This may indicate a need for subproperties that would be accommodated in some future extension.

  • distinguishing between relationships that involve re-publication (as between Public Laws and Statutes at Large) from those that involve restatement or codification (as between either of those and the US Code)

  • using different properties to describe provision-to-provision and cluster-to-cluster relationships

Taken together, these requirements would form an approach that would more accurately model the relationships the original Table was meant to model.  In some sense this is an interpretive act — any Table that records codification decisions does, after all, record a set of interpretations, and so will its model.  But in this case the interpretation is an official one, entrusted to the Law Revision Counsel and in any case practically unavoidable.

US Code Table IV

Table IV “lists the Executive Orders that implement general and permanent law as contained in the United States Code”.  Executive Orders are instructions from the President mandating an action, reorganization, or policy change in some part of the executive branch.  They are promulgated pursuant to statutory authority and, as lawful orders of the chief executive, have the force of law.  They are published in the Federal Register and appear in the annual Compilation of Presidential Documents.  They are sequentially numbered, but are also identified by date of signing, title, and the authoring president.  All four of these identifying attributes are specified in the GPO MODS files which accompany these documents in FD/SYS.  In addition, there exists a reference to the volume and issue number of the Weekly Compilation in which the order appears.  Finally, the MODS files typically include a reference to the enabling law as well.

The Table shows that:

  • Executive Orders have identifiers, apparently accession numbers that run from the beginning of time.

  • Nearly all refer to the “notes” attached to sections of the USC, since (as the description says) Executive Orders are typically implementation instructions independent of the language of the statute itself.

References to the notes have special features worth remarking.  Often, the mapping is given to the note preceding (“nt. prec.”) a particular section.  That distinctive language is rooted in the way that the LRC conceives of the Code’s structure.  In the minds of the LRC, the Code consists of Titles that are divided into sections.  Intermediate levels of aggregation — subtitles, parts, subparts, chapters, and subchapters — are convenient fictions used to organize the material in a manner similar to the tabs found in a card catalog.  Thus, the “note preceding” a section is most often a note that is attached to the chapter of which the section is a part (chapters are typically, but not always, the level that aggregates sections, and often correspond to an Act as a whole).  As modelers, we’re presented with a choice between fictions: either we join LRC in pretending that the intermediate levels of aggregation don’t exist, or we make use of them.  The latter presents other problems with representing parent-child relationships in the structure, but fortunately that is a concern for XML markup designers and not so much for us.

It would seem that the best approach might be to model both sets of relationships: a hierarchical structure based on aggregations, and a sequential structure suggested by the “insertion model” just described.  In terms of the model, this is just a matter of making sure that identifiers are in place that will facilitate both approaches.  The main issues raised by this approach have to do with XML markup and encoding; as with other corpora we have encountered (eg. the Congressional Record) user needs demand, and the model can accommodate, far more than the current publicly available XML encoding of the document will support.

Thus, we would end up with:

  • a set of unique identifiers for sections, based on title and section numbers and thus reflecting current citation practice;

  • a set of sub-section identifiers that extend section identifiers in a way that is based on nested subsection labeling;

  • a set of super-section identifiers that is based on human-readable hierarchy, represented as paths, eg. “/uscode/title42/subtitle1/part3/subpart5/chapter7/subchapterA”;

  • a set of completely opaque identifiers for both section and supersection levels.  There is less need for this at the subsection level, but any such system could easily be extended;

  • parent-child relationships between

    • subsections and sections

    • sections and supersections

    • supersections and containing supersections

  • next-previous relationships between sections.  These should take no account of supersection boundaries.

As we’ve said in other contexts, it is worthwhile to remember that nothing limits us to a single identifier for any object.
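A sketch pulling the whole list together; the class and helper names are ours, and the opaque identifiers shown are minted on the fly purely for illustration (real ones would be minted once and stored).

    import uuid

    class Unit:
        """A Code unit carrying both a path-based and an opaque identifier."""
        def __init__(self, path, parent=None):
            self.path = path                 # eg. "/uscode/title42/chapter7"
            self.opaque = uuid.uuid4().hex   # illustrative; mint once and store
            self.parent = parent             # parent-child relationships
            self.children = []
            if parent:
                parent.children.append(self)

    title42 = Unit("/uscode/title42")
    ch7 = Unit("/uscode/title42/chapter7", parent=title42)
    s401 = Unit("/uscode/title42/chapter7/section401", parent=ch7)
    s401a = Unit("/uscode/title42/chapter7/section401/a", parent=s401)  # subsection

    def sections_in_order(root):
        """Flatten sections in document order; next/previous relationships
        are adjacency in this list, ignoring supersection boundaries."""
        out = []
        for child in root.children:
            if child.path.rsplit("/", 1)[-1].startswith("section"):
                out.append(child)
            out.extend(sections_in_order(child))
        return out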

US Code Table V

Table V maps Presidential proclamations to the US Code.  Proclamations differ from Executive Orders in that they do not “legislate” as such.  Rather, they are issued to commemorate a significant event, or other similar occasion.  Like Executive Orders, they are published in the Federal Register, and appear in the Compilation of Presidential Documents. Like Executive Orders, they are sequentially numbered (without reference to year, president, etc.), and are also identified by date, title, issuing president, and the volume and issue number of the Weekly Compilation.  All these identifiers are typically present in the GPO MODS files in FD/SYS.

Before 1950 or so, the vast majority of proclamations established national monuments. More recently, topics as diverse as the maintenance of the Strategic Petroleum Reserve, tariff schedules, and the celebration of Armed Forces Day show up frequently.  As with Executive Orders and Table IV, the proclamations have accession numbers, and the vast majority of references are to notes attached to the Code and not the Code itself.

US Code Table VI

Table VI maps reorganization plans to the US Code.  Reorganization plans are essentially executive orders that describe major alterations to executive-branch agencies and organization, though they do not carry executive-order identifiers.  For example, Reorganization Plan 3 of 1970 establishes the Environmental Protection Agency and expands the structure of the National Oceanographic and Atmospheric Administration.  Generally they carry citations to the Statutes at Large and to the Federal Register (the FR cite does not appear in Table VI).   While no concise identifier exists for them in and of themselves, it appears that they could be identified by a year-number combination (eg. “RP-1970-3”).  These associations can readily be modeled by associating an identifier for the plan itself with the page references, through one or more “isPublishedAt” relationships.

The Parallel Table of Authorities (PTOA)

The Parallel Table of Authorities and Rules describes relationships between statutes in the US Code and the CFR Parts that they authorize. For the most part, the PTOA maps ranges of sections in the US Code to lists of Parts in the Code of Federal Regulations. It has limitations, described by GPO as follows:

Entries in the table are taken directly from the rulemaking authority citation provided by Federal agencies in their regulations. Federal agencies are responsible for keeping these citations current and accurate. Because Federal agencies sometimes present these citations in an inconsistent manner, the table cannot be considered all-inclusive. The portion of the table listing the United States Code citations is the most comprehensive, as these citations are entered into the table whenever they are given in the authority citations provided by the agencies. United States Statutes at Large and public law citations are carried in the table only when there are no corresponding United States Code citations given.

The suggestions made here, then, are observations about a critically important finding aid, strongly related to legislative material, that is in need of some help.  Thinking about the PTOA and the various ways in which modeling techniques such as the ones we recommend might improve it provides an interesting overview of the problems of legislative finding aids in general.

Richards and Bruce have written extensively about its organization and improvement.  They note four major areas to address:

  • Ambiguity in the description of the relationships themselves.  The Table supposedly models four different types of relationship: express authorization, implied authorization, interpretation, and application.  These are not distinguished in the PTOA entries.

  • Ambiguity in relationship targeting.  Entries on both sides of the table are typically given as ranges or lists, implying many-to-many relationships that can be combinatorially expanded.  It is not clear whether, in fact, all the sections of the US Code that could be enumerated from a range on the left side of the table would relate to particular Parts of the CFR enumerated from the lists on the right side of the table.  It seems unlikely.

  • Granularity problems related to citation of the CFR materials by Part.  In reality, the authorizing relationship would typically run from a statute to a particular section of the CFR, but the targeting in the PTOA is to the Part containing that section. It is likely that this is not a problem with granularity so much as it is an informed design decision driven by problems with the volatility of section-level identifiers as compared to printed finding aids.  Sections of the CFR come and go with some frequency, often moving around within an individual Part. Parts change far less frequently.  In print, where updating is difficult and withdrawal of stale material even more so, identifier stability is a much bigger concern.  It is possible that a digital resource could track things much more closely.

  • Directionality and reciprocity.  It is not clear which of the four possible relationships between entries are reciprocal and which are strictly directional, nor is the Table necessarily intended to be used bidirectionally.

Unfortunately, improvement is unlikely, as it would require the collection of improved information from each of the hundreds of agencies involved. Nevertheless, a simplified model can provide at least some useful information.  The LII currently models the PTOA as a single relationship between individual pairs of identifiers, asserting that each pair in a combinatorial expansion of entries on each side of the table has some such relationship.  That is undoubtedly imprecise, but it is as good as anything currently available and far better than nothing.
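In sketch form, the LII treatment is a combinatorial expansion of each row under a single, deliberately vague property; the row data and the property name are invented for illustration.

    from itertools import product

    def expand_row(usc_sections: list[str], cfr_parts: list[str]):
        # Every pair gets the same weak assertion, precisely because the
        # Table does not say which of the four relationships holds.
        return [(s, "bearsSomeAuthorityRelation", p)
                for s, p in product(usc_sections, cfr_parts)]

    triples = expand_row(["7 USC 136", "7 USC 136a"],
                         ["40 CFR 152", "40 CFR 156", "40 CFR 158"])
    print(len(triples))  # 6 pairs from a 2x3 row; some may well be spurious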

As always, we close with a musical selection.

References


Other papers

  • Richards, Robert, and Thomas Bruce, “Adapting Specialized Legal Material to the Digital Environment: The Code of Federal Regulations Parallel Table of Authorities and Rules”, in ICAIL ‘11: Proceedings of the 13th International Conference on Artificial Intelligence and Law. A slide deck based on the presentation is at http://liicr.nl/M7QyTG.


 [..] events are primarily linguistic or cognitive in nature. That is, the world does not really contain events. Rather, events are the way by which agents classify certain useful and relevant patterns of change.

Allen and Ferguson, “Actions and Events in Interval Temporal Logic”

What’s an event?

Legislative events — things that take place as part of the legislative process — seem straightforward to define.  They are those things that occur in the legislature:  meetings, debates, parliamentary maneuvers, and so on.   But let’s look at one of the more important words we associate with legislative events:   “vote”.

As a noun, it has two meanings:

  • an occasion on which people announce their agreement or disagreement with some proposition;

  • the documentary record of that occasion, expressed as a tally of yeas and nays.

That duality — what happens, versus the record of what happens — creates confusion.  We often talk about the documentary record as if it were the process, and vice versa.  That creates problems when a community that is primarily concerned with legislative process — consumers of legislative information like staffers on Capitol Hill, members of Congress, and others who work with the process itself — talks to information architects who are primarily concerned with the documentary record.  Subtle differences in understanding about what data models represent  can lead to real confusion about the capabilities of information retrieval systems. That is particularly true during conversations about design and evaluation.

Here, we have tried to be explicit about when we are speaking about what.  Let’s begin with a discussion of modeling problems that pertain to events-as-occasions.  A second section treats identifiers for events and how they might be collected and dereferenced.  Finally, we’ll look at the incorporation of events into a model of legislative process.

Events as occasions: modeling problems

Useful work on events is found in strange places that, after a moment’s thought, turn out not to be so strange after all.  Much of what follows was derived from the event ontology developed at the Centre for Digital Music at Queen Mary, University of London — musicians spend a lot of time thinking about things that are embedded in a time-stream, so it’s not surprising that their events model is both detailed and useful.  In their world, and ours, events are things that occur at a time and place.  They might have duration, but they might also represent an action or change of state for which duration is irrelevant.  For example, you could describe the introduction of a bill as something that takes place during a measurable interval that extends over some milliseconds from the time an envelope leaves the hand of the Member introducing the bill until that envelope hits the bottom of the hopper.  But that would be silly; some events are simply process milestones or instants that have no duration we need worry about.  Most events have participants. Some groups of participants are recurring or somehow formalized; it makes sense to tie events to our models for people and organizations.  Other participant groups may be ad hoc groups defined solely in terms of the event (“the people waiting for the 3 o’clock bus”).  Finally, things are often products of events: notes produced by a musician playing an instrument, or a pie baked as part of a contest, or a legislative amendment produced in the course of a committee meeting.

Duration and timestamps

Events may have duration, or they may not.  “Bill introduction” and “adjournment” are in some sense physical processes that take place in the real world — a piece of paper is placed in a hopper, or a gavel is struck and people gather up their papers and leave the room.  Usually, though, we think of them as instantaneous abstractions that refer to a point in a process.  By the same token, events may have a duration that is defined with an uncommon use of common language.  A “legislative day” is an example of such a thing: it extends between one adjournment of the Senate and the next, which may occupy days or even months in real time.

Participants and their roles

We’ve  discussed people and organizations in a separate blog post.  Participation in events raises a few other questions about our models for people.  In particular, we would stress that people can occupy particular roles with respect to an event, and that those roles may be quite separate from the role that the person occupies within the sponsoring or convening organization.  For example, two of the authors recently attended a workshop sponsored by the Committee on House Administration; the convenor was the Technology Policy Director for that organization, and the Committee Chairman was one of the speakers — but he played no leadership role in the workshop itself.  Thus, it is necessary to have a set of role properties that describe the individual as that person appears in the context of particular events, perhaps in a way that is different from other roles that that person might play with respect to groups or organizations that are somehow associated with the event.
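
One way to capture that, sketched with invented property names: state the person’s standing organizational role separately from a participation node that carries the role played at the event itself.

    # Sketch: event-specific roles, kept distinct from organizational roles.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/legis/")
    g = Graph()
    workshop = EX["event/tech-workshop-2012"]
    person = EX["person/committee-chairman"]

    # The person's standing role within the organization...
    g.add((person, EX.holdsRole, EX["role/chairman-house-administration"]))

    # ...and a separate node describing the role played at this one event.
    participation = EX["participation/tech-workshop-2012/speaker-1"]
    g.add((participation, RDF.type, EX.Participation))
    g.add((participation, EX.inEvent, workshop))
    g.add((participation, EX.agent, person))
    g.add((participation, EX.eventRole, EX["role/speaker"]))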

Location and location defaults

Many events take place somewhere in real space, so event objects need to have attributes that tell us where the event occurs.   As with other data about places, we might choose to model these with geographic coordinates, locations drawn from geodata ontologies, or both.   But neither of those systems deals particularly well with things like office addresses (“1313 Longworth House Office Building”), which provide location information for the kinds of events we associate with legislative process.  Those tend to be expressed as office locations or postal addresses.  Examples of purpose-built systems for postal addresses include the Universal Postal Union S42 standard, the vCard ontology, and the W3C PIM standard.  Of these, the vCard standard seems to present the best balance of detail and workability, and is incorporated into the draft W3C organization ontology ( http://dvcs.w3.org/hg/gld/raw-file/default/org/index.html ).

Often, location information is not explicitly stated, but implicit in the nature of the event itself. “Meeting of the House Ways and Means Committee”, for example, embeds a reliable default assumption about where the meeting is to take place, because the meetings always take place in the hearing room that belongs to the committee.  It makes sense to model default assumptions about locations as a property of the organization (e.g. “hasDefaultEventLocation”) rather than of any particular event.  When information about the event itself is incomplete or missing, a location can be inferred via the organizational sponsorship of the event.
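
A sketch of that inference; “hasDefaultEventLocation” follows the property suggested above, and everything else is invented:

    # Fall back to the organization's default location when the event is silent.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/legis/")
    g = Graph()
    committee = EX["org/house-ways-and-means"]
    hearing = EX["event/wm-hearing-42"]
    g.add((committee, EX.hasDefaultEventLocation, EX["place/committee-hearing-room"]))
    g.add((hearing, EX.convenedBy, committee))

    def location_for(event):
        loc = g.value(event, EX.atLocation)      # explicit location, if stated
        if loc is None:
            org = g.value(event, EX.convenedBy)  # otherwise look to the sponsor
            if org is not None:
                loc = g.value(org, EX.hasDefaultEventLocation)
        return loc

    print(location_for(hearing))  # -> .../place/committee-hearing-room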

Events that collect other events; identifier design

Some events are primarily interesting as collections of other events — for example, a “session” of Congress, which might be seen as a collection of various occurrences on the floor of the legislative chamber, committee meetings, and so on.  Moreover, we might want the same event to be visible in very different collections — for example, a particular committee meeting might be part of a calendar, part of a history of meetings of that particular committee, or part of a collection of meetings that different committees have had with respect to a particular bill.

That has implications for identifier design.  As we have in other discussions, we would emphasize here the importance of distinguishing between identifiers (URIs) that provide unique, dereferenceable identification of an object (where that object may itself be a collection of other objects), and alternative URIs used solely for convenience of access or for setting up alternative groupings of things.  Event identifiers need to be short and opaque;  identifiers for collected events can have elaborate (and varied) semantics associated with path elements in the URI.   Here are some illustrative examples:

  • http://congress.gov/congresses/101/sessions – All sessions of the 101st Congress
  • http://house.gov/congresses/101/sessions/2011/ – First session of the 101st Congress.  We believe that the use of the year is more helpful than misleading.
  • http://congress.gov/congresses/101/sessions/2012/events/2012-04-23/votes – All votes taken on 2012-04-23, House or Senate.
  • http://congress.gov/congresses/101/sessions/2012/events/2012-04-23/house/votes – A collection identifier that mirrors the organization of the Congressional Record (though other orderings of the hierarchy would also be sensible).
  • http://congress.gov/congresses/101/house/sessions/2012/votes/[vote-number] – An individual identifier for a vote (House roll call vote numbering runs with the session).
  • http://congress.gov/congresses/101/house/committees/events/hearings – All hearings before House committees during the 101st Congress.

Even this limited set of examples shows that consistency, on one hand,  and thorough coverage of imaginable use cases, on the other, will require a lot of painstaking work if they are to be achieved simultaneously.  Different — and possibly inconsistent — orderings of the collection hierarchy make sense for different purposes.  For example, should the committee information come first in the path hierarchy, or the type of committee event?  For some users, ../events/hearings/committees/Judiciary makes more sense than ../committees/Judiciary/events/hearings , and so on.  Which of these pathways will be “real” unique identifiers, and which should just represent access pathways to the information?
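
One way to honor that distinction in practice is to let any number of semantically rich access paths resolve to a single short, opaque canonical URI. A minimal sketch, with entirely hypothetical paths and identifiers:

    # Opaque canonical identifiers for individual events...
    CANONICAL_BASE = "http://congress.gov/events/"
    events = {
        "e4f9c2": {"type": "vote", "chamber": "house", "date": "2012-04-23", "number": 183},
    }

    # ...and any number of alternative access pathways pointing at them.
    access_paths = {
        "/congresses/101/sessions/2012/events/2012-04-23/house/votes/183": "e4f9c2",
        "/congresses/101/house/sessions/2012/votes/183": "e4f9c2",
        "/events/votes/house/2012-04-23/183": "e4f9c2",
    }

    def resolve(path):
        """Dereference any access path to the one canonical event URI."""
        return CANONICAL_BASE + access_paths[path]

    print(resolve("/congresses/101/house/sessions/2012/votes/183"))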

Documentary products

Events produce things.  Most interesting to us, legislative events tend to produce documents that were either themselves the “subject” of the event (as in floor debate over a particular bill) or its result (as when a committee markup session produces a new version of a bill).

The role of the document (or the more abstract notion of a “bill” or “resolution” of which it is the expression) as a value for an “aboutness” property is problematic.  As we have remarked elsewhere, bills often have multiple provisions on widely different topics, and a bill’s identifier would be potentially confusing or unhelpful as the sole value offered as, say, the “subject” of a debate.  At the same time, it is quite legitimate to say that the bill is what the debate is “about”, in the sense that the bill number would no doubt appear in any headline or agenda entry used for a description of the event — regardless of whether a single word of the discussion actually was about any part of the bill itself, or whether it was a more general discussion about an issue that the bill was meant to address, or was an occasion for a partisan attack on another party.  It might thus be wise to distinguish the “agenda item for the discussion” from “the subject of the discussion” in some way.
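
In data terms that distinction is cheap to make; a sketch, with invented ex: properties:

    # Keep "what was on the agenda" distinct from "what was discussed".
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/legis/")
    g = Graph()
    debate = EX["event/floor-debate-2012-04-23"]
    g.add((debate, EX.agendaItem, EX["bill/hr-9999"]))            # the headline item
    g.add((debate, EX.subject, EX["topic/air-traffic-control"]))  # the actual discussion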

Modeling the proper relationships between a sequence of events and the documentary workflow that more-or-less tightly reflects it is, then, a more difficult thing.  The next section discusses that problem in some detail.

[Figure: version-events chain]

Events, legislative process, and documentary workflow

Events are components of any legislative process model. It is important to think about the ways that clusters or sequences of  events form narratives about the legislative process, and the ways in which events can be tied to the documents that the legislative process creates.  Prior attempts to integrate an events model with legislative documentation have primarily been concerned with problems of versioning and of legislative applicability or effect.  Here, we’re talking about the process of  document creation.

Existing ways of modeling — or even discussing — what happens to a measure as it makes its way through Congress are characterized by some confusion between closely connected approaches that are, nevertheless, subtly different.  Broadly, those approaches might be divided into three types:

  • A process-aware model, in which documents are seen as artifacts produced by particular actions or processes within the legislature.  (On that view, a “measure” is something of an abstraction that groups a series of bill or resolution texts created during the legislative process, as well as related documentary artifacts such as amendments, signing statements, and so on.)  The other two approaches can be seen as more specialized variants of this one:

  • A “legislative process” model containing parliamentary process detail. From a distance, this approach looks a lot like a fine-grained version of the process-aware model just mentioned.  But it is far more detailed in its treatment of parliamentary procedure, rules for debate, and other turns and twists primarily of short-term interest to Congressional insiders. That is the type of story told, for example, by the CRS report on resolving differences between House and Senate bill versions. Pieces of such an approach show up in the bill-status attribute used by the House for its XML schemas and DTDs, which contains such status indicators as “held-at-desk-House” and “reengrossed-amendment-Senate”.

  • A “workflow” model, which focuses on the actual flow, versioning, and exchange of documents or other work product.  Such a document-centric model has obvious intersections (and points of confusion) with a “process-aware” model, in which major legislative events tend to produce important versions of documents.  It also intersects the “legislative process” model, in which certain parliamentary actions (such as amendment) provoke the production of particular documents.

Most of the existing vocabularies in use  — which tend to use words like “status” or “stage” to describe their components — use more than one of these approaches at the same time.  That creates subtle confusion. Each of the approaches views the world a little differently, and mixing them creates inconsistencies.  An illustration  of such a mixed model is provided by the “bill stage” vocabulary from GPO, which contains aspects of the process-aware model (“Considered and Passed House”), the legislative process model (“Additional Sponsors House”), and the workflow model (“Amendment Ordered to be Printed Senate”) all at once.

We have taken a process-aware approach to create our models, for several reasons. We have tried to map as much useful detail as possible, while avoiding the many pitfalls inherent in trying to over-think and over-model. An excessively detailed approach focused on particular parts of the process tends to skew the usefulness of the resulting model in favor of a limited number of specialized users. Also, excessive detail in any or all aspects of such a system inevitably results in a model too cumbersome to be used or maintained:

  • Too-fine distinctions between identifiers will become confused and misapplied, both by maintainers and users.

  • Many identifiers thought useful will end up being dismissed as over-specific, leading to inefficiency and, eventually, further confusion.

  • The level of technical expertise required to implement and maintain a too-highly-detailed model will become limited to far too few individuals to guarantee its proper use.

  • Finally, the time and expense involved in too great a level of detail will, in the end, render it of limited usefulness.

Instead, we have tried to provide a model that is “just right” in its level of detail, while acknowledging that our decisions about the various trade-offs involved in such an approach  would not be everyone’s.    The model has been created with an inherent extensibility that ensures that others may later add any level of specialized detail that they need.

Beyond avoiding problems with excessive detail, such an approach has a few virtues:

  • Conformance with (and respect for) existing systems.  It will probably come as no surprise that most of the existing systems that track legislative information agree as to the major features of the process.  So do most of the helpful narratives — such as How Our Laws Are Made (HOLAM) and a wide array of CRS reports — that inform people about the process.  With such obvious landmarks in common view, it would be a mistake to suggest a map that ignores the landmarks that everyone sees and agrees on.

  • Interoperability.  Using a coarser-grained, process-aware model requires a willingness to compromise a level of specialized meaning that might be achieved through the use of more specialized objects, relationships, and identifiers at every stage of the process.  However, the compromises are not so great. The interoperability that results from making use of existing systems and standards more than compensates for them.  We believe that most of those who currently track legislative information will be able to find obvious points of correspondence and linkage for their own systems.

  • Clarity.  Our process-aware approach provides a means of clearly tying the document model to the process model without creating confusion. Our bifurcated audience looks at events in quite different ways.  Researchers see events in the context of the legislative process, and focus on the aspects of the events that are of most interest to them, whether it be votes, committee activities, or some other point of interest. Librarians, on the other hand, think of events in the context of their documentary evidence, whether physical or digital, because their descriptive traditions are based on documents.  In a very real sense, we see our mission as bringing those two points of view closer together, by using the power of linking to make it possible for members of each audience to find what they need and relate the available resources together in various ways.

As always, we close with musical selections.

Where to look for more

Some months ago, I and some of my colleagues at the LII began to release a series of white papers that were written as part of the construction of a (mostly) comprehensive metadata model for Federal legislation.  They are appearing as a series of posts in this blog.  One which seemed more appropriate for VoxPopuLII — it had to do with metadata quality concerns that are not limited to legislation — was posted there yesterday.  We’ll continue to adapt the white papers as blog posts and release them as Metasausage posts, but we thought that it was high time that we released full documentation of the model.  Many of you have known of its existence for a while; we’ve been slow to release it because, well, we’re just overwhelmed with work.
The model is Linked-Data-friendly and designed to be highly extensible.  We think it could serve as a reference model (by which I think I really mean “extensible scaffolding”) for a much more comprehensive metadata model for Federal legislation.  As you’ll see when you read the documentation, we made no attempt to model things where we lacked domain expertise (appropriations and reconciliation being two), nor did we try to deal with the finer points of House and Senate rules when modeling process.
We’ll be interested in your reactions to it, and very, very interested in taking it further.  Over the next month or so, we’ll actually build out what we’ve already put in the Open Metadata Registry into a full Linked Data representation online.  Our hope is that this is a very big stone that can be used to make some Stone Soup.
The model was primarily done by myself, Diane Hillmann, John Joergensen, and Jon Phipps; other contributors included Sara Frug, Wayne Weibel, Dave Shetland, and Rob Richards.  We had a lot of help from many of you, as well.
I suspect that there may be some glitches in the documentation itself because, as most of you probably know, e-book compilation software is twitchy, and it wouldn’t surprise me if different versions have a range of ugly formatting problems.  Let me know and we’ll clean ’em up.  Most of all, we’re interested in knowing what you think of the model.

These days, there’s no need to settle on a single answer to the question of what standard to reference in describing people and organizations. The current environment — where the speed of change can be daunting — demands strategies that start with descriptive properties that meet local needs as expressed in use cases. Taking the further step of mapping these known needs to a variety of existing standards best provides both local flexibility and interoperability with the wider world.

In the world of Web standards, most thinking about how to describe people and organizations begins with the FOAF vocabulary (http://xmlns.com/foaf/spec/), developed in 2000 as ‘Friend of a Friend’ and now used extensively on the web to describe people, groups, and organizations. FOAF is an RDF-based specification, and as such is poised to gain further in importance as the ideas behind Linked Data gain traction and more widespread implementation. FOAF is quite simple on its face, but as an RDF vocabulary it is easily extended to meet needs for richer and more complex information.  FOAF is now in a stable state, and its developers have recently entered into an agreement with the Dublin Core Metadata Initiative (DCMI) to provide improved persistence and sustainability for the website and the standard.

More recent standards efforts are emerging and deserve attention as well. Several that address the building of descriptions for people and organizations are in working draft at the World Wide Web Consortium (W3C). Although still in draft status, they offer several alternative methods for description that look very useful. Because organizations in these standards are declared as subclasses of foaf:agent, the close association with the FOAF standard is built in.

What may be most useful about FOAF — and the more recent standards that seek to extend it — is its simple and unambiguous method of identifying people and groups, along with its recommendations for minting URIs for significant information about the person or group identified.

But despite its wide adoption, there are some limitations to basic FOAF that weigh on any assessment of its capacity to describe the diversity of people and organizations involved in the legislative process.  Critically, FOAF lacks a method for assigning a temporal dimension to roles, membership, or affiliations.  That temporal requirement is essential to any model used for describing relationships between legislation, legislators, and legislative groups or organizations, both retrospectively and prospectively.  The emerging W3C standard for modeling governmental organizational structures (which includes the modeling of descriptions of people and organizations mentioned above) contemplates extensions to FOAF designed to address this limitation.  Another emerging standard, the Society of American Archivists’ EAC-CPF, also includes provisions for temporal metadata, and seems to take a very broad view of what it models, making it a standard worth watching.

[Figure: two approaches to modeling committee membership and roles over time]

Thinking about affiliations gives a good feel for the process of working with standards; it takes a certain amount of careful thought and some trimming to fit. As an illustration, think about a member of Congress and her history as a congressional committee member. It’s not unusual for a member to serve on a committee for a while, become its chairperson, become ranking minority member after a change in the majority party, become chairperson once again, and finally settle down as a regular member of the committee. One might imagine this as a series of memberships in the committee, each with a different flavor, or as a single membership with changing roles. The figure above illustrates that history. The illustration at the top represents the “serial-membership” approach that is used in the W3C standard. In it, a membership also represents a specific role within the committee and has a duration; the total timespan for an individual’s committee service can only be found by calculation or inference. The bottom illustration represents roles and membership independently. We prefer it, even though it is a bit clunky in assigning durations to both the roles and the overall membership, in a way that might be considered duplicative. It does not require predecessor/successor relationships to link the individual role-memberships into a continuous membership span, nor does it require the slightly contrived idea of a “plain-vanilla” membership.
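
A sketch of the two shapes in plain Python records, with invented field names:

    # Serial-membership approach: one membership per role; the total span
    # of service must be found by calculation or inference.
    serial = [
        {"who": "Rep. Smith", "committee": "Judiciary", "role": "member",  "from": 2001, "to": 2005},
        {"who": "Rep. Smith", "committee": "Judiciary", "role": "chair",   "from": 2005, "to": 2007},
        {"who": "Rep. Smith", "committee": "Judiciary", "role": "ranking", "from": 2007, "to": 2011},
    ]
    total_span = (min(m["from"] for m in serial), max(m["to"] for m in serial))

    # Independent membership and roles: one membership with its own duration,
    # plus separately dated roles. Durations are stated twice, but no
    # predecessor/successor links or "plain-vanilla" membership are needed.
    membership = {"who": "Rep. Smith", "committee": "Judiciary", "from": 2001, "to": 2011}
    roles = [
        {"role": "chair",   "from": 2005, "to": 2007},
        {"role": "ranking", "from": 2007, "to": 2011},
    ]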

We think that modelers are often tempted to choose standards prematurely; a kind of Chinese-menu approach to modeling can be overly influenced by the appeal of one-stop shopping. Our preference has been to model as closely to the data as we can. Once we have a model that is faithful to the data, we can start to think about which of its components should be taken from existing models — no sooner. In that way we avoid representation problems and also some hidden “gotchas” such as nonsensical or inappropriate default values, relationships that almost work, and so on. The same can be said of structure and hierarchy among objects — it is best to start modeling in a way that is very flat and very close to the data, and only once that is complete gather things into sub- and super-classes, subproperties, and so on.

Standards encountered in libraries

One question that always arises in discussing standards like FOAF in a library context is the prevalence of the MARC model in most discussions of description of people and organizations. Traditionally, libraries have used MARC name authority records as the basis for uniquely identifying people and organizations, providing text strings for both identification and display. Similar functionality has been attempted with the recent additions to the Library of Congress’s Metadata Authority Description Schema (MADS). MADS was originally developed as an XML expression of the traditional MARC authority data. Now, with the arrival of a public draft standard, focus is shifting toward an RDF expression to provide a path for migration of MADS data into the Semantic Web environment.  

MADS, like its parent USMARC Authority Format, focuses on preferred names for people and organizations, including variants, rather than on describing the person or organization more fully. As such it provides a useful ‘hook’ into library data referencing the person or organization, but is not really designed to accommodate the broader uses required for this project.

There is also a question about where this new RDF pathway for MADS might go, given the traditional boundaries of the MARC name authority world. In that tradition, names are added to the distributed file based on ‘literary warrant’, requiring that there be an object of descriptive interest which is by or about the person or organization that is a candidate for inclusion.  That is not a particularly useful basis for describing legislators, hearing witnesses, or others who have not written books or been the subject of them. Control of names and name variants will surely be necessary in the new web environment, and the extensive data and experience with the inherent problems of change in names will be essential, but not sufficient, for more widely-scoped projects like this one.

Groups vs. Organizations

Legislatures create myriad documents that must be identified and related to one another. For each of those documents, there are people and organizations fulfilling a variety of roles in the events the documents narrate, the creation of the documents themselves, the endorsement  of their conclusions, or the implementation of whatever those documents describe. Those people and organizations include not only legislators and the various committees and other sub-organizations of the legislature, but also the executive branch which, primarily through the President, exercises the final steps in the legislative process, as well as bearing responsibility for implementation. Finally, there are other parties, often outside government, who are involved in the legislative process as hearing witnesses or authors of committee prints, whose identity and organizational affiliations are essential to full description and interpretation. These latter present a particularly strong case for linked-data approaches, as they are unlikely to have any sort of formal description constructed for them by the legislature. The Congressional Biographical Dictionary is an excellent resource — but it is a dictionary of Congresspeople, not of all those who appear in Congressional documents. The latter would be impossible for any single entity to construct and maintain. But the task can be divided and conquered in concert with resources like the New York Times linked-data publishing effort, DBPedia, Freebase, and so on.

When discussing organizations, it is sometimes useful to distinguish between more and less formal groupings.  In the FOAF specification, that is conceptualized in the categories “group” and “organisation”.  Generally, FOAF imagines that an “organisation” is a more formalized entity with fairly well defined memberships and descriptions, whereas a “group” is a more informal concept, intended to capture collections of agents where a strict specification of membership may be problematic, or impossible.  In practice, the distinction tends to be a very blurry one, and seems to be a sort of summary calculation done on a number of dimensions:

 

  • the temporal stability of the group itself, for example “people eating dinner at Tom’s house”, as opposed to “the House Judiciary Committee”;
  • the temporal stability of the group’s membership, which may be relatively fixed or constantly churning (“the Supreme Court” versus “the people waiting in the anteroom”);
  • the existence of institutional trappings such as office locations, meeting rooms, and websites;
  • the level of “institutionalization” or “officialness”.  In the case of government institutions in any branch, that may often rest on some legal authority that establishes the group and describes its scope of operations (as with the Federal courts). It may also take the form of a single, very narrow capability (as when an agency is said to have “gift authority”).  Finally, it may also be established through tradition.  For example, the Congressional Black Caucus has existed for over 40 years, and occupies offices in the Longworth House Office Building, but has no formal existence in law.


Because that distinction is so blurry, we have chosen to treat all organizations similarly, using common properties that allow users to determine how official the organization is by ‘following their noses’.  The accumulation of statement-level data about any of the dimensions listed above (or others, for that matter) serves as evidence.
Thus, users of the model are free to draw their own conclusions about the “officialness” of any collection of people, although a statutory or constitutional mandate might well be interpreted as dispositive.

We end here, as usual, with a couple of musical selections: 1, 2, 3.  Next time: Events.

[This is part 3 of a three-part post on identifiers. Here are parts 1 and 2]

How well does current practice measure up?

To judge by the examples presented so far, current practice in legislative identifiers for US materials might best be described as “coping”, and specifically “coping in a way that was largely designed to deal with the problems of print”. Current practice presents a welter of “identifiers”, monikers, names, and titles, all believed by those who create and use them to be sufficiently rigorous to qualify as identifiers whether they are or not.  It might be useful to divide these into four categories:

  • Well-understood monikers, issued in predetermined ways as part of the legislative process by known actors.  Their administrative stability may well be the product of statutory requirement or of requirements embedded in House or Senate rules. Many of these will also correspond to definite stages in the legislative process. Examples would include House and Senate bill and resolution numbers.
  • Monikers arising from need and possibly semi-formalized, or possibly “bent” versions of monikers created for a purpose other than that they end up serving.  Monikers of this kind are widely relied on, but nobody is really responsible for them.  Some end up being embedded in retrieval systems because they’re all there is.  A variety of such approaches are on display in the world of House committee prints.
  • Monikers imposed after the fact in an effort to systematize things or otherwise compensate for any deficiencies of monikers issued at earlier stages of the process.  Certainly internal database identifiers would fit this description; so would most official citation.
  • A grab-bag of other monikers. These might be created within government (as with GPO’s SuDoc numbers), or outside government altogether (as with accession numbers or other schemes that identify historical papers held in other libraries).  Here, a good model would provide a set of properties enabling others to relate their schemes to ours.

Identifiers in a Linked Data context

John Sheridan (of legislation.gov.uk) has written eloquently about the use of legislative Linked Data to support the development of “accountable systems”. The key idea is that exposing legislative data using Linked Data techniques has particular informational and economic value when that data defines real-world objects for legal purposes.  If we turn our attention from statutes to regulations, that value becomes even more obvious.

Valuable features of Linked Data approaches to legislative information

Ability to reference real-world objects

“On the Semantic Web, URIs identify not just Web documents, but also real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn. We call these real-world objects or things.” — Tim Berners-Lee

There are no unicorns in the United States Code. Nevertheless, legislative data describes and references many, many things.  More, it provides fundamental definitions of how those things are seen by Federal law.  It is valuable to be able to expose such definitions — and other fundamental information — in a way that allows it to be related to other collections of information for consumption by a global audience.

Avoiding cumbersome standards-building processes

In a particularly insightful blog post that discusses the advantages of the Linked Data methods used in building legislation.gov.uk, Jeni Tennison points out the ability that RDF and Linked Data standards have to solve a longstanding problem in government information systems: the social problem of standard-setting and coordination:

RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we really want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.

The other thing about RDF that really helps here is that it’s easy to align vocabularies if you want to, post-hoc. RDFS and OWL define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.

While Tennison’s remarks here concentrate on vocabularies, a similar point can be made about identifier schemes; it is easy to relate multiple legacy identifiers to a “gold standard”.
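
The same post-hoc alignment applies to identifier schemes; a minimal rdflib sketch, with invented URIs on both sides:

    # Relate a legacy vocabulary and identifier scheme to a "gold standard".
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, SKOS

    LEGACY = Namespace("http://example.org/legacy/")
    GOLD = Namespace("http://example.org/gold/")
    g = Graph()
    # Two vocabularies describing the same property...
    g.add((LEGACY.billNo, OWL.equivalentProperty, GOLD.billNumber))
    # ...and two identifier schemes naming the same document.
    g.add((LEGACY["hr/101/1"], SKOS.exactMatch, GOLD["doc/a7c41f"]))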

Layering and API-building

Well-designed, URI-based identifier schemes create APIs for the underlying data.  At the moment, the leading example for legislative information is the scheme used by legislation.gov.uk, described in summary at http://data.gov.uk/blog/legislationgovuk-api  and in detail in a collection of developer documentation linked from that page.  Because a URI is resolvable, functioning as a sort of retrieval hook, it is also the basis of a well-organized scheme for accessing different facets of the underlying information.  legislation.gov.uk  uses a three-layer system to distinguish the abstract identity of a piece of legislation from its current online expression as a document and from a variety of format-specific representations.  
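
The shape of that three-layer scheme, reconstructed from memory of the legislation.gov.uk developer documentation (the documentation linked above is authoritative; treat these paths as illustrative):

    layers = {
        # The abstract piece of legislation itself.
        "identifier": "http://www.legislation.gov.uk/id/ukpga/1985/67",
        # Its current expression as an online document.
        "document": "http://www.legislation.gov.uk/ukpga/1985/67",
        # Format-specific representations of that document.
        "representations": [
            "http://www.legislation.gov.uk/ukpga/1985/67/data.xml",
            "http://www.legislation.gov.uk/ukpga/1985/67/data.rdf",
        ],
    }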

That is an inspiring approach, but we would want to extend it to encompass point-in-time as well as point-in-process identification (such as being able to retrieve all of the codified fragments of a piece of legislation as codified, using its original bill number, popular name, or what-have-you).  At the moment, legislation.gov.uk does this only via search, but the recently announced Dutch statutory collection at http://doc.metalex.eu/ does support some point-in-time features.   It is worth pointing out that the American system presents greater challenges than either of these,  because of our more chaotic legislative drafting practices, the complexity of the legislative process itself, and our approach to amendment and codification.

Identifier challenges arising from Linked Data (and Web exposure generally)

The idea that we would publish legislative information using Linked Data approaches has obvious granularity implications (see above), but there are others that may prove more difficult.  Here we discuss three: uniqueness over wider scope, resolvability, and the practical needs of “identifier manufacturing”.

Uniqueness over wider scope

Many of the identifiers developed in the closed silo of the world of legal citation could be reused as URIs in a linked data context, exposing them to use and reuse in environments outside the world where legal citation has developed.  In the open world, identifiers need to carry their context with them, rather than have that context assumed or dependent on bespoke processes for resolution or access.   For the most part, citation of judicial opinions survives wide exposure in fair style.  Other identifiers used for government documents do not cope as well.   Above, we mentioned bill numbers as being limited in chronological scope; other identifiers (particularly those that rely heavily on document titles or dates as the sole means of distinction from other documents in the same corpus) may not fare well either.

Resolvability

The differences between URNs (Uniform Resource Names) and URLs (Uniform Resource Locators, the URIs based on the HTTP protocol) are significant.  Wikipedia notes that URNs are similar to personal names, and URLs to street addresses; the first rely on resolution services to function.  In many cases, URNs can provide the basis for URLs, with resolution built into the http address, but in the world we’re now working in, URNs must be seen as insufficient for creating linked open data.

In reality, they have different goals.  URIs provide resolvability — that is, the ability to actually find your way to an information resource, or to information about a real-world thing that is not on the web.  As Jeni Tennison remarks in her blog, they do that at the expense of creating a certain amount of ambiguity.  Well-designed URN schemes, on the other hand, can be unambiguous in what they name, particularly if they are designed to be part of a global document identification scheme from the beginning, as they are in the emerging URN:Lex specification.

For our purposes, we probably want to think primarily in terms of URIs, but (as with legacy identifier schemes) there will be advantages to creating sensible linkages between our system, which emphasizes resolvability, and others that emphasize a lack of ambiguity and coordination with other datasets.

Things not on the Web

 

Legislation is created by real people and it acts on real things.  It is incredibly valuable to be able to relate legislative documents to those things.  The challenge lies, as it always has, in eliminating ambiguity about which object we are talking about.  A newer and more subtle need is the need to distinguish references to the real-world object itself from references to representations of the object on the web.  The problems of distinguishing one John Smith from another are already well understood in the library community.  URIs present a new set of challenges.  For instance, we might want to think about how we are to correctly interpret a URI that refers to John Smith, the off-web object that is the person himself, versus a URI that refers to the Wikipedia entry that is (possibly one of many) on-web representations of John Smith.  This presents a variety of technical challenges that are still being resolved.
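
FOAF’s isPrimaryTopicOf property is one existing way to state that relationship explicitly, so that consumers can tell the person from the page; a minimal sketch:

    # Separate URIs for the person and for a page about the person.
    from rdflib import Graph, URIRef
    from rdflib.namespace import FOAF, RDF

    g = Graph()
    person = URIRef("http://example.org/people/john-smith")    # the off-web person
    page = URIRef("https://en.wikipedia.org/wiki/John_Smith")  # one on-web representation
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.isPrimaryTopicOf, page))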

Practical manufacturing and assignment of Web-oriented identifiers

Thinking about the highly-granular approach needed to make legislative data usefully recombinant — as suggested in the section on fragmentation and recombination above — quickly leads to practical questions about where all those granular identifiers will come from. The problem becomes more acute when we begin to think about retrofitting such schemes to large bodies of legacy information.  For these among other reasons, the ability to manufacture and assign high-quality identifiers by automated means has become the Philosopher’s Stone of digital legal publishers.  It is not that easy to do.

The reasons are many, and some arise from design goals that may not be shared by everyone, or from misperceptions about the data.  For example, it’s reasonable to assume that a sequence of accession numbers represents a chronological sequence of some kind, but as we’ve already seen, that’s not always the case.  Legacy practices complicate this.  For example, it would be interesting to see how the sequence of Supreme Court cases for which we have an exact chronological record (via file datestamping associated with electronic transmission) corresponds to their sequence as officially published in printed volumes.  It may well be that sequence in print has been dictated as much by page-layout considerations as by chronology.  It might well be that two organizations assigning sequential identifiers to the same corpus retrospectively would come up with a different sequence.

Those are the problems we encounter in an identifier scheme that is, theoretically, content-independent.  Content-dependent schemes can be even more challenging.  Automatic creation of identifiers typically rests on the automated extraction of one or more document features that can be concatenated to make a unique identifier of wide scope.  There are some document collections where that may be difficult or impossible, either because there is no combination of extractable document features that will result in a unique identifier, or because legacy practices have somehow obliterated necessary information, or because it is not easy to extract the relevant features by automated means.  We imagine that retroconversion of House Committee prints would present exactly this challenge.  

At the same time, it is worth remembering that the technologies available for extracting document features are improving dramatically, suggesting that a layered, incremental approach might be rewarded in the future.  While the idea of “graceful degradation” seems at first blush to be less applicable to identifiers than to other forms of metadata, it is possible to think about the problem a little differently in the context of corpus retroconversion.  That is a complicated discussion, but it seems possible that the use of provisional, accession-based identifiers within a system of properties and relationships designed to accommodate incomplete knowledge about the document might yield good results.
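
To make that last idea concrete, a sketch (the scheme and field names are invented) of minting provisional, accession-based identifiers that record incomplete knowledge for later reconciliation:

    import itertools

    counter = itertools.count(1)

    def provisional_id(corpus):
        """Mint an opaque accession identifier within a corpus."""
        return "http://example.org/%s/acc/%06d" % (corpus, next(counter))

    record = {
        "id": provisional_id("house-committee-prints"),
        # Extracted features, with gaps left explicit rather than guessed at.
        "extracted": {"congress": None, "committee": "Ways and Means"},
        # Flagged for later reconciliation, e.g. via skos:exactMatch assertions.
        "status": "provisional",
    }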

A final note on economics

Identifiers have special value in an information domain where authority is as important as it is for legal information.  In the event of disputes, parties need to be able to definitively identify a dispositive, authoritative version of a statute, regulation, or other legal document.  There is, then, a temptation toward a soft monopoly in identifiers: the idea that there should be a definitive, authoritative copy somewhere leads to the idea of a definitive, authoritative identifier administered by a single organization. Very often, challenges of scale and scope have dictated that that be a commercial publisher.  Such a scheme was followed for many years in the citation of judicial opinions, resulting in an effective monopoly for one publisher.  That is proving remarkably difficult and expensive to undo, even though it has had serious cost implications and other detrimental effects on the legal profession and the public.  Care is needed to ensure that the soft, natural monopoly that arises from the creation of authoritative documents by authoritative sources does not harden into real impediments to the free flow of public information, as it did in the case of judicial opinions.

What we recommend

This is not a complete set of general recommendations — really more a series of guideposts or suggestions, to be altered and tempered by institutional realities:

  • At the most fundamental level, everything should have an identifier. It should be available for use by the public. For example, Congressional committee reports appear not to have any identifiers, but it would be reasonable to assume that some system is in use in the background, at least for their publication by GPO.
  • Many legacy identifier systems will need to be extended or modified to create a gold standard system, probably issued by a third party and not by the document creators themselves.  That is especially the case because there is nobody in a position to compel good practice by document creators over the long term.  Such a gold standard will need to be:
    • Unambiguous. For example, existing bill and resolution numbers would need to be extended by, e.g., a date of introduction.
    • Designed to resist tampering. When things are numbered and labelled, there is a temptation to alter numbers and labels to serve short-term interests.  The reservation of “important” bill numbers under House procedural rules is an example; another (from the executive branch) is the long-standing practice of manipulating RINs (Regulation Identifier Numbers) to color assessments of agency activity.
    • Clear as to the separation of titling, dating, and identification functions.  Presidential documents provide a good example of something currently needing improvement in this respect.
    • Taking advantage of carefully designed relationships among identifiers to allow the retention of well-understood legacy monikers for foreground use, while making use of a well-structured “gold standard” from the beginning.  Those relationships should enable automated linkage that will allow retrieval across multiple, related identifier systems.
  • Where possible, retain useful semantics in identifiers as a way of increasing access and reducing errors.  It is possible that different audiences will require different semantics, making this unlikely to happen in the background, but it should be possible to retain this functionality in the foreground.
  • Maintain granularity at the level of common citation and cross-referencing practice, but with a distinction between identifiers and labels.  Identifiers should be assigned at the whole-document level, with the notion of “whole document” determined on a corpus-by-corpus basis.  Labels may be assigned to subdocuments (e.g., a section of a bill) for purposes of navigation and retrieval.  This is similar in function and purpose to the distinction between HREF and NAME attributes in HTML anchor tags.
  • Use a layered approach.  In our view, it is important not to hold future systems hostage to what is practicable in legacy document collections.  In general, it will be much harder to implement good practices over documents that were not “born digital”.  That is not a good reason to water down our prospective approach, but it is a good reason to design systems that degrade gracefully when it becomes difficult or impossible to deal with older collections. That is particularly true at a time when the technologies for extracting metadata from legacy documents are improving dramatically, suggesting that a layered, incremental approach might produce great gains in the future.

We conclude, as always, with a musical selection or two.  Next time, some stuff about people and organizations as we find them in the legislative world.

[Image: The Prisoner, Number 6]

[ Part 2 in a 3-part series. Last time we talked about some general characteristics of identifiers for legislation, and some sources of confusion in legacy systems. This time: some design problems having to do with granularity and use, and the fact that identifiers are situated in legal and bureaucratic process. ]

Identifier granularity

How small a thing should we try to identify? It’s difficult to make general prescriptions about that, for needs vary from corpus to corpus.  For the most part, we assume that identifier granularity should follow common citation or cross-referencing practice — that is, the smallest thing we identify or label should be the smallest thing that standard citation practice would allow a user to navigate to.  That will vary from collection to collection, and from context to context. For example, it’s quite common for citation to the US Code to refer to objects at the subsection level, sometimes right down to the paragraph level.  On the other hand, references to the Code in the Parallel Table of Authorities and Rules generally refer to a full section.  Similarly, although cross-references within the Code of Federal Regulations can be very granular, external references typically target the Part level. In any corpus, amendments can be expressed in ways that are very granular indeed.

Our citation and cross-referencing practices have evolved in the context of print, and we may be able to do things that better reflect the dynamic nature of legislative text.  The move from print to digital overturns background assumptions about practicality.  For example, print typically makes different assumptions about identifier stability than you would find, say, in an online legislative drafting system.  Good examples of this are found in citation practice for the Code of Federal Regulations, which typically cites material at the Part level because (one imagines) changes in numbering and naming of sections are so frequent as to render identifiers tied to such fine divisions unstable — at least in print, where the shelf life of such fine-grained identifiers is shorter than the shelf life of the edition by an order of magnitude. In a digital environment, it is possible to manage identifiers more closely, permitting graceful failure of those that are no longer valid, and providing automated navigation to things that have moved. We look at some of the possibilities and implications in sections on granularity, fragmentation, and recombination below.  All of those capabilities carry costs, and over-design is a real possibility.
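
A sketch of the kind of close management a digital environment permits: a renumbering map that lets a dereferencer redirect identifiers that have moved and fail gracefully for those that are gone (the section identifiers are invented):

    moved = {"40-cfr-52.1161": "40-cfr-52.1163"}  # old identifier -> new identifier
    withdrawn = {"40-cfr-52.1162"}                # identifiers that no longer resolve

    def dereference(section_id):
        if section_id in withdrawn:
            return "410 Gone: section removed; consult historical versions"
        if section_id in moved:
            return "303 See Other: /cfr/" + moved[section_id]
        return "200 OK: /cfr/" + section_id

    print(dereference("40-cfr-52.1161"))  # -> 303 See Other: /cfr/40-cfr-52.1163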

Metadata, markup, and embedding

Thinking about granularity leads to ideas about the linkages between metadata and the target object itself.  Often metadata applies to chunks of documents rather than whole documents.  Cross-referencing in statutes and legislation is usually done at the subdocument level, for instance, and subject-matter classification of a bill containing multiple unrelated provisions would be better if the subject classifications could be separately tied to specific provisions within the bill. That need becomes particularly acute when something important, but unrelated to the main purpose of the bill, has been “snuck in” to a much larger piece of legislation.  A stunning example of such a Frankenstein’s monster appears at Pub. L. 111-226.  It is described in its preamble as modernizing the air-traffic control system, but its first major Title heading describes it as an “Education Jobs Fund”, and its second major Title contemplates highly technical apparatus for providing fiscal relief to state governments.

We are aware that sometimes we are thinking in terms that are not currently supported by the markup of documents in existing XML-creating systems.  However, we think it makes sense to design identifier systems that are more capable than some document collections will currently support via markup, in the expectation that markup in those collections will evolve to the same granularity as current cross-referencing and citation practice, and that point-in-time systems supporting the full lifecycle of legislative drafting, passage, and codification will become the norm.  Right now, divisions of statutory and regulatory text below the section level (“subsection containers”) are among the most prominent examples of “missing markup”; they are provided for in the legislative XML DTDs at (e.g.) xml.house.gov, but do not survive into the FDsys versions from GPO.

Most often, we imagine that the flow of document processing leads from markup to metadata, since as a practical matter a lot of metadata is generated simply by extracting text features that have been tagged with some XML or HTML element.  Sometimes the flow is in the other direction; we may want to embed metadata in the documents for various purposes.  Use of microformats, microdata, and other such schemes can be handy for various applications; the use of research-management software like Zotero, or the embedding of information about document authenticity comes to mind.  These are not part of a legislative data model per se, but represent use cases worth thinking about.

Stresses and strains

Next, we turn to things that affect the design of real-world identifier systems, perhaps rendering them less “pure” in information-design terms than we might like.

Semantics versus purity

Some systems enforce notions of identifier purity — often defined as some combination of uniqueness, orderliness, and ease of collation and sorting — by rigorously stripping all semantics from identifiers.  That approach can function reasonably well in back-end systems, but it greatly reduces the usefulness of the identifiers to humans (because understanding what the identifier identifies requires a database lookup), and it introduces extra possibilities for error in application because (among other reasons) errors caused by human transcription are hard to catch when the identifiers are meaningless strings of letters and numbers.  On the other hand, “pure” opaque identifiers counter a tendency to assume that one knows what a semantically laden identifier means, when in fact one might not.  And sometimes opaque identifiers can be used to provide stability in situations where labels change frequently but the labelled objects do not.

At the other end of the spectrum, identifier systems that are heavily burdened with semantics have problems with uniqueness, length, persistence, language, and other issues arising from inherent ambiguity of labels and other home-brewed identifier components.  It is worth remembering, too, that one person’s helpful semantics are another’s mumbo-jumbo; just walk up to someone at random and ask them the dates of the 75th Congress if you need proof of that. Useful systems find a middle ground between extremes of incomprehensible rigor and mindlessly verbose recitation of loosely-constructed labels.

It’s worth noting in passing that it can be very difficult to prevent the unwanted exposure of “back-end” identifiers to end users.  For example, URIs constructed from back-end components often find their way into the browser bars of authors researching online, who then paste them into documents that would be better served by more brain-compatible, human-digestible versions.

  • Citation: 18 USC 47.  Standard citation ignores all but Title and section number; intermediate aggregations are not needed, and confusing.
  • Popular name: Wild Horse Annie Act.  Descriptive and often used in popular accounts, the press, agency guidance on related rules, etc., but hard to find in the codified version.
  • LII URI, “presentable” version: http://www.law.cornell.edu/uscode/18/47.html .  Based on title and section number.
  • LII URI, “formal” version: http://www.law.cornell.edu/uscode/18/usc_sec_18_00000047—-000-.html .  Also title and section based, but padded and normalized to allow proper collation; “supersection” aggregations above the section level are similarly disambiguated.
  • USGPO URI, GPOAccess: http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+18USC47 .  A parameterized search returning one result.
  • FindLaw URI: http://codes.lp.findlaw.com/uscode/18/I/3/47 .  Seemingly mysterious, because it interjects subtitle and part numbering, which is not used in citation.  Note that this hierarchy would also vary from Title to Title of the Code — not all have Subtitles, e.g.

The list above shows some “monikers in the wild” — various real-world approaches to the problem of identifying a particular section of the US Code.  The “formal” LII identifier shows just how elaborate an identifier needs to be if it is to accommodate all the variation that is present in US Code section numbering (there is, for example, a 12 USC 1749bbb-10c), while still supporting collation.  The FindLaw URI demonstrates the fragility of hierarchical schemes; the intermediate path components would vary enormously from Title to Title, and occasionally lead to some confusion about structure, as intermediate levels of aggregation are called different things in different Titles.  It is hard to tell, for example, if FindLaw interpolates “missing” levels into the URIs in order to maintain an identical scheme across Titles with dissimilar “supersection” structure.

Administrative zones of control and procedural rules

Every identifier implies a zone of administrative control:  somebody has to assign it, somebody has to ensure its uniqueness, and somebody or something has to resolve it to an actual document location, physical or electronic.  Though it has taken years, the community has recognized that qualities of persistence and uniqueness are primarily created by administrative rather than technical apparatus.  That becomes a much more critical factor when dealing with government documents, which may be surrounded by legal restrictions on who may assign identifiers and when, and in some cases what the actual formats must be.  A legislative document may have its roots in ideas and policies formed well outside government, and pass through numerous internal zones of control as it makes its way through the legislature. It may emerge at the other end via a somewhat mysterious intellectual process in which it is blown to bits and the fragments reassigned to a coherent, but altogether different, intellectual structure with its own system of identifiers (we call this ‘codification’).  There may be internal or external requirements that, at various points in the process,  cause the document to be expressed in a variety of publications and formats each carrying its own system of citations and identifiers.

The legacy process, then, is an accretive one in which an object acquires multiple monikers from multiple sources, each with its own requirements and rules.  Sometimes those requirements and rules are shaped by concerns that are outside, and perhaps at odds with, sound information-organization practice.  

For example, the House and Senate each have their own rules of procedure, in which bill numbering is specified.  Bill numbers are usually accession numbers that reset with each new Congress, but the rules of procedure create exceptions.  Under the rules of the House for  the 106th Congress, the first ten bill numbers were reserved for use by the Speaker of the House for a specified time period. During the 107th and 108th Congresses (at least), the time period was extended to the full first session.  We surmise that this may have represented an attempt to reserve “important” bill numbers for things important to the majority party’s legislative agenda.  Needless to say, this rendered any relationship between bill numbers and chronology or order of introduction questionable, at least in a limited number of cases. The important point is that identifier usage will be hostage to political considerations for as long as it is controlled by rules of procedure; that situation is not likely to change.  

But there are also virtues to the legacy process, primarily because close association with long-standing institutional practices lends long-term stability to identifier schemes.  Bill numbers have institutional advocates, are well understood, and are unlikely to change much in deployment or format. They provide good service within their intended scope, however much they may lose when taken outside it.

That being said, a “gold standard” system of identifiers, specified and assigned by a relatively independent body, is needed at the core.  That gold standard can then be extended via known, stable relationships with existing identifier systems, and designed for extensible use by others outside the immediate legislative community.
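As a sketch of what “extension via known, stable relationships” might look like in data, a registry could relate legacy monikers to a single canonical identifier. The identifier forms below are invented for illustration; no actual scheme is implied.

    # Sketch only: the canonical form and the alias strings are invented,
    # not a scheme actually used by LII or any legislative body.
    GOLD = "lex:us/usc/t12/s1749bbb-10c"   # hypothetical gold-standard identifier

    ALIASES = {
        "12 U.S.C. 1749bbb-10c": GOLD,     # traditional citation form
        "uscode/12/1749bbb-10c": GOLD,     # a publisher's path form (invented)
    }

    def to_canonical(moniker):
        """Resolve a legacy moniker to the gold-standard identifier, if known."""
        return ALIASES.get(moniker)

The point of the registry is that the legacy monikers keep working within their own zones of control, while the gold standard supplies the stable hub that outsiders can extend.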

Status, tracing, versioning and parallel activity

It is useful to distinguish between tracing the evolution of a bill or other legislative document and recording the status of that document.  Status usually records a strong association between some version of the document and a particular, well-known stage or event in the process by which it is created, revised, or made binding.  That presents two problems.  There is a granularity problem, in that some legislative events that cause alteration of the document are so trivial that distinguishing all of them would create an unnecessarily fine-grained, burdensome, and unworkable system. There is a stability problem, in that legislative processes change over time, sometimes in ways that are very significant, as when (in 1975) the House rules changed to allow bills to be considered by multiple committees, and sometimes in ways that are not, as when House procedural rules are revised in trivial, short-lived ways at the beginning of each new Congress.  Optimally, bill status would be a property drawn from a small, well-documented vocabulary of legislative milestones or events that remains very stable over time.  Detailed tracing of the evolution of a bill would be enabled through a series of relationships among documents that would (for instance) identify predecessor and successor drafts as well as other inter-document relationships.  These properties would exist as part of the data model without need for additional semantics in the identifiers. Such a scheme might readily be extended to accommodate the existence of multiple, parallel drafts, as sometimes happens during the committee process.

In this way, the model would answer questions about the “version” of a given document by making assertions either about its “status” (that is, whether it is tied to some well-known milestone in the legislative process) or about some combination of properties that chain back to such a milestone.  For example, a document might be described as a “committee draft from Committee X that is a direct revision of the document submitted to the committee, dated on such-and-such a date”.  The exact “version” of the document is thus given by a chain of relationships leading back to a draft that can be definitively associated with a stable milestone in the legislative process.

It’s worth noting that while it would certainly be possible to identify versions using “version numbers” built by extending the accession number of the root document with various semantically derived text strings, it’s not necessary to do so.  The identifiers could, in fact, be anything at all.  All that is needed is for them to be linked to well-known “milestone” documents (e.g., the version reported out of committee) by a chain of relationships (for example, “isSuccessorVersionOf”) leading back to the milestone.  This may be particularly important when the document-to-document relationship extends across boundaries between zones of administrative control, or outside government altogether.
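A minimal sketch of that idea follows; the identifiers and property names, including “isSuccessorVersionOf”, are illustrative, not an actual vocabulary. Version is computed by walking the chain back to the nearest milestone, so the identifiers themselves can be entirely opaque.

    # Milestone-anchored versioning: only the relationships carry
    # versioning semantics; the document identifiers are opaque.
    documents = {
        "doc-001": {"milestone": "introduced"},           # well-known milestone
        "doc-7f3": {"isSuccessorVersionOf": "doc-001"},   # committee redraft
        "doc-9a2": {"isSuccessorVersionOf": "doc-7f3"},   # a further revision
    }

    def describe_version(doc_id):
        """Describe a document as N revisions after the nearest milestone."""
        steps = 0
        current = doc_id
        while "milestone" not in documents[current]:
            current = documents[current]["isSuccessorVersionOf"]
            steps += 1
        return f"{steps} revision(s) after the '{documents[current]['milestone']}' version"

    print(describe_version("doc-9a2"))
    # 2 revision(s) after the 'introduced' version

Nothing in “doc-9a2” encodes its version; a parallel committee draft would simply be a second document pointing at the same predecessor.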

Granularity

To a great extent, the things being “identified” by identifiers are discrete documents, traditionally rendered as discrete print works. There are, however, significant exceptions that should be accommodated, and changes in the nature and structure of future documents should be anticipated as well.

The issue of “granularity” arises from the need to identify parts of a discrete document. For example, although a congressional hearing is published as a single document (sometimes in multi-volume form), it may be useful to make specific references to the testimony of individual witnesses. Even more significant is the mapping of relationships between the U.S. Code and the public laws from which it is derived. In these cases, the identifiers available need to be finer-grained than the documents being identified. So, although a Public Law or slip law can be completely identified and described by a given set of identifiers, it is valuable to have additional identifiers for sub-parts of those documents, so that relationships to sections of the U.S. Code can be adequately described.
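One hedged sketch of such sub-document identifiers, with invented forms throughout: pair a document identifier with an optional fragment path, so that the whole and its parts share a common root.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Ref:
        """A document identifier plus an optional finer-grained fragment."""
        document: str                    # e.g. a Public Law: "pl/111/148" (invented form)
        fragment: Optional[str] = None   # e.g. "sec/1501" within that document

        def uri(self):
            return self.document if self.fragment is None else f"{self.document}#{self.fragment}"

    whole = Ref("pl/111/148")
    part = Ref("pl/111/148", "sec/1501")
    print(whole.uri(), part.uri())   # pl/111/148 pl/111/148#sec/1501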

Of course, admitting such identifiers can be a slippery slope. The set of things that could be identified in legislative documents is effectively unbounded, and almost any identifier will arguably be useful to someone. An attempt to label all possible things, however, is madness, and should be avoided. The result would be great numbers of unused or seldom-used identifiers that over-complicate both individual entities and the overall structure of the identifier system.

Fragmentation and recombination

Identifiers are used in ways that go well beyond slapping a unique label on a relatively static document.  They help us keep track of resources that can, in the electronic environment, be highly mobile.  Legislation is often fragmented and recombined into new expressions, some official and some not.  For many legal purposes, it is important for the fragments to be recognized as authentic, that is, as carrying the same weight of authority as the work from which they were originally taken.  Current practice accommodates this through a variety of officially published finding aids, including significant ones associated with the US Code: the Table of Popular Names, the “Short Title” notes, and Table III of the printed edition of the US Code, which is essentially a codification map. Elsewhere, we’ve referred to such a work as a “pont”: something that bridges two otherwise isolated legal information resources.  Encoding ponts in engineered, machine-usable ways that facilitate retrieval is a particularly crucial function that the identifier model should support.
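By way of illustration, a pont like the Table of Popular Names reduces naturally to a machine-readable mapping. The entries below are abbreviated and not authoritative.

    # A pont in machine-readable form: a Table-of-Popular-Names-style
    # bridge from an act's popular name to US Code locations.
    POPULAR_NAMES = {
        "Wild Horse Annie Act": [("18", "47")],   # (Title, section); illustrative
    }

    def locations_for(name):
        """Bridge from a popular name to the Code locations it maps to."""
        return POPULAR_NAMES.get(name, [])

    print(locations_for("Wild Horse Annie Act"))   # [('18', '47')]

The engineering challenge is not this lookup, of course, but keeping the mapping current as the Code evolves, which is where stable identifiers on both sides of the bridge earn their keep.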

Codification

Codification presents challenges, the more so because it can erect substantial barriers for inexperienced researchers.  Citizens often seek legislation by popular name (“Wild Horse Annie Act”). They don’t get far.  The problem is usually (though not always) more difficult than simply uncovering an association between the popular name of the act they’re seeking and some coherent chunk of the United States Code, or a fragment within a document that carries a Public Law number.  Often, the original legislation has been codified in ways that scatter fragments over multiple Titles of the US Code.

That is so because even a coherent piece of legislation — and many are not —  typically addresses a bundle of issue-related concerns, or the needs of a particular constituency.  A “farm bill” might contain provisions related to tax, land use, regulation of commodities, water rights, and so on.  All of those belong in very different places under the system of topics used by the US Code.  Thus, legislation is fragmented and recombined during the process of codification.  While this results in much more coherent intellectual organization of statutes over the long term, it makes it difficult for users to exchange the tokens they have — usually the popular name of an Act, or some other moniker assigned by the press (“Obamacare”) — for access to what they are seeking.

Table III of the United States Code provides a map from provisions of Public Laws to their eventual destinations within the US Code, as the Code existed at the time of classification.  That is potentially very useful to a present-day audience, provided that the relationships expressed in it can be traced forward through time: changes to the Code made since classification would need to be accounted for.  That would rest on two things: an identifier system capable of tracking the fragments of the original Act as they are codified, and a series of relationships that account both for the process of codification and for the processes by which the Code itself subsequently evolves.
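A sketch of what that forward tracing might look like, with identifiers and relationship names invented for the purpose: Table-III-style classification edges are combined with edges recording the Code’s later evolution, and a query walks both.

    # Classification plus subsequent evolution, as a small edge set.
    # All identifiers and edge names are invented for illustration.
    edges = {
        ("pl/85-767/sec/1", "classifiedAt"): "usc/t23/s101-1958",
        ("usc/t23/s101-1958", "renumberedAs"): "usc/t23/s101",  # later recodification
    }

    def trace_forward(provision):
        """Follow the classification edge, then any later renumbering edges."""
        current = edges.get((provision, "classifiedAt"), provision)
        while (current, "renumberedAs") in edges:
            current = edges[(current, "renumberedAs")]
        return current

    print(trace_forward("pl/85-767/sec/1"))   # usc/t23/s101

In practice the evolution edges would also need to cover transfers, repeals, and splits, but the principle is the same: the original classification is a fixed historical fact, and the rest is a chain to be walked.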

Fragmentary re-use

Codification is really a special case of something we might call “fragmentary re-use”: an application in which a document snippet, or other excerpt from an object, is reused outside its parent.  Next time, we’ll discuss the problems of identifier exposure in a Linked Data context, where identifiers must carry their own context.  A noteworthy example is the legislative fragment that needs to carry some link back to its provenance, and specifically to its legal status or authority.  Minimally, this would be an identifier resolvable to a data resource describing the provenance of the fragment.  Such an approach might fit well into a “layered” URI scheme such as the one used by legislation.gov.uk.
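A minimal sketch of that idea, with all names invented and no resemblance to legislation.gov.uk’s actual scheme: the fragment’s identifier resolves to a small provenance record describing its parent and the authority it carries.

    # A reused fragment carries an identifier that resolves to provenance
    # data, rather than embedding its context in the identifier string.
    PROVENANCE = {
        "frag/abc123": {
            "excerptOf": "usc/t12/s1749bbb-10c",   # parent work (invented form)
            "authority": "prima facie evidence (non-positive-law title)",
            "extractedOn": "2014-03-01",           # illustrative date
        },
    }

    def resolve_fragment(frag_id):
        """Return the provenance record carried by a reused fragment."""
        return PROVENANCE.get(frag_id, {})

    print(resolve_fragment("frag/abc123")["authority"])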

[ Fragmented and recombined as we are, we’ll stop here with a song about codification, a granular, highly-recombinant and NSFW musical selection, and a third that queries the object itself (at 2:45) and makes heavy use of visual and audio recombination. Next time: some problems with current practice, identifier manufacturing, and what happens when we think about Linked Data, as we surely should. ]