{"id":1342,"date":"2011-10-02T09:38:12","date_gmt":"2011-10-02T14:38:12","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/?p=1342"},"modified":"2011-10-02T09:38:12","modified_gmt":"2011-10-02T14:38:12","slug":"csl-metadata-and-legal-information-that-just-works","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2011\/10\/02\/csl-metadata-and-legal-information-that-just-works\/","title":{"rendered":"CSL, Metadata, and Legal Information that Just Works"},"content":{"rendered":"
<\/a><\/p>\n In the wake of a decisive victory at the Battle of Sekigahara in 1600, Tokugawa Ieyasu treated rival Japanese warlords to a simple but effective instrument of control, pioneered in the preceding Era of the Warring States. The Daimyo, as the defeated clan heads were known, retained control of their respective domains, but were required to reside in the newly established seat of government at Edo (now Tokyo) in alternate years. They were free to return home in the off-years, but only by leaving their princesses and heirs behind in the walled gardens of the capital, as a token of the enduring bond of friendship and mutual admiration that united the Shogun and his sometimes grudging subordinates.<\/p>\n The processions of competing Daimyo moving to and from the seat of real power soon became a measure of status, and the cost of these semi-annual journeys would eventually consume fully half of each Daimyo\u2019s disposable income. This contributed greatly to the prosperity of communities stationed along the wayside, where tradesmen, innkeepers, chefs, entertainers, and the occasional thief shared in revenue extracted from the peasants in the Daimyo\u2019s fiefdom back home. A cynic might say that the practice of san-kin-k\u014dtai<\/em> (\u53c2\u52e4\u4ea4\u4ee3) was little more than an elaborate system of hostage-taking, but in its way it was very good for business \u2014 at least if you did not have the misfortune to be a peasant.<\/p>\n <\/a><\/p>\n Japan later shed the hobbles of feudal regulation, of course, and the population are now free to move about as they please; but for Daimyo read content<\/em>, and for the Daimyo\u2019s princesses and progeny read metadata<\/em>, and you have a description of a familiar Internet business model. 
Too familiar, perhaps, as most of us now rely on content supplied through walled gardens<\/a> for much of our research work.<\/p>\n Just as the freedom of individuals is improved by lifting restraints on travel, so the flow of content is more meaningful when accompanied by the descriptive metadata that is its natural companion. As observed by others in this space (most recently here<\/a> and here<\/a>), there are barriers today to the free flow of legal information. As will be outlined below, hamstrung metadata is, unfortunately, one of them. This information \u2014 mundane details like the date, court, and party names of a legal decision, and the volume, journal, page or identifier used to locate it \u2014 is curiously hard for machines<\/em> to find in the pages issued by any of the leading commercial services in the 40-year-old online legal information industry.<\/p>\n More than any fundamental difference in the materials themselves, captive metadata accounts for the striking gap that has emerged between the research tools available in law and in other disciplines. Driven by the needs of researchers in the sciences and the humanities, personal research platforms that thrive on metadata are now widely available: to make them servants of the law, they want only to be fed.<\/p>\n One element of this alternative infrastructure that depends on rich metadata provision is the Citation Style Language (CSL<\/a>), which is the proper subject of this essay. The next three sections provide a short introduction to CSL, followed by a few observations on the state of legal metadata provision on today’s legal Internet. 
The essay concludes with a comment on some of the lights that seem to be flickering into view at the end of this particular tunnel, and on the prospective benefits of at last bringing the law within reach of a modern research support ecosystem.<\/p>\n <\/p>\n The Citation Style Language is an XML vocabulary for accurately describing citation and bibliography formats. Given the breath of life<\/a> by the original Zotero<\/a> citation formatter, CSL is now entering its eighth year of development, can boast two full production implementations, and drives citation formatting in at least six major bibliographic or text processing projects, with total user numbers in the millions.<\/p>\n The illustration to the right provides a simplified view of CSL processing flow. In greater detail it works like this:<\/p>\n The upshot of all this swirling machinery is that generic metadata<\/em> can be used to generate citations in arbitrary formats<\/em>. In operation, this means that an article originally written according to, say, the Oxford Standard for Citation of Legal Authorities (OSCOLA<\/a>) can be reformatted on the fly to conform to the requirements of, say, the McGill Guide<\/a>, or perhaps the Australian Guide to Legal Citation<\/a> (PDF) or the ALWD Manual<\/a>. This functionality is used daily by researchers in most fields worldwide, and there is no reason the law should be an exception.<\/p>\n The automated generation of citations is just one benefit of this processing flow; it also enables the embedding of cited metadata directly in the source document (for sharing between collaborators), and it allows links to referenced resources to be attached at the point of production (for ease of referencing after publication). 
Hints of resistance from some quarters<\/a> notwithstanding, such tools clearly promise to save law professors, law students, lawyers, court clerks, judges, and others who must do legal drafting a tremendous amount of time.<\/p>\n There are a few commonly-encountered wrinkles in legal data and citation styles that CSL and the citeproc-js<\/span> formatter have been carefully designed to address. To give readers a glimpse of this work, a few basic elements of the language are laid out below. We’ll begin with the following sample citation in the OSCOLA style: Jones & others v Wright<\/em> [1991] 3 All ER 88.<\/p>\n The bare case name can be produced with the following construct:<\/p>\n (Note the use of font-style=\"italic\"<\/span> to render the variable content in italic type, and of the strip-periods=\"true\"<\/span> attribute, which will be discussed below.)<\/p>\n The year element can be produced with the following code:<\/p>\n (Note the use of prefix<\/span> and suffix<\/span>.)<\/p>\n To build the full cite, we join these and other elements together by wrapping them in a group<\/span> element and setting a single space as the delimiter. In the example below, we also define this construct as a macro, so that it can easily be reused in multiple contexts in the style:<\/p>\n If we want to use this cite form for English legal cases only, we can wrap it in a condition:<\/p>\n (Note the type<\/span>, jurisdiction<\/span> and match<\/span> attributes, and the use of a text<\/span> node with a macro<\/span> attribute to call our macro.)<\/p>\n With the code above, we will obtain something close to our target cite format if we arrange for the calling application to feed the processor JSON input like the following:<\/p>\n Looking carefully at this input, we can see that there are some small discrepancies in the metadata:<\/p>\n These details can be handled automatically in the processor. 
The first issue is trivial: quashing periods is a general requirement of OSCOLA, and this one will be removed by the strip-periods=\"true\"<\/span> attribute that we set on the title element. The second issue requires a bit of further explanation.<\/p>\n In our sample input, the journal name has been spelled out in full to avoid ambiguity. This is an example of best practice, although the field content does differ from our desired output of “All\u00a0ER”. The current version of Zotero provides a journalAbbreviation<\/span> field for each item, but this has known limitations, and is not suitable for legal writing.<\/p>\n Many styles require that commonly cited journal names, at least, be abbreviated. Some styles have mandatory and idiosyncratic abbreviation requirements. As Judge Posner commented recently<\/a> (PDF) concerning the requirements of Bluebook: A Uniform System of Citation<\/span>: It\u2019s as if there were a heavy tax on letters, making it costly to write out Coast Guard Court of Criminal Appeals instead of abbreviating it …<\/em> There is no tax on letters, of course, but the lack of a truly uniform system of abbreviation means that such elaborate schemes impose a significant cost in their own right. In Zotero, if journal abbreviations are registered directly on individual items in the user’s personal library, they must be entered manually for each item, both when the original item is created, and each time the user wants to generate citations in a different style. This is not acceptable: metadata should be generic<\/em>.<\/p>\n With a view to squaring the needs of users with those of the more demanding styles, the citeproc-js<\/span> processor allows arbitrary abbreviation lists to be registered and managed on a per-style basis.<\/p>\n Here’s how it works. When the processor encounters a field that requests form=\"short\"<\/span>, it looks for the field content in an externally-supplied abbreviation list derived from a small (persistent) database. 
If there is no match, the processor opens an empty entry for the field in its (ephemeral) run-time registry. In an application that draws on this functionality, the user can visit the run-time listing at any time, and enter suitable abbreviations. These are then registered in the persistent external database, where they are remembered for future use with the current style.<\/p>\n In our case, the user would enter “All\u00a0ER” as the journal abbreviation, and the application would store and deliver auxiliary input like the following:<\/p>\n Abbreviations list support has not yet been implemented in mainstream projects, but I have built a small Firefox add-on<\/a> for use with Zotero that draws upon it, and I am happy to report that it does work<\/a>, as advertised<\/a>, both for journal abbreviations, and for other similar purposes (such as “hereinafter” support).<\/p>\n In our CSL code, invoking the abbreviation list machinery requires only a small change to the citation macro:<\/p>\n A full style will be more elaborate, but the basic logical structures are the same, with conditional statements used to select simple nested groups of nodes that describe the output to be produced.<\/p>\n <\/a>I’ll draw a line under the technical discussion at this point, but you get the idea.<\/p>\n CSL is an elegant and expressive language that has grown under the tutelage of strict demands from academics and graduate students in many fields. The language is fully documented in the CSL Specification<\/a>. The proposed extensions for full legal support, documented in the citeproc-js<\/span> CSL Specification Supplement<\/a>, have been carefully formulated, and I am open to feedback. Style development is proceeding apace, and increments and milestones are being reported through the CitationStylist.org<\/a> website, which serves as a clearinghouse for legal and multilingual style development. 
From experience with the first target style for full implementation (the Creative Commons licensed OSCOLA<\/a>), the prospects for CSL style support for legal resources that “disappears”, as such tools ought to do, are very bright.<\/p>\n <\/a>In addition to bringing us open-source community-driven citation formatting technology, Zotero offers one-click acquisition of content, to a full-featured personal electronic library on the user’s desktop. This is handy, even essential, in today’s world of overabundant information sources. It is facilitated by the fact that in most fields of study, aggregator sites have a long history of providing access to structured metadata from their pages.<\/p>\n The server-side technology that enables one-click content acquisition well predates the Internet. Libraries that run their catalogs on the 1980’s MARC standard<\/a> or one of its variants can and often do expose these records to the Internet. Aggregators in the sciences typically provide BibTeX<\/a> records, which researchers have relied upon since the original format was frozen in 1988. Booksellers and publishing consortia offer metadata keyed to ISBN numbers<\/a>, and the publishers of academic and other journals participate in the DOI system<\/a> for assigning unique IDs keyed to canonical metadata for individual articles. The world of academic discourse is swimming in rich, life-giving metadata. Until, that is, one arrives on the salted shores of the law, where there is no water, and precious little sand.<\/p>\n <\/a>The metadata story on the paywalled sites is very straightforward: exposing it would not be in the vendor’s commercial interest, so there isn’t any. It’s hard to fault the logic. Even if we insist on the unflattering feudal analogy with which this essay opened, it’s worth remembering that Japan’s Shogunate endured for 250 years before finally giving way to change. 
Business opportunities don’t come much better than that, and one can hardly expect the leading providers to react any differently.<\/p>\n There is variety in the ecosystem, however, and not all suppliers of legal source are driven by the same pattern of economic incentives. Providers that expose their content with metadata stand to benefit from CSL and other infrastructure-in-waiting, which can significantly raise the real value of their service. To state the point more precisely: supplying fine-grained<\/em> metadata is essential for a publisher\u2019s content to be attractive to third-party reference management tools like Zotero \u2014 it\u2019s important enough to be in the project\u2019s guidance notes<\/a>.<\/p>\n This is a separate point from the movement for universal or format-neutral citation formats<\/a>. Promotion of these is also important, but from the perspective of data acquisition, they are not sufficiently uniform across jurisdictions<\/em> to serve, by themselves, as primary metadata for a general research platform. As a well-intended example, consider this tag embedded in a case from CanLII<\/a>:<\/p>\n In order to register this item in a reference manager database, we need to know what each of the elements means<\/em>. This will be obvious to a local practitioner, but a Zotero page translator would need to include hand-crafted pattern-matching functions to parse out the elements and assign them to field variables. If I were doing the coding (ignorant as I am of Canadian law), I would be stumped by several of these elements:<\/p>\n The answers would be obvious to a Canadian lawyer, of course, and with a bit of effort I could look up the details. But multiplied across the jurisdictions of the world, that is an effort that would prove fatal to the task. 
A meta<\/span> tag containing a full formatted citation is better than nothing, but with fine-grained metadata and simple descriptive variable names for each of the elements, the code would practically write itself. It really does make all the difference.<\/p>\n <\/a><\/p>\n A further issue concerns parallel references, which I mention here for the sake of completeness in ranting. In a world that offers an API<\/a> to the entire fictional economy of Farmville<\/a>, one would think that the various and sundry parallel citations to, say, Quackenbush v. US<\/a> would be available as a simple machine-readable graph. But as we have seen, the leading paywalled providers don’t even supply the date of the decision<\/em> in structured form, let alone parallel citation mappings: the data they publish is basically useless for this purpose.<\/p>\n The least-painful path at present is to visit Google Scholar<\/a> with Zotero<\/a>, and fetch the case from the hit listing (not from the case page itself). This yields a set of three cross-linked items that reflect the parallel reports of the case. One click and you’re away \u2014 but consider what happens behind the scenes: (1)\u00a0the Zotero translator performs contorted screen-scraping<\/a> of (2)\u00a0the displayed citations<\/em> in the Google listing, which (3)\u00a0were in turn reverse-engineered from scanned source, and hence (4)\u00a0cannot be trusted for 100% accuracy. It is a testament to human ingenuity that this is possible at all, but the underlying infrastructure is an embarrassing bundle of wet string.<\/p>\n Parallel references are tracked internally, of course, by the major service providers. Lack of user-side access to these mappings has the side effect (bizarre, from the standpoint of other fields) of placing uncommon importance on human-readable citations, because they are the only available means of identifying a given case across multiple data silos. 
Given current publishing arrangements, the problem is intractable, and for the present the best we can do on the reference management side is to provide means of recording these relations in personal libraries when they are identified by individual users.<\/p>\n <\/a>To end on a positive note, compliments are due to the growing number of publishers and dissemination initiatives that have gone the distance to expose well-structured metadata. In the CitationStyles.org<\/a> project, my own immediate aim is to get the CSL output story into shape, and I confess that I have not followed recent (and some not-so-recent) developments as closely as I should. As styles firm up and field assignment conventions come to be settled, I’ll be looking forward to work (by others, as well as a bit myself) on serving the growing number of open-access legal publishers that provide structured metadata.<\/p>\n Zotero is a flexible feeder, and the specific format in which metadata is presented is less important than that it be separated into discrete fields. The meta<\/span> field assignments<\/a> in the Cornell LII<\/a> Supreme Court judgments (CASENAME, DOCKET, DECDATE) serve the purpose. The BibTeX<\/a> source served by Google Scholar works as well. The legislative metadata at legislation.gov.uk<\/a> also works. The microformats metadata<\/a> embedded in Federal cases on law.resource.org<\/a> gives us enough to work with, and the very complete details in the RECOP<\/a> material are quite useful when they are carried through in refactored pages (as they are<\/a> in the Free Law Reporter<\/a> served by CALI<\/a>).<\/p>\n One of the benefits to be anticipated, as we make our way toward improved interoperation between publishers and third-party reference management tools for law, is a reduction in the barriers to collaboration between law and other disciplines. 
Legal citation conventions are by nature quite demanding, and removing some of their sting will improve access not only to the law itself, but also to participation in its discourse.<\/p>\n All signs of rain, and very welcome for grassroots projects like CSL.<\/p>\n <\/a>Frank Bennett<\/strong><\/a> is Associate Professor in the Graduate School of Law at Nagoya University<\/a>. His active projects related to legal informatics include the citeproc-js<\/a> CSL processor, an experimental multilingual branch of the Zotero<\/a> reference manager (MLZ<\/a>), and the CitationStylist.org<\/a> initiative for creating a CSL family of legal styles.<\/p>\n The CSL language was originally conceived<\/a> by Bruce D\u2019Arcus<\/a>. The CSL 1.0 schema and specification are maintained by Bruce D\u2019Arcus<\/a> and Rintze Zelle<\/a>.<\/p>\n (Readers should kindly note that despite Frank’s tasteful choice of hat in the photo to the left, the views expressed in this post are his own, and do not necessarily reflect those of Cornell University or the Cornell Legal Information Institute.)<\/p>\n VoxPopuLII is edited by Judith Pratt.<\/a> Editor-in-Chief is Robert Richards<\/a>, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer<\/a> in the Cornell LII Lawyer Directory<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":" In the wake of a decisive victory at the Battle of Sekigahara in 1600, Tokugawa Ieyasu treated rival Japanese warlords to a simple but effective instrument of control, pioneered in the preceding Era of the Warring States. 
The Daimyo, as the defeated clan heads were known, retained control of their respective domains, but were required […]<\/a><\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[328,327,330,329],"tags":[4836,4835,4827,4830,4829,4828,4837,4832,4833,4831,4834,4826],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1342"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/comments?post=1342"}],"version-history":[{"count":491,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1342\/revisions"}],"predecessor-version":[{"id":1909,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1342\/revisions\/1909"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/media?parent=1342"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/categories?post=1342"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/tags?post=1342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}About CSL<\/h1>\n
\n
Formatting citations<\/h1>\n
<text variable=\"title\" font-style=\"italic\" strip-periods=\"true\"\/><\/pre>\n
<date variable=\"issued\" form=\"text\" date-parts=\"year\" prefix=\"[\" suffix=\"]\"\/><\/pre>\n
<macro name=\"oscola-case\">\r\n <group delimiter=\" \">\r\n <text variable=\"title\" font-style=\"italic\" strip-periods=\"true\"\/>\r\n <date variable=\"issued\" form=\"text\" date-parts=\"year\"\r\n prefix=\"[\" suffix=\"]\"\/>\r\n <number variable=\"issue\"\/>\r\n <text variable=\"container-title\"\/>\r\n <text variable=\"page-first\"\/>\r\n <\/group>\r\n<\/macro><\/pre>\n
<choose>\r\n <if type=\"legal_case\" jurisdiction=\"gb\" match=\"all\">\r\n <text macro=\"oscola-case\"\/>\r\n <\/if>\r\n<\/choose><\/pre>\n
{\r\n \"container-title\": \"All England Law Reports\",\r\n \"issued\": {\r\n \"date-parts\": [[\"1991\"]]\r\n },\r\n \"issue\": \"3\",\r\n \"page\": \"88\",\r\n \"title\": \"Jones & others v. Wright\"\r\n}<\/pre>\n
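To make the flow from generic metadata to a formatted cite concrete, here is a minimal Python sketch of what the oscola-case macro does with input of this shape. It is an illustration only, not how citeproc-js is implemented: the function names (strip_periods, first_page, oscola_case) are hypothetical, the item dictionary keys its date on "issued" to match the variable name used in the macro, and italic markup is left out.

```python
import re


def strip_periods(text):
    """Drop full stops, approximating the effect of strip-periods="true"."""
    return text.replace(".", "")


def first_page(page):
    """Approximate the page-first variable: the leading number of a page field."""
    match = re.match(r"\d+", page)
    return match.group(0) if match else page


def oscola_case(item):
    """Join the non-empty elements with one space, like <group delimiter=" ">."""
    year = item["issued"]["date-parts"][0][0]
    parts = [
        strip_periods(item.get("title", "")),
        "[{}]".format(year),  # prefix="[" and suffix="]" on the date element
        item.get("issue", ""),
        item.get("container-title", ""),
        first_page(item.get("page", "")),
    ]
    return " ".join(str(part) for part in parts if part)


# Hypothetical item, shaped like the sample input above.
item = {
    "container-title": "All England Law Reports",
    "issued": {"date-parts": [["1991"]]},
    "issue": "3",
    "page": "88",
    "title": "Jones & others v. Wright",
}

print(oscola_case(item))
# prints: Jones & others v Wright [1991] 3 All England Law Reports 88
```

The period after "v" is gone, but the journal name is still spelled out in full, which is the second discrepancy taken up in the discussion of abbreviations.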
\n
Applying abbreviations<\/h1>\n
{\r\n \"container-title\": {\r\n \"All England Law Reports\": \"All ER\"\r\n }\r\n}<\/pre>\n
<macro name=\"oscola-case\">\r\n <group delimiter=\" \">\r\n <text variable=\"title\" font-style=\"italic\" strip-periods=\"true\"\/>\r\n <date variable=\"issued\" form=\"text\" date-parts=\"year\"\r\n prefix=\"[\" suffix=\"]\"\/>\r\n <number variable=\"issue\"\/>\r\n <text variable=\"container-title\" form=\"short\"\/>\r\n <text variable=\"page-first\"\/>\r\n <\/group>\r\n<\/macro><\/pre>\n
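The lookup behind form="short" can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the citeproc-js API: the class name AbbreviationRegistry and its methods are hypothetical, the persistent store is a plain dictionary shaped like the auxiliary input shown earlier, and falling back to the full field content when no abbreviation is known is an assumption made for the example.

```python
class AbbreviationRegistry:
    """Hypothetical model of a per-style abbreviation lookup."""

    def __init__(self, persistent):
        # Persistent list for one style: {variable: {full value: abbreviation}}
        self.persistent = persistent
        # Ephemeral run-time registry, rebuilt on each run
        self.runtime = {}

    def short_form(self, variable, value):
        """Return the abbreviation if known; otherwise open an empty entry."""
        abbreviation = self.persistent.get(variable, {}).get(value)
        if abbreviation:
            return abbreviation
        self.runtime.setdefault(variable, {}).setdefault(value, "")
        return value  # assumed fallback: render the unabbreviated content

    def pending(self):
        """List the entries that still await an abbreviation from the user."""
        return [
            (variable, value)
            for variable, entries in self.runtime.items()
            for value, abbreviation in entries.items()
            if not abbreviation
        ]

    def register(self, variable, value, abbreviation):
        """Record a user-supplied abbreviation in both registries."""
        self.runtime.setdefault(variable, {})[value] = abbreviation
        self.persistent.setdefault(variable, {})[value] = abbreviation


registry = AbbreviationRegistry({})
print(registry.short_form("container-title", "All England Law Reports"))
print(registry.pending())
registry.register("container-title", "All England Law Reports", "All ER")
print(registry.short_form("container-title", "All England Law Reports"))
```

Keeping the list outside the item records is the point of the design: the item keeps its generic, spelled-out journal name, and each style carries its own abbreviations.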
\nInput from the Web<\/h1>\n
<meta name=\"DC.Title\"\r\n content=\"Smith v. Jones, 2003 CanLII 19166 (NWT RO)\"\/><\/pre>\n
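To show what hand-crafted pattern matching means in practice, here is a rough Python sketch that splits the content of this one tag into fields. Everything in it is an assumption made for illustration: the pattern was written against this single example, the group names (case_name, year, reporter, number, court_code) are guesses at what the elements might be, and the sketch makes no claim about what the court code stands for.

```python
import re

# Pattern written against one sample string; other citation forms will not match.
CANLII_CITE = re.compile(
    r"^(?P<case_name>.+), "
    r"(?P<year>\d{4}) "
    r"(?P<reporter>CanLII) "
    r"(?P<number>\d+) "
    r"\((?P<court_code>[^)]+)\)$"
)


def parse_canlii_title(content):
    """Split a formatted cite into guessed fields, or fail loudly."""
    match = CANLII_CITE.match(content)
    if match is None:
        raise ValueError("unrecognised citation form: {!r}".format(content))
    return match.groupdict()


fields = parse_canlii_title("Smith v. Jones, 2003 CanLII 19166 (NWT RO)")
for name, value in fields.items():
    print(name, "=", value)
```

Even when the pattern matches, the translator still has only opaque strings: nothing in the tag says which jurisdiction or court the final element refers to. A page that published each element in its own named field would make this function unnecessary.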
\n
In lieu of concluding<\/h1>\n
\n