{"id":1968,"date":"2011-10-25T08:50:15","date_gmt":"2011-10-25T13:50:15","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/?p=1968"},"modified":"2011-11-01T08:34:31","modified_gmt":"2011-11-01T13:34:31","slug":"the-metalex-document-server","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2011\/10\/25\/the-metalex-document-server\/","title":{"rendered":"The MetaLex Document Server"},"content":{"rendered":"
<\/a>In this post I describe the process and requirements that eventually led to the\u00a0MetaLex Document Server<\/a>, a server that hosts all versions of Dutch national statutes and regulations published since May 2011, both as\u00a0CEN MetaLex<\/a>, and as\u00a0Linked Data<\/a>. Before I set out to do so, however, I would like to emphasize that, although the development of the server and its contents was a one-man-job, the road to make it possible surely was not solitary. A couple of people I’d like to mention here are Alexander Boer<\/a>, Radboud Winkels<\/a>, and Tom van Engers<\/a> of the Leibniz Center for Law<\/a>, together with whom I have worked over the past ten years to develop, test, and publish the ideas that underlie CEN MetaLex. Also, the team around\u00a0legislation.gov.uk<\/a> clearly has done a lot of great and inspiring\u00a0 work in this area.<\/p>\n So, what happened? Over the course of last spring, I was involved in several small-scale projects that shared a specific need: version-aware identifiers for all parts of legislative texts. \u00a0The first of these was a report for the\u00a0Swiss Federal Chancellery<\/a> on possible technological solutions for a regulation drafting system to be used by the Swiss government. Second to arrive on my desk was a project for the\u00a0Dutch Tax and Customs Administration (Belastingdienst),<\/a> in which we were asked to develop a concept-extraction toolkit that would allow them to make explicit where concepts are defined, where they are reused, and how they relate to other concepts (e.g.<\/em>, from an external thesaurus). The purpose of this project was to investigate whether we could replace with technology what is currently a manual process of turning legislation into business processes that fuel citizen- and business-oriented services. The Belastingdienst needs this to better cope with the yearly changes to tax regulation issued by the\u00a0Ministry of Finance<\/a>. 
The\u00a0Dutch Immigration and Naturalisation Service (IND)<\/a> faces exactly the same problem: discovering which parts of their business processes are affected by each legislative modification. Updates to legislation require continuous, significant investment in IT re-engineering.<\/p>\n <\/a><\/p>\n But don’t modern European governments already have elaborate facilities for supporting this workflow? I’m afraid not.<\/p>\n Currently, regulation drafting is a process of sending around Word documents, copy-and-pasting from older texts, “version hell,” signing by a Minister, and sending the enacted regulations off to a publisher, who will then turn them into some XML format to feed a publishing platform to generate HTML, PDF, and paper versions of the texts. This process is not designed from a content management perspective, and most if not all metadata is thrown away in the process.<\/p>\n Part of the problem is one of\u00a0organisational change<\/em>: convincing legislative drafters to use a more structured approach in their daily work. The\u00a0Dutch Ministry of Security and Justice<\/a> is currently developing a legislative editing environment (similar to the\u00a0MetaVex<\/a> editor developed at the\u00a0University of Amsterdam<\/a>), but it will take a while before this is adopted in practice.<\/p>\n To develop a chain of tools for managing legal information, both as text and as knowledge models, we need to address a number of key requirements:<\/p>\n A versioning mechanism should distinguish between a regulation text as it exists at a particular time, and the final regulation. 
The IFLA Functional Requirements for Bibliographic Records (FRBR)<\/a> (Saur, ’98) makes the following distinctions: the work as a “distinguishable intellectual or artistic creation” (e.g.<\/em>, the constitution); the expression as the “intellectual or artistic form that a work takes each time it is realised” (e.g.<\/em>, “The Constitution of July 15th, 2008”); the manifestation as the “physical embodiment of an expression of a work” (e.g.,<\/em> a PDF version of “The Constitution of July 15th, 2008”); and the item as a “single exemplar of a manifestation” (e.g.<\/em>, the PDF version of “The Constitution of July 15th, 2008” residing on my USB stick).<\/p>\n As we don’t have any time to waste, and have neither the organisational infrastructure, nor the funds, to use or develop any other (richer) information source, we need to make do with what’s currently available. How hard is it to build a chain of tools that meets at least part of these requirements? And, what information does the Dutch government already provide on which we can build the services that it itself so dearly needs?<\/p>\n <\/p>\n Wetten.overheid.nl<\/a> is the de facto source for legislative information in The Netherlands.\u00a0Users can perform a full text search through the titles and text of all statutes and regulations of the Kingdom of the Netherlands. They can search for a specific article, as well as for the version of a text as it stood at a specified date. Wetten.overheid.nl also provides an API for retrieving XML manifestations of statutes and regulations.<\/p>\n What about identifiers? Wetten.nl supports deeplinks to particular versions of statutes and regulations, but is not very consistent about it. 
For example:<\/p>\n http:\/\/wetten.overheid.nl\/cgi-bin\/deeplink\/law1\/bwbid=BWBR0005416\/article=6\/date=2005-01-14<\/a><\/p>\n and<\/p>\n http:\/\/wetten.overheid.nl\/BWBR0005416\/TitelII698946\/HoofdstukII\/Artikel16\/geldigheidsdatum_14-01-2005<\/a><\/p>\n both point to article 6 of the Municipal law, as it was in effect on January 14th, 2005. A third mechanism for identifying regulations is the\u00a0Juriconnect<\/a> standard for referring to parts of statutes and regulations. XML documents hosted by Wetten.nl use these identifiers to specify citations between statutes and regulations.\u00a0For instance, the Juriconnect identifier for article 6 of the Municipal law is:<\/p>\n However, the standard does not prescribe a mechanism for dereferencing an identifier to the actual text of (part of) a statute or regulation.<\/p>\n XML manifestations of statutes and regulations are retrievable through an API on top of the “Basiswettenbestand” (BWB) content management system. This REST<\/a> Web service only provides the latest version of an entire statute or regulation. The BWB XML document returned is stripped of all version history: it does not even contain the version date of the text itself.<\/p>\n <\/a><\/p>\n An index of all BWB identifiers, with basic attributes such as official and abbreviated titles, enactment and publication dates, retroactivity, etc., is available as a\u00a0zipped XML dump<\/a> or\u00a0a SOAP service<\/a>. Unfortunately, the XML file is\u00a0corrupt<\/strong>, and the date of the latest change to a statute or regulation reported in the index is\u00a0not really the date of the latest modification<\/em>, but of the latest update of the statute or regulation in the CMS. 
\u00a0See the picture above.<\/p>\n The BWB uses its own XML schema for storing statutes and regulations; this schema does not allow for intermixing with any third-party elements or attributes, ruling out obvious extensions for rich annotations such as\u00a0RDFa<\/a>.\u00a0And, BWB XML elements do not carry any identifiers.<\/p>\n CEN MetaLex<\/a> is a jurisdiction-independent XML format for representing, publishing, and interchanging legal texts. It was developed to allow\u00a0traceability<\/strong> of legal knowledge representations to their original source.\u00a0MetaLex elements are purely structural. Syntactic elements (structure) are strictly distinct from the meaning of elements by specifying for each element a\u00a0name<\/strong> and its\u00a0content model<\/strong>. What this essentially does is to pave the way for a semantic description of the types of content of elements in an XML document. The standard prescribes the existence of a naming convention for minting URI-based identifiers for all structural elements of a legal document. MetaLex explicitly encourages the use of RDFa attributes on its elements, and provides special metadata-elements for serialising additional RDF triples that cannot be expressed on structural elements themselves. MetaLex includes an ontology, which defines an event model for legislative modifications. The\u00a0legislation.gov.uk<\/a> portal has adopted the MetaLex event model for representing modifications.<\/p>\n The MetaLex schema is designed to be independent of jurisdiction, which means that it should be possible to map each legacy XML element to a MetaLex element in an unambiguous fashion. Fortunately, we were able to define a straightforward 1:n<\/em> mapping between BWB and MetaLex (see below) by a semi-automatic conversion of the BWB XML DTD.<\/p>\n The transformation of legacy XML to MetaLex and RDF is implemented in the MetaLex converter, an open source\u00a0Python script available from GitHub<\/a>. 
Conversion occurs in four stages:<\/p>\n For the transformation of BWB XML files, the converter is sequentially fed with all BWB XML files and identifiers listed in the BWB ID index. Based on the mapping table, the converter traverses the\u00a0DOM<\/a> tree of the source document, and synchronously builds a DOM tree for the target document. In cases where the MetaLex schema doesn’t quite “fit,” the converter has to make additional repairs.<\/p>\n <\/a><\/p>\n We evaluated the ability of MetaLex to map onto the BWB XML by running the converter on 300 randomly selected BWB identifiers. The\u00a0 Because of the limitations of the API, version information, citation titles, and other metadata are retrieved through a custom-built scraper of the information pages on the wetten.nl Website.<\/p>\n For every element in the document we create transparent URL-like URIs for the\u00a0work<\/strong>,\u00a0expression<\/strong> and\u00a0manifestation<\/strong> levels, and two opaque URIs for the\u00a0expression<\/strong> and\u00a0item<\/strong> levels in the FRBR specification.<\/p>\n For transparent URIs, we use a naming scheme that is based on the\u00a0URIs used at legislation.gov.uk<\/a>, with slight adaptations to allow for the Dutch situation. In short, work level identifiers are based on the standard BWB identifier, followed by a hierarchical path to an element in the source, e.g.<\/em>, “chapter\/1\/article\/1”. These URIs are extended to expression URIs by appending version and language information. Similarly, manifestation URIs are extensions of expression URIs that specify format information such as XML, RDF, etc.\u00a0 Juriconnect references in the source BWB XML are automatically translated to this naming scheme.<\/p>\n The\u00a0opaque version URI<\/strong> is needed to distinguish different versions of a text. 
The current Webservice does not provide access to all versions of statutes and regulations (only to the latest), let alone at a level of granularity lower than entire statutes or regulations. We therefore need some way of constructing a version history by regularly checking for new versions, and comparing them to those we looked at before. By including in the opaque URI a unique SHA1<\/a> hash<\/strong> of the textual content of an XML element, and simultaneously maintaining a link between the opaque URI and the transparent identifier, different expressions of a work can be automatically distinguished through time. This is needed to work around issues with identifiers based on numbers: the insertion of a new element can change the position (and therefore the identifier) of other elements without a change in the content of the elements.<\/p>\n By this method, globally persistent URIs of every element in a legal text can be consistently generated for both current and future versions of the text. By simultaneously generating an opaque and a transparent expression level URI, identification of these text versions does not have to rely on numbering.<\/p>\n The MetaLex converter produces three types of metadata. First,\u00a0legacy<\/strong> metadata from attributes in the source XML is directly translated to RDF triples. Second, metadata describing the\u00a0structural<\/strong> and\u00a0identity<\/strong> relations between elements is created. This includes typing resources according to the MetaLex ontology, e.g.<\/em>, as\u00a0 <\/a><\/p>\n Event information plays a central role in determining\u00a0what<\/em> version of a regulation was valid\u00a0when<\/em>. 
Making explicit which events and modifying processes contributed to an expression of a regulation provides for a flexible and extensible model.\u00a0The MDS uses the\u00a0MetaLex ontology<\/a> for legislative modification events, the\u00a0Simple Event Model (SEM)<\/a> and the\u00a0W3C Time Ontology<\/a> for an abstract description of events and event types, and the\u00a0Open Provenance Model Vocabulary (OPMV)<\/a> for describing processes and provenance information. These vocabularies can be combined in a compatible fashion, allowing for maximal reuse of event and process descriptions by third parties that may not necessarily commit to the MetaLex ontology.<\/p>\n <\/a><\/strong><\/p>\n The MetaLex converter supports three formats for serialising a legal text to a manifestation: the MetaLex format itself, viewable in a browser by\u00a0linking a CSS stylesheet<\/a>; RDFa, Turtle and RDF\/XML serializations of the RDF metadata; and a citation graph.\u00a0The converter can automatically upload RDF to a triple store through either the\u00a0Sesame API<\/a>, or\u00a0SPARQL updates<\/a>. The citation graph is exported as\u00a0a “.net” network file, for further analysis in social network software tools such as\u00a0Pajek<\/a> and\u00a0Gephi<\/a>. We are exploring ways to use these networks for determining the importance of articles (in-degree) and the dependency of legislation on certain articles (betweenness centrality), and for analysing the correlation between legislation and case law.<\/p>\n The results of this procedure are published through the\u00a0MetaLex Document Server (MDS)<\/a>. The server follows the\u00a0Cool URIs<\/a> specification, and implements HTTP-based redirects for work- and expression-level URIs to corresponding manifestations based on the HTTP Accept header.\u00a0Requests for an HTML MIME type are redirected to a\u00a0Marbles<\/a> HTML rendering of a\u00a0Symmetric Concise Bounded Description (SCBD)<\/a> of the RDF resource. 
Similarly, requests for RDF content return the SCBD itself; supported formats are RDF\/XML and Turtle.\u00a0A request for XML will return a snippet of MetaLex for the specified part of a statute or regulation.<\/p>\n <\/a><\/p>\n The MDS provides two convenient methods for retrieving manifestations of a statute or regulation. Appending “\/latest” to a work URI will redirect to the latest expression present in the triple store. Appending an arbitrary ISO date will return the last expression published before that date if no direct match is available. Lastly, the MDS offers a\u00a0simple search interface<\/a> for finding statutes and regulations based on the title and version date.<\/p>\n <\/a>We have been running the converter on a daily basis, on all versions of statutes and regulations made available through the wetten.nl portal since May 2011. This has resulted in a current total of 29,120<\/strong> document versions: 28 thousand versions in the first run, the rest accumulated through time. For these document versions, we now store 119 million<\/strong> triples of RDF metadata in a 4Store<\/a> triple store. Compared to the size of legislation.gov.uk (1.9 billion triples, since the 1200s), this is a modest number, but at the current growth rate we will soon need to look for alternative (more professional) solutions. Check the http:\/\/doc.metalex.eu<\/a> Website for the latest numbers.<\/p>\n I am happy to say, also, that this work has not gone unnoticed. The IND was particularly enthused by the versioning mechanism, and is in the process of adopting the MDS approach as their internal content management system. Similarly, the ability to link concept descriptions to reliably versioned parts of legislation has been an eye opener for the Belastingdienst. We are also in touch with several people at\u00a0ICTU<\/a>, the organisation behind Wetten.nl, to help them improve their services.<\/p>\n <\/a>Dr. 
Rinke Hoekstra<\/a><\/strong> is a postdoctoral researcher at the Leibniz Center for Law at the University of Amsterdam<\/a>. He is the developer of the MetaLex Document Server<\/a>, the principal author of the LKIF Core ontology of basic legal concepts<\/a>, and one of the initial developers of the MetaLex XML format for legal sources<\/a>.<\/p>\n VoxPopuLII is edited by Judith Pratt.<\/a> Editor-in-Chief is Robert Richards<\/a>, to whom queries should be directed. The statements above are not legal advice or legal representation. If you require legal advice, consult a lawyer. Find a lawyer<\/a> in the Cornell LII Lawyer Directory<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":" In this post I describe the process and requirements that eventually led to the\u00a0MetaLex Document Server, a server that hosts all versions of Dutch national statutes and regulations published since May 2011, both as\u00a0CEN MetaLex, and as\u00a0Linked Data. Before I set out to do so, however, I would like to emphasize that, although the development 
[…]<\/a><\/p>\n","protected":false},"author":72,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[456,329,296,471,464,504,4845,293],"tags":[588,4853,4855,4851,584,4848,4849,4856,4986,4857,511,509,583,640,4846,4852,4854,4850,4847],"_links":{"self":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1968"}],"collection":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/users\/72"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/comments?post=1968"}],"version-history":[{"count":81,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1968\/revisions"}],"predecessor-version":[{"id":2016,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/posts\/1968\/revisions\/2016"}],"wp:attachment":[{"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/media?parent=1968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/categories?post=1968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.law.cornell.edu\/voxpop\/wp-json\/wp\/v2\/tags?post=1968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}The root of the problem<\/strong><\/h3>\n
Requirements<\/strong><\/h3>\n
\n
\n
Making do with what we’ve got<\/strong><\/h3>\n
Identifiers<\/strong><\/h3>\n
1.0:c:BWBR0005416&artikel=16&g=2005-01-14<\/code><\/p>\n
The BWB XML service and format<\/strong><\/h3>\n
A more general XML format: CEN MetaLex<\/strong><\/h3>\n
Getting from BWB to CEN MetaLex<\/strong><\/h3>\n
\n
Step 1: Mapping<\/strong><\/h4>\n
artikel<\/code> element accounts for 72% of all corrections, and corresponds to 68% of all\u00a0
htitle<\/code> substitutions (5% of total). This means that only a very small part of BWB XML does not directly fit onto the MetaLex schema. The main cause for incompatibility is the restriction in MetaLex that\u00a0
hcontainer<\/code> elements are not allowed to contain\u00a0
block<\/code> elements (and that’s perhaps something to consider for the MetaLex workshop).<\/p>\n
Step 2: Minting Identifiers<\/strong><\/h4>\n
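The hash-based opaque identifiers described for this step can be illustrated with a minimal Python sketch. It is not the converter's actual code: the base URI, the whitespace normalisation, and the function names are assumptions made for illustration. It only shows why a content hash stays stable when an inserted element shifts the position of its siblings.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical base URI, chosen for illustration; not the scheme used by the MDS.
BASE = "http://example.org/id/"


def text_content(element):
    """Concatenate all text inside an element and normalise whitespace (an assumption)."""
    return " ".join("".join(element.itertext()).split())


def opaque_uri(element):
    """Mint an opaque URI from the SHA-1 hash of the element's textual content."""
    digest = hashlib.sha1(text_content(element).encode("utf-8")).hexdigest()
    return BASE + digest


old = ET.fromstring("<chapter><article>A</article><article>B</article></chapter>")
new = ET.fromstring(
    "<chapter><article>New</article><article>A</article><article>B</article></chapter>"
)

# Article "B" moves from the second to the third position, so a number-based
# identifier would change, but the hash-based identifier does not.
print(opaque_uri(old[1]) == opaque_uri(new[2]))
```

In a real setting each opaque URI would additionally be linked to the transparent, path-based identifier of the same element, as the post describes.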
Step 3: Producing Metadata<\/h4>\n
ml:BibliographicExpression<\/code>; creating\u00a0
ml:realizes<\/code> relations between expressions and works;\u00a0and creating
owl:sameAs<\/code> relations between opaque and transparent expression URIs.\u00a0The official title, abbreviation, and publication date of statutes and regulations are represented using the\u00a0
dcterms:title<\/code>,\u00a0
dcterms:alternative<\/code> and\u00a0
dct:valid<\/code> properties.<\/p>\n
Events and Processes<\/strong><\/h3>\n
Step 4: Serialization<\/strong><\/h4>\n
Publication<\/strong><\/h3>\n
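The date-based lookup that the MDS offers ("/latest", or an ISO date that falls back to the last earlier expression) can be sketched as follows. This is an illustrative reconstruction under assumptions, not the server's implementation: the function name and the in-memory list of version dates are hypothetical, whereas the real server resolves expressions against its triple store.

```python
from bisect import bisect_right
from datetime import date


def resolve_expression(versions, requested=None):
    """Pick the version date of the expression to serve for a work.

    versions:  version dates known for the work (hypothetical in-memory list).
    requested: None mimics "/latest"; otherwise an exact match is preferred,
               falling back to the most recent version before the request.
    """
    ordered = sorted(versions)
    if not ordered:
        return None
    if requested is None:
        return ordered[-1]
    # bisect_right returns the position just after any entry equal to
    # `requested`, so the entry before it is the match or the closest earlier one.
    index = bisect_right(ordered, requested)
    return ordered[index - 1] if index else None


versions = [date(2011, 5, 1), date(2011, 7, 1), date(2011, 10, 1)]
print(resolve_expression(versions))
print(resolve_expression(versions, date.fromisoformat("2011-08-15")))
```

A request for a date earlier than the first known version returns `None` in this sketch; how the actual server responds in that case is not stated in the post.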
Results and Take Up<\/strong><\/h3>\n