{"id":215,"date":"2010-04-30T04:15:54","date_gmt":"2010-04-30T09:15:54","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/2010\/04\/30\/jurmeta-new-metadata-initiative-for-legal-documents\/"},"modified":"2010-07-17T10:38:52","modified_gmt":"2010-07-17T15:38:52","slug":"jurmeta-new-metadata-initiative-for-legal-documents","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2010\/04\/30\/jurmeta-new-metadata-initiative-for-legal-documents\/","title":{"rendered":"jurMeta – New Metadata Initiative for Legal Documents"},"content":{"rendered":"

This article is about jurMeta<\/a>, a new metadata initiative for legal texts that I initiated with two colleagues in Germany. Our vision for jurMeta goes beyond a standard set of metadata tags and predefined legal terms. jurMeta is a complete concept for the automatic annotation, exchange, and linking of legal documents on the Internet by a wide community. Technically, this will be achieved by easy-to-use plugins<\/a> for blog systems and content management systems<\/a>. With such a plugin installed, a citation to a court decision in a blog post could be linked automatically to the decision itself on another website. On http:\/\/jurmeta.de,<\/a> I have already started a platform for the standard itself and for tools that can be built on top of it in the future.<\/p>\n

We need a practical standard <\/strong><\/p>\n

After having Googled metadata standards for legal texts<\/a> for hours, having screened dozens of projects<\/a>, and having seen a lot of ontologies and legal topic maps<\/a>, I suggested to my colleague Andreas Bock<\/a>: \u201cLet’s go and create our own metadata standard. Everything I’ve seen so far is too complex and does not fit our practical needs.\u201d<\/p>\n

In August 2009, we had just released kjur.de<\/a> – “Recht einfach finden” — “Law easily found” — a small, Google<\/a>-like search engine for legal documents in the German language. The concept is to offer better-structured access to freely available legal content by avoiding all that noise that bothers you when using one of the big search engines.<\/p>\n

Some other vertical search engines for legal content (such as these<\/a> and Google Scholar<\/a>) have been developed in recent years. But most are built on a Google custom search<\/a>, and so they offer only limited possibilities for analyzing content. We chose a flexible but more difficult method, and designed our back-end on top of an existing crawler<\/a>. Legal acts<\/a>, blogs<\/a>, Wikipedia<\/a> articles, and a bibliography were the first content types in kjur.de. Unfortunately, a few weeks after the release, the nice beep of a hard disc head crash and a badly configured RAID<\/a> sent a key part of our document processing into digital nirvana.<\/p>\n

However, this created a golden opportunity to address the increased demands on our system. We decided to shift from a less structured to a more structured approach to data processing and storage. Our new, flexible, and scalable process mined the documents for particular legal information and automatically annotated the documents with metadata, marked up in XML<\/a>. In homage to Yahoo Pipes<\/a>, we called our process \u201ckjur pipe.\u201d The first goal was to offer more detailed access to legal information and, with query expansion<\/a>, to provide an extended search. In order to be able to do this, we needed a document and metadata standard.<\/p>\n

jurMeta at the EDV-Gerichtstag 2009<\/strong><\/p>\n

Part two of the history of jurMeta concerns a session at the EDV-Gerichtstag<\/a> in Saarbr\u00fccken, a conference on technical aspects of legal documents that is held annually in Germany. Ralf Zosel<\/a>, who organized the session about free legal online projects, invited us to talk about our experiences with our project. Ralf created http:\/\/jurawiki.de<\/a>, the first and best-known legal online community in Germany.<\/p>\n

We spoke about the accessibility of statutes and court decisions on governmental websites.<\/a> The German landscape of official websites with legal content is split up into a large number of very heterogeneous portals. For example, at http:\/\/www.gesetze-im-internet.de<\/a>, one can find most of the consolidated federal acts. Court decisions are available from five different sites: the Federal Constitutional Court<\/a>, the Federal Civil Court<\/a>, the Federal Employment Court<\/a>, the Federal Patent Court<\/a>, and the Federal Finance Court<\/a>. Additionally, most of the 16 federal states<\/a> have their own databases with court decisions<\/a> and the most important statutes<\/a>.<\/p>\n

An interesting detail is that many of these free databases provided by the government authorities are hosted by a publisher who sells the same documents on its own web portal<\/a>. That these sites are protected by<\/a> robots.txt files<\/a> and meta tags<\/a> clearly demonstrates that search spiders<\/a> are not welcome. Some of the official sites allow crawling, but use complicated JavaScript<\/a> that can cause infinite crawling loops. As a result, except for the database of the Federal Constitutional Court<\/a>, the German Yahoo<\/a> and Google<\/a> indices contain not a single court decision indexed from the federal courts in Germany.<\/p>\n
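<p>To illustrate how a well-behaved crawler reads such a robots.txt file, here is a minimal sketch in Python using the standard library. The robots.txt content, the bot name, and the URL are made-up placeholders, not taken from any real court website, and this is not the code of our own crawler.<\/p>\n

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that shuts out every crawler.
# It is an assumption for illustration, not copied from a real site.
SAMPLE_ROBOTS_TXT = '''
User-agent: *
Disallow: /
'''


def may_crawl(robots_txt, user_agent, url):
    '''Return True if the given robots.txt text lets user_agent fetch url.'''
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# 'example-legal-bot' and the URL below are placeholder names.
allowed = may_crawl(SAMPLE_ROBOTS_TXT,
                    'example-legal-bot',
                    'https://courts.example/decisions/1.html')
print(allowed)
```

<p>A polite spider performs a check like this before every request, which is why a single blanket Disallow rule is enough to keep a whole database out of the search indices.<\/p>\n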

Moreover, the automatic extraction of metadata from online German legal documents poses a big problem. Let me demonstrate this with a metadata field that exists in basically every .doc<\/a>, .pdf<\/a>, or .html<\/a> file: the title. The users of our search engine like to see search results with titles that say something about what they will find behind the link; for example, the name of the court, the file number, and the date. Unfortunately, the titles of the crawled documents rarely contain any useful information. Most of the titles of court decisions on public websites contain the file name of the MS-Word template, the name of the court secretary who wrote the decision, or simply \u201cNew Document.doc.\u201d So we had to write a script for metadata extraction in order to build meaningful titles on our own.<\/p>\n
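<p>The idea behind such a title script can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not our production code: the list of court names, the two patterns, and the sample text are all hypothetical, and real decisions vary far more than this.<\/p>\n

```python
import re

# Hypothetical, deliberately short list of court names to look for.
KNOWN_COURTS = ['Bundesverfassungsgericht', 'Bundesgerichtshof', 'Bundesarbeitsgericht']

# Assumed shapes: a file number like '9 XYZ 123/09' and a date like '01.02.2009'.
# Character classes are used instead of backslash escapes.
FILE_NUMBER = re.compile('[0-9]+ [A-Za-z]+ [0-9]+/[0-9]{2}')
DATE = re.compile('[0-9]{1,2}[.][0-9]{1,2}[.][0-9]{4}')


def build_title(text, fallback='Untitled decision'):
    '''Build a readable title from court name, file number, and date found in text.'''
    court = next((name for name in KNOWN_COURTS if name in text), None)
    file_number = FILE_NUMBER.search(text)
    date = DATE.search(text)
    parts = [
        court,
        file_number.group(0) if file_number else None,
        date.group(0) if date else None,
    ]
    # Keep only the pieces that were actually found.
    found = [part for part in parts if part]
    return ', '.join(found) if found else fallback


# Made-up sample text; the file number does not refer to a real case.
sample = 'Bundesgerichtshof, Urteil vom 01.02.2009, Az. 9 XYZ 123/09'
print(build_title(sample))
```

<p>When none of the three fields can be found, the function returns a fallback string, which is still more honest than showing the name of a Word template.<\/p>\n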

Therefore, in the session at the EDV-Gerichtstag 2009<\/a>, my colleague suggested to the government that they publish official legal texts on a centralized website — and that those texts should be structured and should contain descriptive metadata, if those metadata appear in the original documents. If the government took these steps, my colleague contended, the publishers of legal information would be free to compete with respect to creating the best information system and the best added value for the user, and would not spend too much time or money adding technical structure to primary legal texts.<\/p>\n

How to give birth to a vital metadata baby<\/strong><\/p>\n

My role in the session was to offer some ideas about what a metadata standard could look like. When I prepared my slides<\/a>, I was very skeptical. I was afraid of presenting a stillborn child, because nobody needs a new metadata standard. Who should use it — and why? I had had my very own experiences with XML metadata-annotated documents when, in 2001, I created a system called VERIS<\/a>, a database containing public procurement case law. It was the first database — and I think the only one — that allowed the user to export court decisions in the format of the XML Standard of Court Decisions of Saarbr\u00fccken<\/a>. Nobody was interested in this functionality except for the person who copied the whole database piece by piece and created his own in order to sell it. Although a court ordered him to stop, he demonstrated the interoperability of legal XML documents<\/a>.<\/p>\n

Focusing on Web 2.0<\/strong><\/p>\n

I assume that because big publishers<\/a> have their own standards that depend on their business processes, they are not very interested in a new metadata standard. I am convinced that the largest audience for a new metadata standard can be found in the Web 2.0<\/a> community.<\/p>\n

A lot of incredibly good and interesting legal web projects are driven by hundreds, perhaps thousands of people, mostly technically talented lawyers. Here are some examples from the German websphere:<\/p>\n