skip navigation
search

 


Van Winkle wakes

In this post, we return to a topic we first visited in a book chapter in 2004.  At that time, one of us (Bruce) was an electronic publisher of Federal court cases and statutes, and the other (Hillmann, herself a former law cataloger) was working with large, aggregated repositories of scientific papers as part of the National Sciences Digital Library project.  Then, as now, we were concerned that little attention was being paid to the practical tradeoffs involved in publishing high quality metadata at low cost.  There was a tendency to design metadata schemas that said absolutely everything that could be said about an object, often at the expense of obscuring what needed to be said about it while running up unacceptable costs.  Though we did not have a name for it at the time, we were already deeply interested in least-cost, use-case-driven approaches to the design of metadata models, and that naturally led us to wonder what “good” metadata might be.  The result was “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”, published as a chapter in an ALA publication, Metadata in Practice.

In that chapter, we attempted to create a framework for talking about (and evaluating) metadata quality.  We were concerned primarily with metadata as we were then encountering it: in aggregations of repositories containing scientific preprints, educational resources, and in caselaw and other primary legal materials published on the Web.   We hoped we could create something that would be both domain-independent and useful to those who manage and evaluate metadata projects.  Whether or not we succeeded is for others to judge.

The Original Framework

At that time, we identified seven major components of metadata quality. Here, we reproduce a part of a summary table that we used to characterize the seven measures. We suggested questions that might be used to draw a bead on the various measures we proposed:

Quality Measure Quality Criteria
Completeness Does the element set completely describe the objects?
Are all relevant elements used for each object?
Provenance Who is responsible for creating, extracting, or transforming the metadata?
How was the metadata created or extracted?
What transformations have been done on the data since its creation?
Accuracy Have accepted methods been used for creation or extraction?
What has been done to ensure valid values and structure?
Are default values appropriate, and have they been appropriately used?
Conformance to expectations Does metadata describe what it claims to?
Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
Are compromises documented and in line with community expectations?
Logical consistency and coherence Is data in elements consistent throughout?
How does it compare with other data within the community?
Timeliness Is metadata regularly updated as the resources change?
Are controlled vocabularies updated when relevant?
Accessibility Is an appropriate element set for audience and community being used?
Is it affordable to use and maintain?
Does it permit further value-adds?

 

There are, of course, many possible elaborations of these criteria, and many other questions that help get at them.  Almost nine years later, we believe that the framework remains both relevant and highly useful, although (as we will discuss in a later section) we need to think carefully about whether and how it relates to the quality standards that the Linked Open Data (LOD) community is discovering for itself, and how it and other standards should affect library and publisher practices and policies.

… and the environment in which it was created

Our work was necessarily shaped by the environment we were in.  Though we never really said so explicitly, we were looking for quality not only in the data itself, but in the methods used to organize, transform and aggregate it across federated collections.  We did not, however, anticipate the speed or scale at which standards-based methods of data organization would be applied.  Commonly-used standards like FOAF, models such as those contained in schema.org, and lightweight modelling apparatus like SKOS are all things that have emerged into common use since, and of course the use of Dublin Core — our main focus eight years ago — has continued even as the standard itself has been refined.  These days, an expanded toolset makes it even more important that we have a way to talk about how well the tools fit the job at hand, and how well they have been applied. An expanded set of design choices accentuates the need to talk about how well choices have been made in particular cases.

Although our work took its inspiration from quality standards developed by a government statistical service, we had not really thought through the sheer multiplicity of information services that were available even then.  We were concerned primarily with work that had been done with descriptive metadata in digital libraries, but of course there were, and are, many more people publishing and consuming data in both the governmental and private sectors (to name just two).  Indeed, there was already a substantial literature on data quality that arose from within the management information systems (MIS) community, driven by concerns about the reliability and quality of  mission-critical data used and traded by businesses.  In today’s wider world, where work with library metadata will be strongly informed by the Linked Open Data techniques developed for a diverse array of data publishers, we need to take a broader view.  

Finally, we were driven then, as we are now, by managerial and operational concerns. As practitioners, we were well aware that metadata carries costs, and that human judgment is expensive.  We were looking for a set of indicators that would spark and sustain discussion about costs and tradeoffs.  At that time, we were mostly worried that libraries were not giving costs enough attention, and were designing metadata projects that were unrealistic given the level of detail or human intervention they required.  That is still true.  The world of Linked Data requires well-understood metadata policies and operational practices simply so publishers can know what is expected of them and consumers can know what they are getting. Those policies and practices in turn rely on quality measures that producers and consumers of metadata can understand and agree on.  In today’s world — one in which institutional resources are shrinking rather than expanding —  human intervention in the metadata quality assessment process at any level more granular than that of the entire data collection being offered will become the exception rather than the rule.   

While the methods we suggested at the time were self-consciously domain-independent, they did rest on background assumptions about the nature of the services involved and the means by which they were delivered. Our experience had been with data aggregated by communities where the data producers and consumers were to some extent known to one another, using a fairly simple technology that was easy to run and maintain.  In 2013, that is not the case; producers and consumers are increasingly remote from each other, and the technologies used are both more complex and less mature, though that is changing rapidly.

The remainder of this blog post is an attempt to reconsider our framework in that context.

The New World

The Linked Open Data (LOD) community has begun to consider quality issues; there are some noteworthy online discussions, as well as workshops resulting in a number of published papers and online resources.  It is interesting to see where the work that has come from within the LOD community contrasts with the thinking of the library community on such matters, and where it does not.  

In general, the material we have seen leans toward the traditional data-quality concerns of the MIS community.  LOD practitioners seem to have started out by putting far more emphasis than we might on criteria that are essentially audience-dependent, and on operational concerns having to do with the reliability of publishing and consumption apparatus.   As it has evolved, the discussion features an intellectual move away from those audience-dependent criteria, which are usually expressed as “fitness for use”, “relevance”, or something of the sort (we ourselves used the phrase “community expectations”). Instead, most realize that both audience and usage  are likely to be (at best) partially unknown to the publisher, at least at system design time.  In other words, the larger community has begun to grapple with something librarians have known for a while: future uses and the extent of dissemination are impossible to predict.  There is a creative tension here that is not likely to go away.  On the one hand, data developed for a particular community is likely to be much more useful to that community; thus our initial recognition of the role of “community expectations”.  On the other, dissemination of the data may reach far past the boundaries of the community that develops and publishes it.  The hope is that this tension can be resolved by integrating large data pools from diverse sources, or by taking other approaches that result in data models sufficiently large and diverse that “community expectations” can be implemented, essentially, by filtering.

For the LOD community, the path that began with  “fitness-for-use” criteria led quickly to the idea of maintaining a “neutral perspective”. Christian Fürber describes that perspective as the idea that “Data quality is the degree to which data meets quality requirements no matter who is making the requirements”.  To librarians, who have long since given up on the idea of cataloger objectivity, a phrase like “neutral perspective” may seem naive.  But it is a step forward in dealing with data whose dissemination and user community is unknown. And it is important to remember that the larger LOD community is concerned with quality in data publishing in general, and not solely with descriptive metadata, for which objectivity may no longer be of much value.  For that reason, it would be natural to expect the larger community to place greater weight on objectivity in their quality criteria than the library community feels that it can, with a strong preference for quantitative assessment wherever possible.  Librarians and others concerned with data that involves human judgment are theoretically more likely to be concerned with issues of provenance, particularly as they concern who has created and handled the data.  And indeed that is the case.

The new quality criteria, and how they stack up

Here is a simplified comparison of our 2004 criteria with three views taken from the LOD community.

Bruce & Hillmann Dodds, McDonald Flemming
Completeness Completeness
Boundedness
Typing
Amount of data
Provenance History
Attribution
Authoritative
Verifiability
Accuracy Accuracy
Typing
Validity of documents
Conformance to expectations Modeling correctness
Modeling granularity
Isomorphism
Uniformity
Logical consistency and coherence Directionality
Modeling correctness
Internal consistency
Referential correspondence
Connectedness
Consistency
Timeliness Currency Timeliness
Accessibility Intelligibility
Licensing
Sustainable
Comprehensibility
Versatility
Licensing
Accessibility (technical)
Performance (technical)

Placing the “new” criteria into our framework was no great challenge; it appears that we were, and are, talking about many of the same things. A few explanatory remarks:

  • Boundedness has roughly the same relationship to completeness that precision does to recall in information-retrieval metrics. The data is complete when we have everything we want; its boundedness shows high quality when we have only what we want.
  • Flemming’s amount of data criterion talks about numbers of triples and links, and about the interconnectedness and granularity of the data.  These seem to us to be largely completeness criteria, though things to do with linkage would more likely fall under “Logical coherence” in our world. Note, again, a certain preoccupation with things that are easy to count.  In this case it is somewhat unsatisfying; it’s not clear what the number of triples in a triplestore says about quality, or how it might be related to completeness if indeed that is what is intended.
  • Everyone lists criteria that fit well with our notions about provenance. In that connection, the most significant development has been a great deal of work on formalizing the ways in which provenance is expressed.  This is still an active level of research, with a lot to be decided.  In particular, attempts at true domain independence are not fully successful, and will probably never be so.  It appears to us that those working on the problem at DCMI are monitoring the other efforts and incorporating the most worthwhile features.
  • Dodds’ typing criterion — which basically says that dereferenceable URIs should be preferred to string literals  – participates equally in completeness and accuracy categories.  While we prefer URIs in our models, we are a little uneasy with the idea that the presence of string literals is always a sign of low quality.  Under some circumstances, for example, they might simply indicate an early stage of vocabulary evolution.
  • Flemming’s verifiability and validity criteria need a little explanation, because the terms used are easily confused with formal usages and so are a little misleading.  Verifiability bundles a set of concerns we think of as provenance.  Validity of documents is about accuracy as it is found in things like class and property usage.  Curiously, none of Flemming’s criteria have anything to do with whether the information being expressed by the data is correct in what it says about the real world; they are all designed to convey technical criteria.  The concern is not with what the data says, but with how it says it.
  • Dodds’ modeling correctness criterion seems to be about two things: whether or not the model is correctly constructed in formal terms, and whether or not it covers the subject domain in an expected way.  Thus, we assign it to both “Community expectations” and “Logical coherence” categories.
  • Isomorphism has to do with the ability to join datasets together, when they describe the same things.  In effect, it is a more formal statement of the idea that a given community will expect different models to treat similar things similarly. But there are also some very tricky (and often abused) concepts of equivalence involved; these are just beginning to receive some attention from Semantic Web researchers.
  • Licensing has become more important to everyone. That is in part because Linked Data as published in the private sector may exhibit some of the proprietary characteristics we saw as access barriers in 2004, and also because even public-sector data publishers are worried about cost recovery and appropriate-use issues.  We say more about this in a later section.
  • A number of criteria listed under Accessibility have to do with the reliability of data publishing and consumption apparatus as used in production.  Linked Data consumers want to know that the endpoints and triple stores they rely on for data are going to be up and running when they are needed.  That brings a whole set of accessibility and technical performance issues into play.  At least one website exists for the sole purpose of monitoring endpoint reliability, an obvious concern of those who build services that rely on Linked Data sources. Recently, the LII made a decision to run its own mirror of the DrugBank triplestore to eliminate problems with uptime and to guarantee low latency; performance and accessibility had become major concerns. For consumers, due diligence is important.

For us, there is a distinctly different feel to the examples that Dodds, Flemming, and others have used to illustrate their criteria; they seem to be looking at a set of phenomena that has substantial overlap with ours, but is not quite the same.  Part of it is simply the fact, mentioned earlier, that data publishers in distinct domains have distinct biases. For example, those who can’t fully believe in objectivity are forced to put greater emphasis on provenance. Others who are not publishing descriptive data that relies on human judgment feel they can rely on more  “objective” assessment methods.  But the biggest difference in the “new quality” is that it puts a great deal of emphasis on technical quality in the construction of the data model, and much less on how well the data that populates the model describes real things in the real world.  

There are three reasons for that.  The first has to do with the nature of the discussion itself. All quality discussions, simply as discussions, seem to neglect notions of factual accuracy because factual accuracy seems self-evidently a Good Thing; there’s not much to talk about.  Second, the people discussing quality in the LOD world are modelers first, and so quality is seen as adhering primarily to the model itself.  Finally, the world of the Semantic Web rests on the assumption that “anyone can say anything about anything”, For some, the egalitarian interpretation of that statement reaches the level of religion, making it very difficult to measure quality by judging whether something is factual or not; from a purist’s perspective, it’s opinions all the way down.  There is, then, a tendency to rely on formalisms and modeling technique to hold back the tide.

In 2004, we suggested a set of metadata-quality indicators suitable for managers to use in assessing projects and datasets.  An updated version of that table would look like this:


Quality Measure Quality Criteria
Completeness Does the element set completely describe the objects?
Are all relevant elements used for each object?
Does the data contain everything you expect?
Does the data contain only what you expect?
Provenance Who is responsible for creating, extracting, or transforming the metadata?
How was the metadata created or extracted?
What transformations have been done on the data since its creation?
Has a dedicated provenance vocabulary been used?
Are there authenticity measures (eg. digital signatures) in place?
Accuracy Have accepted methods been used for creation or extraction?
What has been done to ensure valid values and structure?
Are default values appropriate, and have they been appropriately used?
Are all properties and values valid/defined?
Conformance to expectations Does metadata describe what it claims to?
Does the data model describe what it claims to?
Are controlled vocabularies aligned with audience characteristics and understanding of the objects?
Are compromises documented and in line with community expectations?
Logical consistency and coherence Is data in elements consistent throughout?
How does it compare with other data within the community?
Is the data model technically correct and well structured?
Is the data model aligned with other models in the same domain?
Is the model consistent in the direction of relations?
Timeliness Is metadata regularly updated as the resources change?
Are controlled vocabularies updated when relevant?
Accessibility Is an appropriate element set for audience and community being used?
Is the data and its access methods well-documented, with exemplary queries and URIs?
Do things have human-readable labels?
Is it affordable to use and maintain?
Does it permit further value-adds?
Does it permit republication?
Is attribution required if the data is redistributed?
Are human- and machine-readable licenses available?
Accessibility — technical Are reliable, performant endpoints available?
Will the provider guarantee service (eg. via a service level agreement)?
Is the data available in bulk?
Are URIs stable?

 

The differences in the example questions reflect the differences of approach that we discussed earlier. Also, the new approach separates criteria related to technical accessibility from questions that relate to intellectual accessibility. Indeed, we suspect that “accessibility” may have been too broad a notion in the first place. Wider deployment of metadata systems and a much greater, still-evolving variety of producer-consumer scenarios and relationships have created a need to break it down further.  There are as many aspects to accessibility as there are types of barriers — economic, technical, and so on.

As before, our list is not a checklist or a set of must-haves, nor does it contain all the questions that might be asked.  Rather, we intend it as a list of representative questions that might be asked when a new Linked Data source is under consideration.  They are also questions that should inform policy discussion around the uses of Linked Data by consuming libraries and publishers.  

That is work that can be formalized and taken further. One intriguing recent development is work toward a Data Quality Management Vocabulary.   Its stated aims are to

  • support the expression of quality requirements in the same language, at web scale;
  • support the creation of consensual agreements about quality requirements
  • increase transparency around quality requirements and measures
  • enable checking for consistency among quality requirements, and
  • generally reduce the effort needed for data quality management activities

 

The apparatus to be used is a formal representation of “quality-relevant” information.   We imagine that the researchers in this area are looking forward to something like automated e-commerce in Linked Data, or at least a greater ability to do corpus-level quality assessment at a distance.  Of course, “fitness-for-use” and other criteria that can really only be seen from the perspective of the user will remain important, and there will be interplay between standardized quality and performance measures (on the one hand) and audience-relevant features on the other.   One is rather reminded of the interplay of technical specifications and “curb appeal” in choosing a new car.  That would be an important development in a Semantic Web industry that has not completely settled on what a car is really supposed to be, let alone how to steer or where one might want to go with it.

Conclusion

Libraries have always been concerned with quality criteria in their work as a creators of descriptive metadata.  One of our purposes here has been to show how those criteria will evolve as libraries become publishers of Linked Data, as we believe that they must. That much seems fairly straightforward, and there are many processes and methods by which quality criteria can be embedded in the process of metadata creation and management.

More difficult, perhaps, is deciding how these criteria can be used to construct policies for Linked Data consumption.  As we have said many times elsewhere, we believe that there are tremendous advantages and efficiencies that can be realized by linking to data and descriptions created by others, notably in connecting up information about the people and places that are mentioned in legislative information with outside information pools.   That will require care and judgement, and quality criteria such as these will be the basis for those discussions.  Not all of these criteria have matured — or ever will mature — to the point where hard-and-fast metrics exist.  We are unlikely to ever see rigid checklists or contractual clauses with bullet-pointed performance targets, at least for many of the factors we have discussed here. Some of the new accessibility criteria might be the subject of service-level agreements or other mechanisms used in electronic publishing or database-access contracts.  But the real use of these criteria is in assessments that will be made long before contracts are negotiated and signed.  In that setting, these criteria are simply the lenses that help us know quality when we see it.

References

 

Thomas R. Bruce is the Director of the Legal Information Institute at the Cornell Law School.

Diane Hillmann is a principal in Metadata Management Associates, and a long-time collaborator with the Legal Information Institute.  She is currently a member of the Advisory Board for the Dublin Core Metadata Initiative (DCMI), and was co-chair of the DCMI/RDA Task Group.

VoxPopuLII is edited by Judith Pratt. Editors-in-Chief are Stephanie Davidson and Christine Kirchberger, to whom queries should be directed.

WorldLII[Editor's Note: We are republishing here, with some corrections, a post by Dr. Núria Casellas that appeared earlier on VoxPopuLII.]

The organization and formalization of legal information for computer processing in order to support decision-making or enhance information search, retrieval and knowledge management is not recent, and neither is the need to represent legal knowledge in a machine-readable form. Nevertheless, since the first ideas of computerization of the law in the late 1940s, the appearance of the first legal information systems in the 1950s, and the first legal expert systems in the 1970s, claims, such as Hafner’s, that “searching a large database is an important and time-consuming part of legal work,” which drove the development of legal information systems during the 80s, have not yet been left behind.

Similar claims may be found nowadays as, on the one hand, the amount of available unstructured (or poorly structured) legal information and documents made available by governments, free access initiatives, blawgs, and portals on the Web will probably keep growing as the Web expands. And, on the other, the increasing quantity of legal data managed by legal publishing companies, law firms, and government agencies, together with the high quality requirements applicable to legal information/knowledge search, discovery, and management (e.g., access and privacy issues, copyright, etc.) have renewed the need to develop and implement better content management tools and methods.

Information overload, however important, is not the only concern for the future of legal knowledge management; other and growing demands are increasing the complexity of the requirements that legal information management systems and, in consequence, legal knowledge representation must face in the future. Multilingual search and retrieval of legal information to enable, for example, integrated search between the legislation of several European countries; enhanced laypersons’ understanding of and access to e-government and e-administration sites or online dispute resolution capabilities (e.g., BATNA determination); the regulatory basis and capabilities of electronic institutions or normative and multi-agent systems (MAS); and multimedia, privacy or digital rights management systems, are just some examples of these demands.

How may we enable legal information interoperability? How may we foster legal knowledge usability and reuse between information and knowledge systems? How may we go beyond the mere linking of legal documents or the use of keywords or Boolean operators for legal information search? How may we formalize legal concepts and procedures in a machine-understandable form?

In short, how may we handle the complexity of legal knowledge to enhance legal information search and retrieval or knowledge management, taking into account the structure and dynamic character of legal knowledge, its relation with common sense concepts, the distinct theoretical perspectives, the flavor and influence of legal practice in its evolution, and jurisdictional and linguistic differences?

These are challenging tasks, for which different solutions and lines of research have been proposed. Here, I would like to draw your attention to the development of semantic solutions and applications and the construction of formal structures for representing legal concepts in order to make human-machine communication and understanding possible.

Semantic metadata

For example, in the search and retrieval area, we still perform nowadays most legal searches in online or application databases using keywords (that we believe to be contained in the document that we are searching for), maybe together with a combination of Boolean operators, or supported with a set of predefined categories (metadata regarding, for example, date, type of court, etc.), a list of pre-established topics, thesauri (e.g., EuroVoc), or a synonym-enhanced search.

These searches rely mainly on syntactic matching, and — with the exception of searches enhanced with categories, synonyms, or thesauri — they will return only documents that contain the exact term searched for. To perform more complex searches, to go beyond the term, we require the search engine to understand the semantic level of legal documents; a shared understanding of the domain of knowledge becomes necessary.

Although the quest for the representation of legal concepts is not new, these efforts have recently been driven by the success of the World Wide Web (WWW) and, especially, by the later development of the Semantic Web. Sir Tim Berners-Lee described it as an extension of the Web “in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

From Web 2.0 to Web 3.0

Thus, the Semantic Web is envisaged as an extension of the current Web, which now comprises collaborative tools and social networks (the Social Web or Web 2.0). The Semantic Web is sometimes also referred to as Web 3.0, although there is no widespread agreement on this matter, as different visions exist regarding the enhancement and evolution of the current Web.

These efforts also include the Web of Data (or Linked Data), which relies on the existence of standard formats (URIs, HTTP and RDF) to allow the access and query of interrelated datasets, which may be granted through a SPARQL endpoint (e.g., Govtrack.us, US census data, etc.). Sharing and connecting data on the Web in compliance with the Linked Data principles enables the exploitation of content from different Web data sources with the development of search, browse, and other mashup applications. (See the Linking Open Data cloud diagram by Cyganiak and Jentzsch below.) [Editor's Note: Legislation.gov.uk also applies Linked Data principles to legal information, as John Sheridan explains in his recent post.]

LinkedData

Thus, to allow semantics to be added to the current Web, new languages and tools (ontologies) were needed, as the development of the Semantic Web is based on the formal representation of meaning in order to share with computers the flexibility, intuition, and capabilities of the conceptual structures of human natural languages. In the subfield of computer science and information science known as Knowledge Representation, the term “ontology” refers to a consensual and reusable vocabulary of identified concepts and their relationships regarding some phenomena of the world, which is made explicit in a machine-readable language. Ontologies may be regarded as advanced taxonomical structures, Semantic Web Stackwhere concepts are formalized as classes and defined with axioms, enriched with the description of attributes or constraints, and properties.

The task of developing interoperable technologies (ontology languages, guidelines, software, and tools) has been taken up by the World Wide Web Consortium (W3C). These technologies were arranged in the Semantic Web Stack according to increasing levels of complexity (like a layer cake). In this stack, higher layers depend on lower layers (and the latter are inherited from the original Web). These languages include XML (eXtensible Markup Language), a superset of HTML usually used to add structure to documents, and the so-called ontology languages: RDF/RDFS (Resource Description Framework/Schema), OWL, and OWL2 (Ontology Web Language). While the RDF language offers simple descriptive information about the resources on the Web, encoded in sets of triples of subject (a resource), predicate (a property or relation), and object (a resource or a value), RDFS allows the description of sets. OWL offers an even more expressive language to define structured ontologies (e.g. class disjointess, union or equivalence, etc.

Moreover, a specification to support the conversion of existing thesauri, taxonomies or subject headings into RDF triples has recently been published: the SKOS, Simple Knowledge Organization System standard. These specifications may be exploited in Linked Data efforts, such as the New York Times vocabularies. Also, EuroVoc, the multilingual thesaurus for activities of the EU is, for example, now available in this format.

Although there are different views in the literature regarding the scope of the definition or main characteristics of ontologies, the use of ontologies is seen as the key to implementing semantics for human-machine communication. Many ontologies have been built for different purposes and knowledge domains, for example:

  • OpenCyc: an open source version of the Cyc general ontology;
  • SUMO: the Suggested Upper Merged Ontology;
  • the upper ontologies PROTON (PROTo Ontology) and DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering);
  • the FRBRoo model (which represents bibliographic information);
  • the RDF representation of Dublin Core;
  • the Gene Ontology;
  • the FOAF (Friend of a Friend) ontology.

Although most domains are of interest for ontology modeling, the legal domain offers a perfect area for conceptual modeling and knowledge representation to be used in different types of intelligent applications and legal reasoning systems, not only due to its complexity as a knowledge intensive domain, but also because of the large amount of data that it generates. The use of semantically-enabled technologies for legal knowledge management could provide legal professionals and citizens with better access to legal information; enhance the storage, search, and retrieval of legal information; make possible advanced knowledge management systems; enable human-computer interaction; and even satisfy some hopes respecting automated reasoning and argumentation.

Regarding the incorporation of legal knowledge into the Web or into IT applications, or the more complex realization of the Legal Semantic Web, several directions have been taken, such as the development of XML standards for legal documentation and drafting (including Akoma Ntoso, LexML, CEN Metalex, and Norme in Rete), and the construction of legal ontologies.

Ontologizing legal knowledge

During the last decade, research on the use of legal ontologies as a technique to represent legal knowledge has increased and, as a consequence, a very interesting debate about their capacity to represent legal concepts and their relation to the different existing legal theories has arisen. It has even been suggested that ontologies could be the “missing link” between legal theory and Artificial Intelligence.

The literature suggests that legal ontologies may be distinguished by the levels of abstraction of the ideas they represent, the key distinction being between core and domain levels. Legal core ontologies model general concepts which are believed to be central for the understanding of law and may be used in all legal domains. In the past, ontologies of this type were mainly built upon insights provided by legal theory and largely influenced by normativism and legal positivism, especially by the works of Hart and Kelsen. Thus, initial legal ontology development efforts in Europe were influenced by hopes and trends in research on legal expert systems based on syllogistic approaches to legal interpretation.

More recent contributions at that level include the LKIF-Core Ontology, the LRI-Core Ontology, the DOLCE+CLO (Core Legal Ontology), and the Ontology of Fundamental Legal Concepts.Blue Scene Such ontologies usually include references to the concepts of Norm, Legal Act, and Legal Person, and may contain the formalization of deontic operators (e.g., Prohibition, Obligation, and Permission).

Domain ontologies, on the other hand, are directed towards the representation of conceptual knowledge regarding specific areas of the law or domains of practice, and are built with particular applications in mind, especially those that enable communication (shared vocabularies), or enhance indexing, search, and retrieval of legal information. Currently, most legal ontologies being developed are domain-specific ontologies, and some areas of legal knowledge have been heavily targeted, notably the representation of intellectual property rights respecting digital rights management (IPROnto Ontology, the Copyright Ontology, the Ontology of Licences, and the ALIS IP Ontology), and consumer-related legal issues (the Customer Complaint Ontology (or CContology), and the Consumer Protection Ontology). Many other well-documented ontologies have also been developed for purposes of the detection of financial fraud and other crimes; the representation of alternative dispute resolution methods, privacy compliance, patents, cases (e.g., Legal Case OWL Ontology), judicial proceedings, legal systems, and argumentation frameworks; and the multilingual retrieval of European law, among others. (See, for example, the proceedings of the JURIX and ICAIL conferences for further references.)

A socio-legal approach to legal ontology development

Thus, there are many approaches to the development of legal ontologies. Nevertheless, in the current legal ontology literature there are few explicit accounts or insights into the methods researchers use to elicit legal knowledge, and the accounts that are available reflect a lack of consensus as to the most appropriate methodology. For example, some accounts focus solely on the use of text mining techniques towards ontology learning from legal texts; while others concentrate on the analysis of legal theories and related materials to extract and formalize legal concepts. Moreover, legal ontology researchers disagree about the role that legal experts should play in ontology development and validation.

Orange SceneIn this regard, at the Institute of Law and Technology, we are developing a socio-legal approach to the construction of legal conceptual models. This approach stems from our collaboration with firms, government agencies, and nonprofit organizations (and their experts, clients, and other users) for the gathering of either explicit or tacit knowledge according to their needs. This empirically-based methodology may require the modeling of legal knowledge in practice (or professional legal knowledge, PLK), and the acquisition of knowledge through ethnographic and other social science research methods, together with the extraction (and merging) of concepts from a range of different sources (acts, regulations, case law, protocols, technical reports, etc.) and their validation by both legal experts and users.

For example, the Ontology of Professional Judicial Knowledge (OPJK) was developed in collaboration with the Spanish School of the Judicary to enhance search and retrieval capabilities of a Web-based frequentl- asked-question system (IURISERVICE) containing a repository of practical knowledge for Spanish judges in their first appointment. The knowledge was elicited from an ethnographic survey in Spanish First Instance Courts. On the other hand, the Neurona Ontologies, for a data protection compliance application, are based on the knowledge of legal experts and the requirements of enterprise asset management, together with the analysis of privacy and data protection regulations and technical risk management standards.

This approach tries to take into account many of the criticisms that developers of legal knowledge-based systems (LKBS) received during the 1980s and the beginning of the 1990s, including, primarily, the lack of legal knowledge or legal domain understanding of most LKBS development teams at the time. These criticisms were rooted in the widespread use of legal sources (statutes, case law, etc.) directly as the knowledge for the knowledge base, instead of including in the knowledge base the “expert” knowledge of lawyers or law-related professionals.

Further, in order to represent knowledge in practice (PLK), legal ontology engineering could benefit from the use of social science research methods for knowledge elicitation, institutional/organizational analysis (institutional ethnography), as well as close collaboration with legal practitioners, users, experts, and other stakeholders, in order to discover the relevant conceptual models that ought to be represented in the ontologies. Moreover, I understand the participation of these stakeholders in ontology evaluation and validation to be crucial to ensuring consensus about, and the usability of, a given legal ontology.

Challenges and drawbacks

Although the use of ontologies and the implementation of the Semantic Web vision may offer great advantages to information and knowledge management, there are great challenges and problems to be overcome.

First, the problems related to knowledge acquisition techniques and bottlenecks in software engineering are inherent in ontology engineering, and ontology development is quite a time-consuming and complex task. Second, as ontologies are directed mainly towards enabling some communication on the basis of shared conceptualizations, how are we to determine the sharedness of a concept? And how are context-dependencies or (cultural) diversities to be represented? Furthermore, how can we evaluate the content of ontologies?

Collaborative Current research is focused on overcoming these problems through the establishment of gold standards in concept extraction and ontology learning from texts, and the idea of collaborative development of legal ontologies, although these techniques might be unsuitable for the development of certain types of ontologies. Also, evaluation (validation, verification, and assessment) and quality measurement of ontologies are currently an important topic of research, especially ontology assessment and comparison for reuse purposes.

Regarding ontology reuse, the general belief is that the more abstract (or core) an ontology is, the less it owes to any particular domain and, therefore, the more reusable it becomes across domains and applications. This generates a usability-reusability trade-off that is often difficult to resolve.

Finally, once created, how are these ontologies to evolve? How are ontologies to be maintained and new concepts added to them?

Over and above these issues, in the legal domain there are taking place more particularized discussions:  for example, the discussion of the advantages and drawbacks of adopting an empirically based perspective (bottom-up), and the complexity of establishing clear connections with legal dogmatics or general legal theory approaches (top-down). To what extent are these two different perspectives on legal ontology development incompatible? How might they complement each other? What is their relationship with text-based approaches to legal ontology modeling?

I would suggest that empirically based, socio-legal methods of ontology construction constitute a bottom-up approach that enhances the usability of ontologies, while the general legal theory-based approach to ontology engineering fosters the reusability of ontologies across multiple domains.

The scholarly discussion of legal ontology development also embraces more fundamental issues, among them the capabilities of ontology languages for the representation of legal concepts, the possibilities of incorporating a legal flavor into OWL, and the implications of combining ontology languages with the formalization of rules.

Finally, the potential value to legal ontology of other approaches, areas of expertise, and domains of knowledge construction ought to be explored, for example: pragmatics and sociology of law methodologies, experiences in biomedical ontology engineering, formal ontology approaches, salamander.jpgand the relationships between legal ontology and legal epistemology, legal knowledge and common sense or world knowledge, expert and layperson’s knowledge, legal information and Linked Data possibilities, and legal dogmatics and political science (e.g., in e-Government ontologies).

As you may see, the challenges faced by legal ontology engineering are great, and the limitations of legal ontologies are substantial. Nevertheless, the potential of legal ontologies is immense. I believe that law-related professionals and legal experts have a central role to play in the successful development of legal ontologies and legal semantic applications.

[Editor's Note: For many of us, the technical aspects of ontologies and the Semantic Web are unfamiliar. Yet these technologies are increasingly being incorporated into the legal information systems that we use everyday, so it's in our interest to learn more about them. For those of us who would like a user-friendly introduction to ontologies and the Semantic Web, here are some suggestions:

Dr. Núria Casellas Dr. Núria Casellas is a visiting researcher at the Legal Information Institute at Cornell University. She is a researcher at the Institute of Law and Technology and an assistant professor at the UAB Law School (on leave). She has participated in several national and European-funded research projects regarding legal ontologies and legal knowledge management: these concern the acquisition of knowledge in judicial settings (IURISERVICE), modeling privacy compliance regulations (NEURONA), drafting legislation (DALOS), and the Legal Case Study of the Semantically Enabled Knowledge Technologies (SEKT VI Framework project), among others. Co-editor of the IDT Series, she holds a Law Degree from the Universitat Autònoma de Barcelona, a Master’s Degree in Health Care Ethics and Law from the University of Manchester, and a PhD (“Modelling Legal Knowledge through Ontologies. OPJK: the Ontology of Professional Judicial Knowledge”).

VoxPopuLII is edited by Judith Pratt. Editor in Chief is Robert Richards.