{"id":3012,"date":"2013-01-24T11:23:04","date_gmt":"2013-01-24T16:23:04","guid":{"rendered":"http:\/\/blog.law.cornell.edu\/voxpop\/?p=3012"},"modified":"2013-02-25T11:21:50","modified_gmt":"2013-02-25T16:21:50","slug":"metadata-quality-in-a-linked-data-context","status":"publish","type":"post","link":"https:\/\/blog.law.cornell.edu\/voxpop\/2013\/01\/24\/metadata-quality-in-a-linked-data-context\/","title":{"rendered":"Metadata Quality in a Linked Data Context"},"content":{"rendered":"


\"\"<\/a>Van Winkle wakes<\/span><\/h1>\n

In this post, we return to a topic we first visited in a book chapter in 2004. At that time, one of us (Bruce) was an electronic publisher of Federal court cases and statutes, and the other (Hillmann, herself a former law cataloger) was working with large, aggregated repositories of scientific papers as part of the National Science Digital Library project. Then, as now, we were concerned that little attention was being paid to the practical tradeoffs involved in publishing high-quality metadata at low cost. There was a tendency to design metadata schemas that said absolutely everything that *could* be said about an object, often at the expense of obscuring what *needed* to be said about it, while running up unacceptable costs. Though we did not have a name for it at the time, we were already deeply interested in least-cost, use-case-driven approaches to the design of metadata models, and that naturally led us to wonder what “good” metadata might be. The result was “The Continuum of Metadata Quality: Defining, Expressing, Exploiting”, published as a chapter in an ALA publication, *Metadata in Practice*.

In that chapter, we attempted to create a framework for talking about (and evaluating) metadata quality. We were concerned primarily with metadata as we were then encountering it: in aggregations of repositories containing scientific preprints and educational resources, and in caselaw and other primary legal materials published on the Web. We hoped we could create something that would be both domain-independent and useful to those who manage and evaluate metadata projects. Whether or not we succeeded is for others to judge.

## The Original Framework

At that time, we identified seven major components of metadata quality. Here we reproduce part of the summary table we used to characterize the seven measures, along with the questions we suggested for drawing a bead on each:
| Quality Measure | Quality Criteria |
| --- | --- |
| Completeness | Does the element set completely describe the objects? Are all relevant elements used for each object? |
| Provenance | Who is responsible for creating, extracting, or transforming the metadata? How was the metadata created or extracted? What transformations have been done on the data since its creation? |
| Accuracy | Have accepted methods been used for creation or extraction? What has been done to ensure valid values and structure? Are default values appropriate, and have they been appropriately used? |
| Conformance to expectations | Does metadata describe what it claims to? Are controlled vocabularies aligned with audience characteristics and understanding of the objects? Are compromises documented and in line with community expectations? |
| Logical consistency and coherence | Is data in elements consistent throughout? How does it compare with other data within the community? |
| Timeliness | Is metadata regularly updated as the resources change? Are controlled vocabularies updated when relevant? |
| Accessibility | Is an appropriate element set for audience and community being used? Is it affordable to use and maintain? Does it permit further value-adds? |


There are, of course, many possible elaborations of these criteria, and many other questions that help get at them. Almost nine years later, we believe that the framework remains both relevant and highly useful, although (as we will discuss in a later section) we need to think carefully about whether and how it relates to the quality standards that the Linked Open Data (LOD) community is discovering for itself, and how it and other standards should affect library and publisher practices and policies.

## … and the environment in which it was created

Our work was necessarily shaped by the environment we were in. Though we never really said so explicitly, we were looking for quality not only in the data itself, but in the methods used to organize, transform, and aggregate it across federated collections. We did not, however, anticipate the speed or scale at which standards-based methods of data organization would be applied. Commonly used standards like FOAF, models such as those contained in schema.org, and lightweight modelling apparatus like SKOS have all emerged into common use since then, and of course the use of Dublin Core — our main focus eight years ago — has continued even as the standard itself has been refined. These days, an expanded toolset makes it even more important that we have a way to talk about how well the tools fit the job at hand, and how well they have been applied; an expanded set of design choices accentuates the need to talk about how well those choices have been made in particular cases.
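To make that toolset concrete, here is a minimal sketch, not taken from the original chapter, of what one descriptive record might look like when expressed with Dublin Core terms and a SKOS concept using Python's rdflib; all URIs and field values are invented for the example.

```python
# Minimal sketch: one descriptive record expressed as Linked Data using
# Dublin Core terms and a SKOS concept. All URIs and values are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, RDF, SKOS

g = Graph()
doc = URIRef("http://example.org/cases/2013-001")        # hypothetical case identifier
topic = URIRef("http://example.org/vocab/contract-law")  # hypothetical concept URI

# Descriptive metadata using Dublin Core terms
g.add((doc, DCTERMS.title, Literal("Example v. Example (2013)")))
g.add((doc, DCTERMS.creator, Literal("United States Court of Appeals")))
g.add((doc, DCTERMS.issued, Literal("2013-01-24")))
g.add((doc, DCTERMS.subject, topic))

# A lightweight SKOS concept standing in for a controlled-vocabulary entry
g.add((topic, RDF.type, SKOS.Concept))
g.add((topic, SKOS.prefLabel, Literal("Contract law", lang="en")))

print(g.serialize(format="turtle"))
```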

Although our work took its inspiration from quality standards developed by a government statistical service, we had not really thought through the sheer multiplicity of information services that were available even then. We were concerned primarily with work that had been done with descriptive metadata in digital libraries, but of course there were, and are, many more people publishing and consuming data in both the governmental and private sectors (to name just two). Indeed, there was already a substantial literature on data quality that arose from within the management information systems (MIS) community, driven by concerns about the reliability and quality of mission-critical data used and traded by businesses. In today’s wider world, where work with library metadata will be strongly informed by the Linked Open Data techniques developed for a diverse array of data publishers, we need to take a broader view.

Finally, we were driven then, as we are now, by managerial and operational concerns. As practitioners, we were well aware that metadata carries costs, and that human judgment is expensive. We were looking for a set of indicators that would spark and sustain discussion about costs and tradeoffs. At that time, we were mostly worried that libraries were not giving costs enough attention, and were designing metadata projects that were unrealistic given the level of detail or human intervention they required. That is still true. The world of Linked Data requires well-understood metadata policies and operational practices simply so that publishers can know what is expected of them and consumers can know what they are getting. Those policies and practices in turn rely on quality measures that producers and consumers of metadata can understand and agree on. In today’s world — one in which institutional resources are shrinking rather than expanding — human intervention in the metadata quality assessment process at any level more granular than that of the entire data collection being offered will become the exception rather than the rule.
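As a rough illustration of what collection-level, largely automated assessment might look like, here is a small sketch (ours, not the chapter's) that computes a simple completeness score over a batch of records; the required element set and the sample records are assumptions made only for the example.

```python
# Minimal sketch: an automated completeness measure over a batch of metadata
# records. The required elements and the sample records are illustrative.
REQUIRED = {"title", "creator", "date", "identifier"}

records = [
    {"title": "Case A", "creator": "Court X", "date": "2012-05-01", "identifier": "a1"},
    {"title": "Case B", "date": "2012-06-12"},                      # missing creator, identifier
    {"title": "Case C", "creator": "Court Y", "identifier": "c3"},  # missing date
]

def completeness(record, required=REQUIRED):
    """Return the fraction of required elements present and non-empty."""
    present = sum(1 for field in required if record.get(field))
    return present / len(required)

scores = [completeness(r) for r in records]
print("Per-record completeness:", scores)
print("Collection-level completeness:", sum(scores) / len(scores))
```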

While the methods we suggested at the time were self-consciously domain-independent, they did rest on background assumptions about the nature of the services involved and the means by which they were delivered. Our experience had been with data aggregated by communities where the data producers and consumers were to some extent known to one another, using a fairly simple technology that was easy to run and maintain. In 2013, that is not the case; producers and consumers are increasingly remote from each other, and the technologies used are both more complex and less mature, though that is changing rapidly.

The remainder of this blog post is an attempt to reconsider our framework in that context.

## The New World

\"\"The Linked Open Data (LOD) community has begun to consider quality issues; there are some noteworthy online discussions<\/a>, as well as workshops<\/a> resulting in a number of published papers and online resources<\/a>. \u00a0It is interesting to see where the work that has come from within the LOD community contrasts with the thinking of the library community on such matters, and where it does not. \u00a0<\/span><\/p>\n

In general, the material we have seen leans toward the traditional data-quality concerns of the MIS community. LOD practitioners seem to have started out by putting far more emphasis than we might on criteria that are essentially audience-dependent, and on operational concerns having to do with the reliability of publishing and consumption apparatus. As the discussion has evolved, it has featured an intellectual move away from those audience-dependent criteria, which are usually expressed as “fitness for use”, “relevance”, or something of the sort (we ourselves used the phrase “community expectations”). Instead, most realize that both audience and usage are likely to be (at best) partially unknown to the publisher, at least at system design time. In other words, the larger community has begun to grapple with something librarians have known for a while: future uses and the extent of dissemination are impossible to predict. There is a creative tension here that is not likely to go away. On the one hand, data developed for a particular community is likely to be much more useful to that community; thus our initial recognition of the role of “community expectations”. On the other, dissemination of the data may reach far past the boundaries of the community that develops and publishes it. The hope is that this tension can be resolved by integrating large data pools from diverse sources, or by taking other approaches that result in data models sufficiently large and diverse that “community expectations” can be implemented, essentially, by filtering.
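One way to read “community expectations by filtering” is as a query over a large aggregated pool that keeps only the resources tied to a particular community's vocabulary. The sketch below is our own illustration rather than something drawn from the sources discussed here: it filters a tiny inline dataset with rdflib and SPARQL, keeping resources whose subject belongs to one community's SKOS concept scheme; all URIs and data are invented.

```python
# Minimal sketch: "community expectations by filtering" over an aggregated
# pool. The inline data and the concept-scheme URI are illustrative.
from rdflib import Graph
from rdflib.namespace import DCTERMS, SKOS

pool = Graph()
pool.parse(data="""
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/> .

ex:case1  dcterms:title "Case 1"  ; dcterms:subject ex:contracts .
ex:paper1 dcterms:title "Paper 1" ; dcterms:subject ex:astronomy .
ex:contracts skos:inScheme ex:us-caselaw .
ex:astronomy skos:inScheme ex:science .
""", format="turtle")

# Keep only resources whose subject falls within the community's scheme
query = """
SELECT ?resource ?title WHERE {
    ?resource dcterms:subject ?concept ;
              dcterms:title ?title .
    ?concept skos:inScheme <http://example.org/us-caselaw> .
}
"""
for row in pool.query(query, initNs={"dcterms": DCTERMS, "skos": SKOS}):
    print(row.resource, row.title)
```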

For the LOD community, the path that began with “fitness-for-use” criteria led quickly to the idea of maintaining a “neutral perspective”. Christian Fürber describes that perspective as the idea that “Data quality is the degree to which data meets quality requirements no matter who is making the requirements”. To librarians, who have long since given up on the idea of cataloger objectivity, a phrase like “neutral perspective” may seem naive. But it is a step forward in dealing with data whose dissemination and user community are unknown. And it is important to remember that the larger LOD community is concerned with quality in data publishing in general, and not solely with descriptive metadata, for which objectivity may no longer be of much value. For that reason, it would be natural to expect the larger community to place greater weight on objectivity in its quality criteria than the library community feels that it can, with a strong preference for quantitative assessment wherever possible. Librarians and others concerned with data that involves human judgment are theoretically more likely to be concerned with issues of provenance, particularly as they concern who has created and handled the data. And indeed that is the case.
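For concreteness, here is a minimal sketch of how “who has created and handled the data” might be recorded in machine-readable form, using PROV-O terms with rdflib; the record, source, and cataloger URIs are illustrative assumptions, not part of the original framework.

```python
# Minimal sketch: provenance statements about a metadata record using PROV-O.
# All URIs and agents below are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)
record = URIRef("http://example.org/metadata/record-42")
source = URIRef("http://example.org/metadata/record-42-source")
cataloger = URIRef("http://example.org/staff/cataloger-7")

g.add((record, RDF.type, PROV.Entity))
g.add((record, PROV.wasDerivedFrom, source))      # what it was transformed from
g.add((record, PROV.wasAttributedTo, cataloger))  # who created or handled it
g.add((record, PROV.generatedAtTime,
       Literal("2013-01-24T11:23:04", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```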

## The new quality criteria, and how they stack up

Here is a simplified comparison of our 2004 criteria with three views taken from the LOD community.
| Bruce & Hillmann | Dodds, McDonald | Flemming |
| --- | --- | --- |
| Completeness | Completeness; Boundedness; Typing | Amount of data |
| Provenance | History; Attribution; Authoritative | Verifiability |
| Accuracy | Accuracy; Typing | Validity of documents |
| Conformance to expectations | Modeling correctness; Modeling granularity; Isomorphism | Uniformity |
| Logical consistency and coherence | Directionality; Modeling correctness; Internal consistency; Referential correspondence; Connectedness | Consistency |
| Timeliness | Currency | Timeliness |
| Accessibility | Intelligibility; Licensing; Sustainable | Comprehensibility; Versatility; Licensing |
|  |  | Accessibility (technical); Performance (technical) |

Placing the “new” criteria into our framework was no great challenge; it appears that we were, and are, talking about many of the same things. A few explanatory remarks: