Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 1 – Module 4: Data quality

Institute of Museum and Library Services – Drexel University College of Computing & Informatics

[The Open-World Assumption can have a dramatic impact.]
©2016 The Examiner

Achilles heel of Linked Data

[Book overview.]

Relevant but complex topic

How can we define data quality?

The totality of features and characteristics
of a product, process or service that bear
on its ability to satisfy stated or implied needs.

ISO 9000:2005

Commonly referred to in the literature as the "fitness for purpose" definition

Differentiate between deterministic and empirical data

[Impact of time.]
©2009 Marc Staub
[Crushed fire truck]
©2014 Menelik Simpson

We can make use of hermeneutics as a tool to make sense of quality

The process of understanding that moves from the parts of a whole to a global understanding of the whole, and back to the individual parts, in an iterative manner.

Klein et al., 1999

Isabelle Boydens reused Fernand Braudel’s stratified time concept as a hermeneutical approach to audit the quality of social security databases.

[Hermeneutics]

Implications of living in an Empirical World

The work of Boydens demonstrates that we cannot assume a direct correspondence between the empirical, ever-changing world and the metadata and database schemas representing it.

Defining data quality in a deterministic manner (e.g., the MIT Total Data Quality Management program) makes no sense for empirical application domains.

Stratified time applied to social security databases by Boydens

[Paper Isabelle Boydens and Seth van Hooland]

From theory to practice…

Getting to grips with data profiling

The use of analytical techniques to discover the true structure, content and quality of a collection of data.

Olson, 2003

The next module will explain how a tool such as OpenRefine can help you to spot data quality issues.

[Police]
©2009 Wikipedia
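
Even before turning to a dedicated tool, a few lines of scripting already give a first impression of a data set. The sketch below is a minimal profiling pass in Python, assuming a hypothetical CSV export called records.csv: for every column it counts empty and distinct values and reports the minimum and maximum entry length.

import csv
from collections import Counter, defaultdict

# Hypothetical input file; replace with the path to your own export.
FILENAME = "records.csv"

empty = Counter()             # number of empty values per column
distinct = defaultdict(set)   # distinct values per column
lengths = defaultdict(list)   # entry lengths per column

with open(FILENAME, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for column, value in row.items():
            value = (value or "").strip()
            if not value:
                empty[column] += 1
            else:
                distinct[column].add(value)
                lengths[column].append(len(value))

for column in sorted(distinct.keys() | empty.keys()):
    sizes = lengths[column]
    print(f"{column}: {empty[column]} empty, {len(distinct[column])} distinct, "
          f"length {min(sizes, default=0)}-{max(sizes, default=0)}")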

Recipe for a data quality audit

Flattening data

Columns as a starting point

Interpretation of the title of a field

Issues with empty columns

Data types

Biggest metadata horror: dates!

Dates can be expressed in an incredible range of ways. Here are just a few examples:

Pattern       Example
empty
9999-9999     1891-1912
9999          1912
99-99/9999    09-10/1912
99/9999       01/1912
99/99/99      04/08/12
AAA 9999      May 1912
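
A common profiling trick is to reduce every value to such a pattern, replacing digits with 9 and letters with A, and then to count how often each pattern occurs. Below is a minimal sketch in Python; the sample values are invented.

import re
from collections import Counter

def to_pattern(value):
    # Replace digits with 9 and letters with A, keeping punctuation as-is.
    pattern = re.sub(r"\d", "9", value.strip())
    return re.sub(r"[A-Za-z]", "A", pattern) or "empty"

# Invented sample of date values as they might occur in a catalogue.
dates = ["1891-1912", "1912", "09-10/1912", "May 1912", "04/08/12", ""]

print(Counter(to_pattern(d) for d in dates))
# e.g. Counter({'9999-9999': 1, '9999': 1, '99-99/9999': 1, 'AAA 9999': 1, ...})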

Length of entries

Empty fields

[Malevich painting]
©2014 Micha Theiner
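
Empty fields do not always announce themselves as such: an absent value may hide behind an empty string, whitespace or a placeholder such as "unknown" or "n/a". The sketch below flags these cases; the list of placeholders is an assumption and should be adapted to your own data.

# Strings that often stand in for an absent value; adapt this set to your data.
PLACEHOLDERS = {"", "null", "none", "n/a", "na", "unknown", "-", "?"}

def is_absent(value):
    # Treat None, empty strings, whitespace and common placeholders as absent.
    return value is None or value.strip().lower() in PLACEHOLDERS

values = ["1912", "  ", "unknown", "N/A", None]
print([is_absent(v) for v in values])   # [False, True, True, True, True]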

Field overloading

Encoding in one field of values that should be split out over multiple fields.

This impacts search and retrieval, but also limits data analysis and cleaning.
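
A typical case of field overloading is a single creator field that packs together a name and a role, e.g. "Malevich, Kazimir (painter)". The field layout, separator and regular expression below are assumptions made for the sake of illustration; the sketch splits such values into separate parts and leaves values it cannot parse untouched.

import re

# Hypothetical overloaded creator values of the form "surname, first name (role)".
creators = ["Malevich, Kazimir (painter)", "Staub, Marc"]

def split_creator(value):
    # Split an overloaded creator string into surname, first name and optional role.
    match = re.match(r"^([^,]+),\s*([^(]+?)\s*(?:\((.+)\))?$", value.strip())
    if not match:
        return value, None, None   # leave unsplittable values untouched
    surname, first_name, role = match.groups()
    return surname.strip(), first_name.strip(), role

for creator in creators:
    print(split_creator(creator))
# ('Malevich', 'Kazimir', 'painter')
# ('Staub', 'Marc', None)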

Self-assessment 1: relevance

Why is data quality so relevant in a Linked Data context?

  1. The closed-world assumption of the Linked Data paradigm limits the amount of data available.
    • No. Linked Data is based on the open-world assumption, implying that no one knows at any given moment exactly what data are available and what constraints they respect.
  2. Linked Data holds the potential danger of introducing erroneous and conflicting data.
    • Yes. Without specific efforts to clean original data sources, and without standardised methods and tools to evaluate and compare data sets published as Linked Data, the issue of data quality might seriously undermine the potential of Linked Data for libraries.
  3. The introduction of Linked Data will boost the quality of library catalogs.
    • It depends! Using data from very diverse and heterogeneous sources might seriously undermine the quality of catalogs.

Self-assessment 2: deterministic data

Why is it important to distinguish deterministic from empirical data when talking about metadata quality?

  1. Contrary to deterministic data, there exist no formal theories to validate empirical data.
    • Yes! For deterministic data there are fixed theories that no longer evolve, as is the case with algebra: 1 + 1 will always equal 2.
  2. There are more issues with deterministic data.
    • No, irrelevant answer.
  3. Because empirical data cannot be cleaned.
    • No. The fact that we cannot establish a direct correspondence between the observable world and the data does not mean that errors cannot be identified and rectified.

Self-assessment 3: field overloading

What is field overloading and why is it problematic?

  1. The issue arises when you exceed the number of characters that may be encoded in a field.
    • No! The length of an entry can definitely be an interesting data quality indicator, but field overloading is not linked to the length of an entry.
  2. This issue mainly arises when you transfer data from a flat file to a database.
    • No, it tends to be the other way around. Moving from a well-structured database, with clear definitions of fields, to a flat file might result in packing together related but different fields (e.g. first name and family name).
  3. Field overloading occurs when related data are put together in the same field.
    • Yes, this limits the possibilities to clearly define encoding constraints and to offer structured search.

Self-assessment 4: absent values

Why is it important to think about how we communicate about absent values?

  1. In order to save space.
    • No, this is not a relevant answer.
  2. In order to avoid them at all times.
    • No! Both for conceptual and operational reasons, it is impossible to avoid empty fields. The important aspect is to document the reason behind the absence of a value.
  3. An empty field can be there for a large variety of reasons. Knowing the reason can be important in order to know how to interpret the absence.
    • Yes! A value might not be known or not applicable, or there simply might not be enough resources to fill it in.