Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 2 – Module 1: Vocabulary reconciliation

Institute of Museum and Library Services Drexel University College of Computing & Informatics

Linked Data and Controlled Vocabularies

[LCSH screenshot]
©2017 LCSH
[Black hole]
©2017 NASA

Role of Controlled Vocabularies

Come in different flavours

[Library]
©2015 Michael D. Beckwith

Classification schemes

Propose systematically arranged classes so that documents on the same topic are grouped together

Classes are represented by notations; captions provide a human-readable description of the scope of each class

The Dewey Decimal Classification (DDC) is the best-known example

Example: Dewey Decimal Classification

We can systematically drill down the DDC in order to locate the class for Cubism:

7 Art and recreation
70 Arts
709 History, geographic treatment, biography
709.04 20th Century, 1900–1999
709.0403 Cubism and futurism
709.04032 Cubism
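
The drill-down works because each broader class in this listing has a notation that is a prefix of the narrower one. A minimal Python sketch of that idea, using only the six classes listed above (`DDC_CLASSES` and `drill_down` are illustrative names, not part of any DDC tool):

```python
# Classes copied from the listing above; a real DDC edition contains far more.
DDC_CLASSES = {
    "7": "Art and recreation",
    "70": "Arts",
    "709": "History, geographic treatment, biography",
    "709.04": "20th Century, 1900–1999",
    "709.0403": "Cubism and futurism",
    "709.04032": "Cubism",
}


def drill_down(notation, classes=DDC_CLASSES):
    """Return (notation, caption) pairs from the broadest class down to `notation`."""
    path = [
        (known, caption)
        for known, caption in classes.items()
        if notation.startswith(known)
    ]
    # Shorter notations denote broader classes, so sort by length.
    return sorted(path, key=lambda pair: len(pair[0]))


for known, caption in drill_down("709.04032"):
    print(known, caption)
```

Matching on string prefixes is enough for this small listing; it is a simplification and not a full model of how DDC numbers are built.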

Subject headings

Example: Library of Congress Subject Headings

Cubism (May Subd Geog)
BT Aesthetics
BT Art
BT Art, Modern - 20th Century
BT Modernism (Art)
BT Painting
RT Post-impressionism (Art)
NT Decoration and ornament - Cubism
NT Drawing, Cubist
NT Purism (Art)
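
The two-letter codes in this entry are the usual relationship abbreviations: BT (broader term), NT (narrower term) and RT (related term). As a rough sketch, the entry can be grouped by the SKOS property each code corresponds to. The `parse_entry` helper is hypothetical, and the mapping to `skos:broader`, `skos:narrower` and `skos:related` anticipates the SKOS properties introduced later in this module.

```python
# Mapping from the relationship codes to SKOS semantic properties.
CODE_TO_SKOS = {
    "BT": "skos:broader",
    "NT": "skos:narrower",
    "RT": "skos:related",
}

# The related headings of the Cubism entry shown above.
ENTRY = """\
BT Aesthetics
BT Art
BT Art, Modern - 20th Century
BT Modernism (Art)
BT Painting
RT Post-impressionism (Art)
NT Decoration and ornament - Cubism
NT Drawing, Cubist
NT Purism (Art)"""


def parse_entry(text):
    """Group the headings of an entry by SKOS property."""
    relations = {prop: [] for prop in CODE_TO_SKOS.values()}
    for line in text.splitlines():
        # The first two characters are the code; the rest is the heading.
        code, heading = line[:2], line[2:].strip()
        if code in CODE_TO_SKOS:
            relations[CODE_TO_SKOS[code]].append(heading)
    return relations


for prop, headings in parse_entry(ENTRY).items():
    print(prop, headings)
```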

Thesauri

A thesaurus represents all the concepts of a specific domain in a consistent manner and labels each concept with a preferred term, as well as synonyms. Hierarchical relationships between concepts are expressed explicitly.

It is the only type of controlled vocabulary for which a formal standard (ISO 25964) exists, which makes it possible to verify compliance. Examples include the Art & Architecture Thesaurus (AAT) and the Education Resources Information Center (ERIC) thesaurus.

[Draft horses]
©2010 Bernard Spragg

Example: Art & Architecture Thesaurus

Cubist

Drawbacks of Controlled Vocabularies

Despite their obvious advantages, controlled vocabularies also present a number of problems:

cost: expensive to develop and maintain
complexity: end-users have trouble using them
slow evolution: updating takes time
subjectivity: they always express a certain worldview

Yet a little semantics goes a long way…

Difficulties in implementing the full-blown Semantic Web, and the move to the more pragmatic Linked Data approach, stirred new interest in controlled vocabularies.

Inferencing capabilities are limited, but awareness grew that existing vocabularies can be reused to establish connections between data.

This evolution spurred interest in the development of SKOS.

Simple Knowledge Organization System (SKOS)

SKOS properties

Concepts are expressed through concrete terms (= labels) and may have three kinds of properties:

  1. labeling properties
    • skos:prefLabel
    • skos:altLabel
  2. semantic properties
    • skos:narrower
    • skos:broader
    • skos:related
  3. documentation properties
    • skos:definition
    • skos:scopeNote
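
To make the three groups concrete, here is a minimal Python sketch that writes one concept as a Turtle statement using a property from each group. The `concept_to_turtle` function, the `art:` identifiers, the alternative label and the definition text are all invented for illustration; only the `skos:` property names come from the list above.

```python
def concept_to_turtle(uri, pref_label, alt_labels=(), broader=(),
                      definition=None, lang="en"):
    """Serialize one SKOS concept as a Turtle statement (illustrative only).

    No escaping is done, so labels must not contain double quotes.
    """
    lines = [f"{uri} a skos:Concept"]
    # 1. Labeling properties
    lines.append(f'  skos:prefLabel "{pref_label}"@{lang}')
    lines += [f'  skos:altLabel "{label}"@{lang}' for label in alt_labels]
    # 2. Semantic properties
    lines += [f"  skos:broader {parent}" for parent in broader]
    # 3. Documentation properties
    if definition:
        lines.append(f'  skos:definition "{definition}"@{lang}')
    # Predicates of the same subject are separated by ";" and closed by "."
    return ";\n".join(lines) + "."


turtle = concept_to_turtle(
    "art:Cubism",                      # hypothetical identifier
    "Cubism",
    alt_labels=["Cubist movement"],    # hypothetical synonym
    broader=["art:ModernArt"],
    definition="Placeholder definition for this example.",
)
print(turtle)
```

The output is only valid Turtle once the `art:` and `skos:` prefixes are declared at the top of the file.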

SKOS example from LCSH

@prefix : <http://id.loc.gov/authorities/subjects/>.
@prefix ch: <http://purl.org/vocab/changeset/schema#>.
@prefix org: <http://id.loc.gov/vocabulary/organizations/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

:sh85034652 a skos:Concept;
  skos:inScheme <http://id.loc.gov/authorities/subjects>;
  skos:prefLabel "Cubism"@en;
  skos:broader :sh85001441, :sh85007461, :sh85007805,
               :sh85086445, :sh85096661;
  skos:narrower :sh85036235, :sh85039437, :sh85109192;
  skos:related :sh2001008665, :sh85105416;
  skos:closeMatch <http://d-nb.info/gnd/4165855-3>;
  skos:exactMatch <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119361753>;
  skos:changeNote [
       a ch:ChangeSet;
       ch:changeReason "new"^^xsd:string;
       ch:createdDate "2001-06-22T00:00:00"^^xsd:dateTime;
       ch:creatorName org:dl;
       ch:subjectOfChange :sh85034652
     ],
     [
       a ch:ChangeSet;
       ch:changeReason "revised"^^xsd:string;
       ch:createdDate "2001-07-19T13:07:56"^^xsd:dateTime;
       ch:creatorName org:dlc;
       ch:subjectOfChange :sh85034652
    ].

Let’s create our own thesaurus in SKOS!

Coding our thesaurus in a text editor

@prefix art:<http://example.org/art/#>.
@prefix skos:<http://www.w3.org/2004/02/skos/core#>.
@prefix dc:<http://purl.org/dc/terms/>.

We need to add a title and a top concept

For convenience, stick to ASCII characters in identifiers and avoid whitespace or any special characters:

art:ModernArtPeriodsThesaurus a skos:ConceptScheme;
  dc:title "Modern art periods thesaurus"@en;
  skos:hasTopConcept art:ModernArt.

Adding concepts

Adding the top concept:

art:ModernArt a skos:Concept;
  skos:prefLabel "Modern art"@en;
  skos:inScheme art:ModernArtPeriodsThesaurus.

Adding multilingual labels and alternative names

art:ModernArt a skos:Concept;
  skos:prefLabel "Modern art"@en;
  skos:prefLabel "Art moderne"@fr;
  skos:prefLabel "Moderne Kunst"@de;
  skos:inScheme art:ModernArtPeriodsThesaurus.

art:ModernArt skos:altLabel "Modern art (1860-1945)"@en.

Now we can add some supplementary genres and subgenres

art:HeidelbergSchool a skos:Concept;
  skos:prefLabel "Heidelberg School"@en;
  skos:broader art:Impressionism;
  skos:inScheme art:ModernArtPeriodsThesaurus.

art:DieBrucke a skos:Concept;
  skos:prefLabel "Die Brücke"@en;
  skos:broader art:Expressionism;
  skos:inScheme art:ModernArtPeriodsThesaurus.
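
When writing a thesaurus by hand, it helps to check that every `skos:broader` target is itself declared as a concept. The sketch below models the concepts shown so far as a plain Python dictionary. That is an assumption made for brevity: a real check would parse the Turtle file with an RDF library. The `undeclared_broader` function is a hypothetical helper.

```python
# concept -> list of its skos:broader targets, as in the snippets above
THESAURUS = {
    "art:ModernArt": [],
    "art:HeidelbergSchool": ["art:Impressionism"],
    "art:DieBrucke": ["art:Expressionism"],
}


def undeclared_broader(thesaurus):
    """Return broader targets that never appear as a declared concept."""
    declared = set(thesaurus)
    referenced = {parent for parents in thesaurus.values() for parent in parents}
    return sorted(referenced - declared)


print(undeclared_broader(THESAURUS))
```

With only the snippets above, the two parent concepts are reported, because their declarations are not shown here; presumably they are defined in the full mini-thesaurus.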

Play on your own!

Download our mini-thesaurus in SKOS/Turtle and add some extra concepts!
Keep the file, you’ll need it later on.

Validation and visualization tools:

[Connections]
©2012 deargdoom57

Install the RDF extension

Download the extension and place it in the extensions folder (accessible by clicking on Browse workspace directory from the home page of OpenRefine).

A new RDF button appears after the installation.

Detailed instructions can be found at http://refine.deri.ie/installationDocs.

Creating a project with a dataset

You can either create your own metadata, or download our test file on modern art.

After importing the file, OpenRefine’s preview will show headers such as Title, Artist, Year etc.

Some accented characters will appear corrupted, so don’t forget to set the encoding to UTF-8 by clicking on the Encoding field in Preview mode.

[OpenRefine interface]
©2017 OpenRefine
[OpenRefine interface]
©2017 OpenRefine

Reconciling the Powerhouse Museum metadata with the LCSH

Now that we have played with our own mini-thesaurus and some dummy data, let’s work on a real-life case!

We have used the metadata set of the museum extensively for various cleaning and linking operations.

For this exercise, either download the cleaned OpenRefine project or create a new project using the cleaned metadata.

[Powerhouse Museum]
©2017 Powerhouse Museum

Issues with LCSH we first need to solve

OpenRefine cannot choose when a label matches several headings. If one concept uses a label as its preferred term and another concept uses the same label as a non-preferred term, OpenRefine has no way to decide between them.

For example, skating is an alternative label of the term with the preferred label Ice skating (sj96005713), but a separate term with the preferred label Skating (sj85123105) also exists!
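
This kind of clash can be detected mechanically before reconciling. The following Python sketch flags labels that are the preferred label of one concept and an alternative label of another. The two-entry `CONCEPTS` dictionary only restates the skating example, the `ambiguous_labels` function is hypothetical, and comparing labels case-insensitively is an assumption.

```python
from collections import defaultdict

# Identifiers and labels as given in the example above.
CONCEPTS = {
    "sj96005713": {"pref": "Ice skating", "alt": ["Skating"]},
    "sj85123105": {"pref": "Skating", "alt": []},
}


def ambiguous_labels(concepts):
    """Map each clashing label to the concepts using it as pref or alt label."""
    pref_users = defaultdict(set)
    alt_users = defaultdict(set)
    for concept_id, labels in concepts.items():
        # Assumption: labels are compared case-insensitively.
        pref_users[labels["pref"].lower()].add(concept_id)
        for alt in labels["alt"]:
            alt_users[alt.lower()].add(concept_id)
    # A label is ambiguous when it occurs in both roles.
    return {
        label: {"pref": sorted(pref_users[label]), "alt": sorted(alt_users[label])}
        for label in pref_users.keys() & alt_users.keys()
    }


print(ambiguous_labels(CONCEPTS))
```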

[Powerhouse Museum]
©2017 Powerhouse Museum

Preprocessing of the LCSH

Some changes were made in our version of the LCSH:

Configuring LCSH as reconciliation source

Click the RDF button, select Add reconciliation service, Based on SPARQL endpoint.

Set its parameters as follows:

Name: LCSH (preprocessed)
Endpoint URL: http://sparql.freeyourmetadata.org/
Graph URI: http://sparql.freeyourmetadata.org/authorities-processed/
Type: Virtuoso
Label properties: check only skos:prefLabel

Now let’s reconcile!

Start the reconciliation process for the Categories column with this new endpoint (Reconcile, Start reconciling, LCSH (preprocessed), Start Reconciling).

Important: Experiment first with a very small subset of the records, fewer than 100. The process takes a lot of time, and if you launch it on the entire data set, it will take at least a day on your laptop. For example, create a filter on Record ID with 123, so that you have results after a couple of seconds.

[JASIST]
©2013 Journal of the American Society for Information Science and Technology (JASIST)

Impact of reconciliation

Unfortunately, creating a link between your catalog record and an entry of the LCSH does not automatically connect you with all other records that link to that heading!

Always keep in mind that links are unidirectional: you point to the LCSH, but the LoC is unaware that you point to them.

I wanted the act of adding a link to be trivial. So long as I didn’t introduce some central link database, everything would scale nicely.

Tim Berners-Lee, Weaving the Web, 1999

The unidirectionality of links was an explicit design choice. Asking the linked entity to confirm the link would create too much of a bottleneck for the Web to grow. Imagine someone at LoC whose job it is to check each link created to the LCSH… However, alternatives exist, such as Xanadu by Ted Nelson.
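
In data terms, each institution only stores its own outgoing links. Finding every record that points to a given heading therefore requires someone to collect those links and invert them, which amounts to the kind of central link database the quote above warns against. A small Python sketch with made-up record identifiers (the `RECORD_LINKS` data and the `incoming_links` function are hypothetical):

```python
from collections import defaultdict

LCSH = "http://id.loc.gov/authorities/subjects/"

# Outgoing links, each stored only by the catalog that created it.
# The record identifiers are invented for this example.
RECORD_LINKS = {
    "museumA/record/1": [LCSH + "sh85034652"],
    "libraryB/record/7": [LCSH + "sh85034652"],
    "libraryB/record/8": [LCSH + "sh85007461"],
}


def incoming_links(outgoing):
    """Invert record -> heading links into heading -> records."""
    incoming = defaultdict(list)
    for record, headings in outgoing.items():
        for heading in headings:
            incoming[heading].append(record)
    return dict(incoming)


# Only possible once the links of all catalogs are gathered in one place.
print(incoming_links(RECORD_LINKS)[LCSH + "sh85034652"])
```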

Self-assessment 1: thesauri

What key aspect distinguishes thesauri from other forms of controlled vocabularies?

  1. A formal standard exists to verify their well-formedness.
    • Yes, the ISO 25964 standard defines exactly how a thesaurus should be constructed.
  2. A thesaurus provides description at a more granular level.
    • No, this does not depend on the type of vocabulary.
  3. Thesauri can be represented in SKOS.
    • Thesauri can indeed be represented in SKOS, but so can other types of vocabularies, as illustrated by the LCSH.

Self-assessment 2: non-preferred terms

Why add non-preferred terms to a vocabulary?

  1. It reduces the negative effect of synonymy on search results.
    • Yes! Even if an end-user searches on a synonym that is encoded as a non-preferred term of the preferred term used for indexing, the results are the same.
  2. It reduces the negative effect of polysemy on search results.
    • No! Adding too many and potentially even irrelevant non-preferred terms will increase the negative impact of polysemy.
  3. You can increase the success rate of the reconciliation process.
    • Yes! That is, if you have configured the process to include the non-preferred terms.

Self-assessment 3: labels and concepts

How do labels and concepts relate to each other in a SKOS vocabulary?

  1. Labels allow defining the structure whereas concepts can express the specific terms used.
    • No! Completely wrong.
  2. Labels are used to express semantic relations between concepts.
    • No, semantic relations are expressed by using properties such as broader or narrower.
  3. Concepts are abstract units of thought; labels are strings of characters associated with a concept.
    • Yes!

Self-assessment 4: unidirectionality

Why is it important to acknowledge unidirectionality when creating links?

  1. It explains why we don’t need SPARQL.
    • No, it’s the opposite! SPARQL is exactly what allows us to traverse links in both directions across a graph.
  2. It helps understand why it isn’t straightforward to connect all records that point to a central authority file.
    • Exactly! Just because you link to the LCSH does not mean that the LCSH, or other people referring to the same heading, are made aware of your link.
  3. In order to prevent the creation of dead links.
    • No, but understanding unidirectionality helps us to realize why dead links are unavoidable.