Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 1 – Module 2: Understanding data models

Institute of Museum and Library Services Drexel University College of Computing & Informatics

[Overview of four data-models: Tabular Data, Relational Model, XML and RDF]
©2014 Seth van Hooland and Ruben Verborgh

Comparing data models

[You can think of a dish as the model and a recipe as a serialization format.]
©2015 Myan Brenn

Models and serialization formats

Tabular data
CSV, TSV
Relational model
binary files
Meta-markup languages
SGML, XML
RDF
Turtle, N-Triples, RDF-XML
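To make the distinction between a model and its serialization formats concrete, here is a minimal sketch in Python. It writes the same table once as CSV and once as TSV using the standard csv module. The names ROWS and serialize are made up for this illustration, and the sample record is borrowed from the catalog used later in this module.

```python
import csv
import io

# Hypothetical in-memory table: a header row followed by one record.
ROWS = [
    ["title", "creator", "date", "collection"],
    ["Puppy", "Koons, Jeff", "1992", "Guggenheim"],
]


def serialize(rows, delimiter):
    """Write the same rows to a string using the given field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = serialize(ROWS, ",")   # comma-separated values
tsv_text = serialize(ROWS, "\t")  # tab-separated values

print(csv_text)
print(tsv_text)
```

The model (rows and columns) is identical in both cases; only the way it is written down changes. Note that the creator field needs quotes in CSV because it contains a comma, but not in TSV.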

First model: tabular data

Tabular data—example

Title            Creator        Date  Collection
Guernica         Pablo Picasso  1937  Museo Reina Sofia
First Communion  Picasso        1895  Museo Picasso
Puppy            Koons, Jeff    1992  Guggenheim

Serializing tabular data

Common serialization formats for tabular data include CSV and TSV

Example of our data in CSV:

title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
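As a rough sketch of what working with this CSV looks like, the following Python snippet reads the example with the standard csv module and lists the distinct spellings found in the creator column. The names CSV_TEXT and read_records are invented for the illustration.

```python
import csv
import io

# The CSV example from the slide, copied verbatim.
CSV_TEXT = """title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
"""


def read_records(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(CSV_TEXT)

# Distinct creator spellings: nothing in the format enforces consistency.
creators = sorted({record["creator"] for record in records})

# An exact-match search misses the work catalogued under "Picasso".
exact_matches = [r["title"] for r in records if r["creator"] == "Pablo Picasso"]

print(creators)
print(exact_matches)
```

The same artist appears under two different spellings, so a simple search for "Pablo Picasso" finds only one of his two works. This is the kind of inconsistency that a flat file cannot prevent.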

[Happy?]
©2017 Alex Pozhydaev

Limits and possibilities of tabular data?

How do we overcome the inconsistencies and poor search within tabular data?

Second model: relational model

[Rethinking tabular data as a relational database.]
©2014 Seth van Hooland and Ruben Verborgh

Mapping reality to a database

Worldview consisting of three building blocks:

Entities
groups of information that share properties and vary independently of other groups
Attributes
describe properties of entities
Relationships
create connections between entities

Designing a relational database

  1. Identify the entity types (e.g. Work, Creator)
  2. For every entity type, create a table that will contain the properties of the entity
  3. Establish the relationships between the tables
It is sometimes more an art than a science!

Redesigning our catalog

©2014 Seth van Hooland and Ruben Verborgh

Implementation

SQL

Creating a table with SQL

CREATE TABLE Work (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(100),
  creator INT,
  collection INT,
  year CHAR(4),
  style VARCHAR(40)
);

Encoding data into a table with SQL

INSERT INTO Work (title, creator, collection, year, style)
VALUES ('Guernica', 43, 20, '1937', 'Cubism');


Note that the values 43 and 20 refer to the primary keys from the tables Creator and Collection

Search and retrieval

Example: find all the titles of the works by Picasso

SELECT title FROM Work WHERE creator = 43;
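To try this out without installing a database server, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the setup used on the slides: SQLite uses INTEGER PRIMARY KEY instead of AUTO_INCREMENT, the tables are reduced to a few columns, and the function name titles_by is made up for the example. It also looks up works by the creator's name through a join, instead of by the key 43.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Creator (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Work (
  id INTEGER PRIMARY KEY,
  title TEXT,
  creator INTEGER REFERENCES Creator(id),
  year TEXT
);
INSERT INTO Creator (id, name) VALUES (43, 'Pablo Picasso');
INSERT INTO Work (title, creator, year) VALUES ('Guernica', 43, '1937');
INSERT INTO Work (title, creator, year) VALUES ('First Communion', 43, '1895');
""")


def titles_by(connection, creator_name):
    """Return the titles of all works by the named creator, sorted by title."""
    rows = connection.execute(
        "SELECT Work.title FROM Work "
        "JOIN Creator ON Work.creator = Creator.id "
        "WHERE Creator.name = ? ORDER BY Work.title",
        (creator_name,),
    )
    return [title for (title,) in rows]


picasso_titles = titles_by(conn, "Pablo Picasso")
print(picasso_titles)
```

Because the creator's name is stored only once, in the Creator table, both works are found regardless of how the name would have been typed in a flat file.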

Practice on your own!

Dealing with change

[Ducktaping]
©2013 David Helbich

Sharing your Data

Both the data and the schema are locked up in a binary file, coupled to specific software: you can’t just copy and paste a database and give it to someone else!

Leaving aside the technical issues of migrating and integrating databases, the main complexity resides in the semantics: a database schema is rarely well documented in practice, which makes it very hard to understand how what are sometimes hundreds of tables are connected.

[Island]
©2010 Mohamed Majid
[Tim Berners-Lee at his desk in CERN, 1994]
©1994 CERN

Third model: markup

The origins of markup lie in the tradition of typesetting, where an author marks up a manuscript to indicate how it should be printed

Certain passages, such as the titles of chapters, should be printed in bold, whereas footnotes should be printed in a smaller font than the normal paragraphs

Two options: either you apply makeup or markup

Applying makeup

Applying markup

[Pharma industry]
©2007 Wikipedia
[Russian puppets to illustrate mark-up]
©2014 Rainer Stropek

HTML detour

[SGML and HTML relationship]
©2004 Eric Clapton
[Browser war]
©1997 Carlos Avila Gonzalvez

XML

Modeling XML

Let’s add a work to our XML catalog of art objects

<?xml version="1.0" encoding="UTF-8"?>
<Art title="Modern art">
  <Work title="Guernica" year="1937" creator="Pablo Picasso"
        collection="Reina Sofia" location="Madrid"/>
</Art>

Let’s model everything as child elements

<Art title="Modern art">
  <Work>
    <Title>Guernica</Title>
    <CreationDate><Year>1937</Year></CreationDate>
    <Creator>
      <FirstName>Pablo</FirstName>
      <LastName>Picasso</LastName>
    </Creator>
    <Collection>
      <Name>Reina Sofia</Name>
      <Location>Madrid</Location>
    </Collection>
  </Work>
</Art>

Let’s go for a compromise!

<Art title="Modern art">
  <Work title="Guernica" year="1937">
    <Creator firstName="Pablo" lastName="Picasso"/>
    <Collection name="Reina Sofia" location="Madrid"/>
  </Work>
</Art>
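As an illustration of how such a document can be read by a program, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The names CATALOG and describe_works are invented for the example, and the root element is written with an open tag so that the document is well-formed.

```python
import xml.etree.ElementTree as ET

# The "compromise" catalog: simple values as attributes, structure as elements.
# The root <Art> element is left open (not self-closed) so the XML is well-formed.
CATALOG = """<Art title="Modern art">
  <Work title="Guernica" year="1937">
    <Creator firstName="Pablo" lastName="Picasso"/>
    <Collection name="Reina Sofia" location="Madrid"/>
  </Work>
</Art>"""


def describe_works(xml_text):
    """Return one flat dictionary per <Work> element in the catalog."""
    root = ET.fromstring(xml_text)
    summaries = []
    for work in root.findall("Work"):
        creator = work.find("Creator")
        collection = work.find("Collection")
        summaries.append({
            "title": work.get("title"),
            "year": work.get("year"),
            "creator": creator.get("firstName") + " " + creator.get("lastName"),
            "collection": collection.get("name"),
            "location": collection.get("location"),
        })
    return summaries


works = describe_works(CATALOG)
print(works)
```

Splitting the creator into firstName and lastName is what allows the program to rebuild the full name in a predictable way, which a single free-text attribute would not guarantee.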

Schemas

<xsd:element name="Work">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="CreatorRef" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="title" type="xsd:string"/>
    <xsd:attribute name="year" type="xsd:gYear"/>
    <xsd:attribute name="collectionId" type="xsd:IDREF"/>
  </xsd:complexType>
</xsd:element>

Namespaces

<Art title="Modern art" xmlns:dc="http://purl.org/dc/terms/"
     xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <Work>
    <dc:creator>Pablo Picasso</dc:creator>
    <dc:title>Guernica</dc:title>
    <vra:technique>Oil painting</vra:technique>
  </Work>
</Art>
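To see what a namespace prefix actually stands for, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The names NS and RECORD are made up for the example. The parser replaces each prefix with the full namespace URI, which is why elements from different vocabularies can share a document without name clashes.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used for lookups; the URIs are those declared on <Art>.
NS = {
    "dc": "http://purl.org/dc/terms/",
    "vra": "http://www.vraweb.org/vracore4.htm",
}

RECORD = """<Art title="Modern art" xmlns:dc="http://purl.org/dc/terms/"
     xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <Work>
    <dc:creator>Pablo Picasso</dc:creator>
    <dc:title>Guernica</dc:title>
    <vra:technique>Oil painting</vra:technique>
  </Work>
</Art>"""

root = ET.fromstring(RECORD)
work = root.find("Work")

title = work.find("dc:title", NS).text
technique = work.find("vra:technique", NS).text

# Internally, the prefix is expanded to the full namespace URI.
expanded_tag = work.find("dc:title", NS).tag

print(title, technique, expanded_tag)
```

The prefix itself (dc, vra) is only a local abbreviation: two documents can use different prefixes for the same namespace URI and still mean the same element.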

Search and retrieval

When to use a database or XML?

The quick answer is: it depends on the context

Read this paper by a historian who explains the pros and cons of each approach to model an inventory

XML is often criticized for its verbose nature, and JSON has become more popular to represent structured data

Limitations?

[Map]
©1740 Seutter Map of India, Pakistan, Tibet and Afghanistan
[Map]
Public Domain Pictures

Fourth model: RDF

[Graph]
©2014 Seth van Hooland and Ruben Verborgh

Model

Serialization—different options

Turtle example

Let’s see how we can express, as metadata, that Jeff Koons created the artwork Puppy:

@prefix gh:<http://www.guggenheim.org/new-york/collections/collection-online/artwork/>.
@prefix dc:<http://purl.org/dc/terms/>.
@prefix viaf:<http://viaf.org/viaf/>.
gh:48 dc:creator viaf:5035739 .

Turtle syntax

Multiple statements about the same subject can be written tersely: use a semicolon when the subject is repeated, and a comma when both the subject and the predicate are repeated

gh:48 dc:creator viaf:5035739;
      dc:title "Puppy".
viaf:5035739 :influencedBy viaf:15873,
                           viaf:95794725.

Turtle syntax

RDF includes literal values ("Puppy") in its model as well, since some properties do not point to another resource but to a non-decomposable value. Also, note how :influencedBy has an empty namespace prefix, which indicates that it is locally defined:

gh:48 dc:creator viaf:5035739;
      dc:title "Puppy".
viaf:5035739 :influencedBy viaf:15873,
                           viaf:95794725.
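To show what the semicolon and comma shorthand stands for, here is a minimal sketch in plain Python that writes out the four underlying triples in full, in the style of N-Triples. It is not a Turtle parser: the triples are listed by hand, the names PREFIXES, expand and to_ntriples are invented for the example, and the URI bound to the empty prefix (http://example.org/) is a placeholder assumption, since the slides do not declare it.

```python
# Prefix declarations from the Turtle example.
PREFIXES = {
    "gh": "http://www.guggenheim.org/new-york/collections/collection-online/artwork/",
    "dc": "http://purl.org/dc/terms/",
    "viaf": "http://viaf.org/viaf/",
    "": "http://example.org/",  # assumption: placeholder for the empty prefix
}

# The abbreviated Turtle statements, written out as one triple each.
TRIPLES = [
    ("gh:48", "dc:creator", "viaf:5035739"),
    ("gh:48", "dc:title", '"Puppy"'),  # semicolon: same subject
    ("viaf:5035739", ":influencedBy", "viaf:15873"),
    ("viaf:5035739", ":influencedBy", "viaf:95794725"),  # comma: same subject and predicate
]


def expand(term):
    """Expand a prefixed name to a full IRI; leave quoted literals untouched."""
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return "<" + PREFIXES[prefix] + local + ">"


def to_ntriples(triples):
    """Render each triple as one fully expanded statement ending in ' .'."""
    return [" ".join(expand(term) for term in triple) + " ." for triple in triples]


ntriples = to_ntriples(TRIPLES)
for line in ntriples:
    print(line)
```

Two abbreviated Turtle statements thus correspond to four complete subject, predicate, object statements; the literal "Puppy" is the only value that is not expanded into a URI.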

Search and retrieval in RDF

Implementation

Limits of RDF?

[Table]
©2014 Seth van Hooland and Ruben Verborgh
[Hammer]
Public Domain Pictures

Self-assessment 1: tabular data

Creating metadata as tabular data is a bad idea:

  1. If you want to share your metadata.
    • No, tabular data is platform-independent and very easy to exchange
  2. If you want to avoid spelling mistakes.
    • Yes, flat files do not offer the possibility to ensure consistency when encoding metadata
  3. If you want to express hierarchy in your metadata.
    • Yes, tabular data does not offer a way to express hierarchy. Relational databases or XML are a better fit in that case

Self-assessment 2: entities or attributes

How do you decide whether to represent something as an entity or as an attribute within a database schema?

  1. Opt for an attribute if it is an important aspect of the reality you are modeling.
    • No! In that case an entity would be a better choice as you can further document the entity with the help of attributes.
  2. It depends on how important that part of reality is within the database.
    • Yes! If you want to give additional information about something, it will be better to model it as an entity, as you can attach extra information to an entity with attributes.
  3. It depends on the amount of data you want to store in the database.
    • No, this is an irrelevant argument.

Self-assessment 3: XML and preservation

Why is XML interesting from a digital preservation point of view?

  1. XML files are non-binary files.
    • Yes, XML files are text files which can be opened and modified with a simple text editor and are independent of any particular software.
  2. XML is self-describing.
    • Yes and no: XML tags allow you to state explicitly the role of a specific part of a document, but interpreting the name of a tag can be problematic. A tag name may lose its meaning after only a couple of years, especially when acronyms are used.
  3. XML files take up less space than JSON.
    • No, JSON was actually developed as a reaction to the verbose nature of XML and drastically reduces the number of characters used to represent data.

Self-assessment 4: XML and HTML

How is XML related to HTML?

  1. XML is a subset of HTML.
    • No, XML is a meta-markup language whereas HTML is simply a markup language.
  2. Both are examples of a markup language.
    • Strictly speaking, no: XML is a meta-markup language, which means you can define your own tags. HTML is simply a markup language with predefined tags.
  3. XML was developed as a reaction to the evolution of HTML from markup to makeup.
    • Yes! Towards the end of the 1990s, browsers introduced tags such as <blink>, which had a purely aesthetic role and undermined the potential for a smarter Web.