Understanding data models

[How to catalog all of this quickly?] — ©2015 Shipscompass

[Overview of four data-models: Tabular Data, Relational Model, XML and RDF] — ©2014 Seth van Hooland and Ruben Verborgh

Comparing data models

World view of each model
- Models help us abstract from reality. You can think of a data model as a particular pair of glasses that influences the way you see the world.
Implementation and serialization formats
- Models are made concrete through file formats, query languages and software.
How does the model address updates and sharing?

[You can think of a dish as the model and a recipe as a serialization format.] — ©2015 Myan Brenn

Models and serialization formats

Tabular data: CSV, TSV
Relational model: binary files
Meta-markup languages: SGML, XML
RDF: Turtle, N-Triples, RDF/XML

First model: tabular data

Also referred to as flat files
Intuitive approach to organize data
Represents the world as one big table or spreadsheet
Consists of columns and rows; their intersection gives meaning to the data

Tabular data—example

Title	Creator	Date	Collection
Guernica	Pablo Picasso	1937	Museo Reina Sofia
First Communion	Picasso	1895	Museo Picasso
Puppy	Koons, Jeff	1992	Guggenheim

Serializing tabular data

Common serialization formats for tabular data include CSV and TSV

Example of our data in CSV:

title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
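The quotes in the last row matter: the comma inside "Koons, Jeff" is data, not a column separator. As a minimal sketch, the example can be read with Python's standard csv module; the variable names are illustrative and not part of the course material.

```python
import csv
import io

# The CSV text from the example, held in memory for illustration.
CSV_TEXT = '''title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
'''

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

for row in rows:
    print(row["title"], "|", row["creator"])
```

A naive split on commas would break the third row into five fields; a CSV parser keeps the quoted creator together.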

Limits and possibilities of tabular data?

Data quality: prone to inconsistencies!
Search and retrieval: ineffective
Updates—change: easy
Distribution: easy
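
The first two points are visible in the example table itself: Picasso is recorded under two spellings, so an exact-match search misses one of his works. A small Python sketch, with the rows copied from the table and a hypothetical helper function:

```python
# Rows from the example table: (title, creator, date, collection).
ROWS = [
    ("Guernica", "Pablo Picasso", "1937", "Museo Reina Sofia"),
    ("First Communion", "Picasso", "1895", "Museo Picasso"),
    ("Puppy", "Koons, Jeff", "1992", "Guggenheim"),
]

def titles_by_creator(rows, creator):
    """Return titles whose creator field matches the given string exactly."""
    return [title for title, row_creator, _date, _coll in rows
            if row_creator == creator]

# Each spelling finds only part of Picasso's works.
print(titles_by_creator(ROWS, "Pablo Picasso"))
print(titles_by_creator(ROWS, "Picasso"))
```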

How do we overcome the inconsistencies and poor search within tabular data?

Second model: relational model

Most successful model to manage structured data
First described by Edgar Codd in 1969
Rare to find an information system which is not driven by a relational database
Mature technology which is here to stay—don’t believe the people who claim NoSQL or triplestores will take over

[Rethinking tabular data as a relational database.] — ©2014 Seth van Hooland and Ruben Verborgh

Mapping reality to a database

Worldview consisting of three building blocks:

Entities: groups of information that share properties and vary independently of other groups
Attributes: describe properties of entities
Relationships: create connections between entities

Designing a relational database

Identify the entity types (e.g. Work, Creator)
For every entity, create a table which will contain the properties of the entity
Establish the relationships between the tables

It is sometimes more an art than a science!
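
As a rough illustration of these three steps, the following Python sketch splits the flat example table into Work, Creator and Collection tables. The numeric ids are simply assigned in order of first appearance. Treating "Picasso" and "Pablo Picasso" as the same creator is a manual cataloging decision, encoded here in a hypothetical ALIASES mapping.

```python
# Flat rows from the tabular example: (title, creator, date, collection).
FLAT = [
    ("Guernica", "Pablo Picasso", "1937", "Museo Reina Sofia"),
    ("First Communion", "Picasso", "1895", "Museo Picasso"),
    ("Puppy", "Koons, Jeff", "1992", "Guggenheim"),
]

# Hypothetical cleanup rule: map variant spellings to one preferred form.
ALIASES = {"Picasso": "Pablo Picasso"}

def get_id(table, name):
    """Return the id for name, adding it to the table on first sight."""
    if name not in table:
        table[name] = len(table) + 1
    return table[name]

creators, collections, works = {}, {}, []
for title, creator, date, collection in FLAT:
    creator = ALIASES.get(creator, creator)
    works.append({
        "title": title,
        "creator": get_id(creators, creator),            # link to Creator
        "collection": get_id(collections, collection),   # link to Collection
        "year": date,
    })

print(creators)
print(works)
```

Each creator is now stored once, and both Picasso works point to the same id.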

Redesigning our catalog

©2014 Seth van Hooland and Ruben Verborgh

Implementation

Relational Database Management System (RDBMS) software used to implement the model
Common RDBMS products include MySQL, Oracle and Microsoft SQL Server
Cataloging software or Integrated Library Systems (ILS) add a visual interface on top of an RDBMS

SQL

Structured Query Language used to insert, query, update and delete data, for schema creation and data access control
No need to become an SQL wizard, but mastering the basic commands can be very helpful to create exports from your cataloging back-end
Standardized in theory (ISO/IEC 9075) but not so much in practice

Creating a table with SQL

CREATE TABLE Work (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(100),
  creator INT,
  collection INT,
  year CHAR(4),
  style VARCHAR(40)
);

Encoding data into a table with SQL

INSERT INTO Work (title, creator, collection, year, style)
VALUES ('Guernica', 43, 20, '1937', 'Cubism');

Note that the values 43 and 20 refer to the primary keys from the tables Creator and Collection
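
The two statements above can be tried without installing a database server. The sketch below uses SQLite through Python's standard sqlite3 module, so the dialect differs slightly from the slides: SQLite has no AUTO_INCREMENT keyword, and an INTEGER PRIMARY KEY column is numbered automatically instead. The Creator and Collection tables and their contents are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE Creator (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Collection (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Work (
        id INTEGER PRIMARY KEY,
        title VARCHAR(100),
        creator INT REFERENCES Creator(id),
        collection INT REFERENCES Collection(id),
        year CHAR(4),
        style VARCHAR(40)
    );
""")

# Ids 43 and 20 mirror the values used in the INSERT statement above.
conn.execute("INSERT INTO Creator (id, name) VALUES (43, 'Pablo Picasso')")
conn.execute("INSERT INTO Collection (id, name) VALUES (20, 'Reina Sofia')")
conn.execute(
    "INSERT INTO Work (title, creator, collection, year, style) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Guernica", 43, 20, "1937", "Cubism"),
)

work_rows = conn.execute("SELECT id, title, creator, year FROM Work").fetchall()
print(work_rows)
```

The question marks are placeholders filled in by the driver, which avoids quoting mistakes when values come from user input.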

Search and retrieval

Relational model is extremely performant
End-users interact with a GUI but it can be useful to know some basic SQL queries

Example: find all the titles of the works by Picasso

SELECT title FROM Work WHERE creator = 43;
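
The query above requires knowing that Picasso has id 43. A join lets you search by name instead. The following sketch again uses SQLite via Python; the Creator table, the id 44 and the extra works are assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Creator (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Work (id INTEGER PRIMARY KEY, title TEXT, creator INT);
    INSERT INTO Creator (id, name) VALUES (43, 'Pablo Picasso');
    INSERT INTO Creator (id, name) VALUES (44, 'Jeff Koons');
    INSERT INTO Work (title, creator) VALUES ('Guernica', 43);
    INSERT INTO Work (title, creator) VALUES ('First Communion', 43);
    INSERT INTO Work (title, creator) VALUES ('Puppy', 44);
""")

# Follow the relationship from Work.creator to Creator.id.
query = """
    SELECT Work.title
    FROM Work JOIN Creator ON Work.creator = Creator.id
    WHERE Creator.name = ?
    ORDER BY Work.title
"""
picasso_titles = [row[0] for row in conn.execute(query, ("Pablo Picasso",))]
print(picasso_titles)
```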

Practice on your own!

As a librarian, it can be useful to have a basic knowledge of SQL
Websites such as http://sqlfiddle.com allow you to experiment on your own!
Copy/paste the SQL code from the previous slides in order to create a table, insert metadata and query them

Dealing with change

Updating the schema of a database can be very complex
Apart from ensuring the normalization of the modified schema, the modifications might also affect public front-ends
Quick-and-dirty ad hoc solutions are often adopted, with disastrous consequences over time

Sharing your Data

Both data and the schema are locked up in a binary file, coupled to a specific software—you can’t just copy/paste a database and give it to someone else!

Leaving aside the technical issues of migrating and integrating databases, the main complexity resides in the semantics: a database schema is rarely well documented in practice, making it very complex to understand how sometimes hundreds of tables are connected
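
One pragmatic way around the binary lock-in is to export query results to an open text format such as CSV. The sketch below is only an illustration under assumed table and column names; it uses SQLite and writes to an in-memory buffer rather than a real file.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Work (id INTEGER PRIMARY KEY, title TEXT, year TEXT);
    INSERT INTO Work (title, year) VALUES ('Guernica', '1937');
    INSERT INTO Work (title, year) VALUES ('Puppy', '1992');
""")

cursor = conn.execute("SELECT title, year FROM Work ORDER BY year")

# For a real export, replace the buffer with open("export.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([col[0] for col in cursor.description])  # header from column names
writer.writerows(cursor)

export = buffer.getvalue()
print(export)
```

The export carries the data but not the schema, so the semantic problem described above remains: the recipient still has to be told what each column means.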

[Tim Berners-Lee at his desk in CERN, 1994] — ©1994 CERN

Third model: markup

Origins of markup lie in the tradition of typesetting, where an author marks up a manuscript in order to explain how it should be printed

Certain passages, such as the titles of chapters, should be printed in bold, whereas footnotes should be printed in a smaller font than the normal paragraphs

Two options: either you apply makeup or markup…

Applying makeup

The quick and dirty way…
In the context of HTML, applying makeup would imply the following:
- <font size="20"><b>Linked Data for librarians</b></font>
We simply indicate how that specific string of characters should be displayed—it’s makeup!

Applying markup

Indicate the role a part of a document plays and define separately the aesthetics of that role
Let’s use markup on our HTML example:
- h1 {font-size: 20pt; font-weight: bold}
  …
  <h1>Linked Data for librarians</h1>
Defining the layout of structural elements of a document (e.g. h1) opens a new world of possibilities

[Russian puppets to illustrate mark-up] — ©2014 Rainer Stropek

HTML detour

1989: Tim Berners-Lee was inspired by SGML but simplified it radically by proposing a fixed set of tags to represent the structural components of a Web page
Examples: <head>, <body>, <h1>, etc.
Most people forgot all about SGML, but its influence has been enormous…

[SGML and HTML relationship] — ©2004 Eric Clapton

[Browser war] — ©1997 Carlos Avila Gonzalvez

XML

End of the 1990s => desire to focus again on the structure and not the layout of the Web
XML 1.0: W3C recommendation in 1998
Effort was made to maintain 80% of SGML’s functionality with only 20% of its complexity
Big impact: open standard which is platform and application independent

Modeling XML

Structure documents with
- elements: serialized as tags surrounded by angle brackets (<tag>)
- attributes: key/value pairs that modify an element
Each document starts with a declaration, followed by a single root element:
- <?xml version="1.0" encoding="UTF-8"?>
- <Art title="Modern art"/>

Let’s add a work to our XML catalog of art objects

Note how we can model all of the metadata as attributes:

<?xml version="1.0" encoding="UTF-8"?>
<Art title="Modern art">
  <Work title="Guernica" year="1937" creator="Pablo Picasso"
        collection="Reina Sofia" location="Madrid"/>
</Art>

Let’s model everything as child elements

<Art title="Modern art">
  <Work>
    <Title>Guernica</Title>
    <CreationDate><Year>1937</Year></CreationDate>
    <Creator>
      <FirstName>Pablo</FirstName>
      <LastName>Picasso</LastName>
    </Creator>
    <Collection>
      <Name>Reina Sofia</Name>
      <Location>Madrid</Location>
    </Collection>
  </Work>
</Art>

Let’s go for a compromise!

<Art title="Modern art">
  <Work title="Guernica" year="1937">
    <Creator firstName="Pablo" lastName="Picasso"/>
    <Collection name="Reina Sofia" location="Madrid"/>
  </Work>
</Art>
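
To see how software consumes such a document, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The XML is the compromise version, written with a regular opening <Art> tag so that it is well-formed; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Art title="Modern art">
  <Work title="Guernica" year="1937">
    <Creator firstName="Pablo" lastName="Picasso"/>
    <Collection name="Reina Sofia" location="Madrid"/>
  </Work>
</Art>"""

root = ET.fromstring(XML_TEXT)  # root is the <Art> element

# Attributes are read with .get(); child elements are reached with .find().
work = root.find("Work")
creator = work.find("Creator")
summary = "{} ({}) by {} {}".format(
    work.get("title"), work.get("year"),
    creator.get("firstName"), creator.get("lastName"),
)
print(summary)
```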

Schemas

Different languages exist to express schemas, but XML Schema (XSD) is the most popular

<xsd:element name="Work">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="CreatorRef" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="title" type="xsd:string"/>
    <xsd:attribute name="year" type="xsd:gYear"/>
    <xsd:attribute name="collectionId" type="xsd:IDREF"/>
  </xsd:complexType>
</xsd:element>

Namespaces

Mechanism to disambiguate the meaning of elements and attributes across schemas
Namespaces are indicated with the help of the reserved XML attribute xmlns:

<Art title="Modern art" xmlns:dc="http://purl.org/dc/terms/"
     xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <Work>
    <dc:creator>Pablo Picasso</dc:creator>
    <dc:title>Guernica</dc:title>
    <vra:technique>Oil painting</vra:technique>
  </Work>
</Art>
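
A parser does not match on the prefix (dc, vra) but on the full namespace URI behind it. The following sketch shows this with Python's standard ElementTree module; the NS dictionary is a local naming choice and could use different prefixes for the same URIs.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Art title="Modern art"
     xmlns:dc="http://purl.org/dc/terms/"
     xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <Work>
    <dc:creator>Pablo Picasso</dc:creator>
    <dc:title>Guernica</dc:title>
    <vra:technique>Oil painting</vra:technique>
  </Work>
</Art>"""

# Prefixes are only shorthand; lookups are resolved to the namespace URI.
NS = {
    "dc": "http://purl.org/dc/terms/",
    "vra": "http://www.vraweb.org/vracore4.htm",
}

root = ET.fromstring(XML_TEXT)
title = root.find("Work/dc:title", NS).text
technique = root.find("Work/vra:technique", NS).text
print(title, "-", technique)

# Internally, ElementTree stores the expanded name in {uri}local form.
creator_tag = root.find("Work/dc:creator", NS).tag
print(creator_tag)
```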

Search and retrieval

XML has its own query language: XPath
Allows you to traverse XML trees and collect element and attribute values
Examples:
- /Art/Work/Creator
- Creator/LastName
- Work/descendant::LastName
- Work/@year
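
These expressions can be approximated with Python's standard ElementTree module, which implements only a subset of XPath: it has no descendant:: axis and no @attribute step, so those two are expressed differently below. The sample document is a shortened version of the child-element catalog, with a year attribute added to Work for the sake of the last example.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Art title="Modern art">
  <Work year="1937">
    <Title>Guernica</Title>
    <Creator>
      <FirstName>Pablo</FirstName>
      <LastName>Picasso</LastName>
    </Creator>
  </Work>
</Art>"""

root = ET.fromstring(XML_TEXT)  # root is the <Art> element

# /Art/Work/Creator, expressed relative to the root element.
creators = root.findall("Work/Creator")

# Creator/LastName, evaluated from each Creator element.
last_names = [c.findtext("LastName") for c in creators]

# Work/descendant::LastName: ElementTree spells the descendant axis as .//
descendants = [e.text
               for w in root.findall("Work")
               for e in w.findall(".//LastName")]

# Work/@year: attribute steps are not supported, so read it with .get().
years = [w.get("year") for w in root.findall("Work")]

print(last_names, descendants, years)
```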

When to use a database or XML?

The quick answer is: it depends on the context

Read this paper by a historian who explains the pros and cons of each approach to model an inventory

XML is often criticized for its verbose nature, and JSON has become more popular to represent structured data

Limitations?

Inconsistencies: usage of XML Schema to validate data is powerful
Search and retrieval: less performant than relational databases
Updates—change: as painful as with relational databases
Distribution: open standard, but still one needs to understand the schema

[Map] — ©1740 Seutter Map of India, Pakistan, Tibet and Afghanistan

Fourth model: RDF

Resource Description Framework (RDF)
Worldview consisting of a gigantic ever-expanding graph of triples
A triple consists of a subject, a predicate and an object
Any resource (the subject) can have a relationship (the predicate) to any other resource (the object)
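
The triple structure is simple enough to mimic with plain tuples. The sketch below is not an RDF library, only an illustration of the idea that a graph is a set of three-part statements. The gh:, dc: and viaf: identifiers are the ones used in the Turtle examples of this module; the rdf:type statement is added purely as an example.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

# A graph is just a set of triples, written here with prefixed names.
graph = {
    Triple("gh:48", "dc:creator", "viaf:5035739"),
    Triple("gh:48", "dc:title", '"Puppy"'),
}

# The object of one triple can become the subject of another.
graph.add(Triple("viaf:5035739", "rdf:type", "foaf:Person"))

subjects = sorted({t.subject for t in graph})
print(subjects)
```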

[Graph] — ©2014 Seth van Hooland and Ruben Verborgh

Model

Through a radically simplified data model, the semantics are made explicit by the triple itself
Both databases and XML are based on the principle that only data conforming to a locally defined schema may exist => closed world assumption
RDF sails under the open world paradigm flag
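
The contrast, as framed on this slide, can be sketched in a few lines of Python. The relational table rejects a statement its schema does not foresee, whereas a set of triples accepts a new predicate as ordinary data. Table, column and ex: names are invented for the illustration.

```python
import sqlite3

# Schema-bound: the table definition decides what may be said.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Work (id INTEGER PRIMARY KEY, title TEXT)")
try:
    conn.execute(
        "INSERT INTO Work (title, technique) VALUES ('Guernica', 'Oil painting')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True  # there is no 'technique' column in the schema

# Open: a graph is a set of triples, so a new predicate is just new data.
graph = {("ex:Guernica", "dc:title", "Guernica")}
graph.add(("ex:Guernica", "ex:technique", "Oil painting"))

print(rejected, len(graph))
```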

Serialization—different options

RDF/XML
- Developed in 2001, when XML was becoming omnipresent. Now considered too verbose and hard to parse
Turtle
- Allows you to express RDF triples in a compact and natural text form. The components (subject, predicate, object) are separated by whitespace, and each triple ends with a dot

Turtle example

Let’s see how we can express the metadata explaining Jeff Koons created the artwork Puppy:

@prefix gh: <http://www.guggenheim.org/new-york/collections/collection-online/artwork/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix viaf: <http://viaf.org/viaf/> .
gh:48 dc:creator viaf:5035739 .
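
Prefixes are pure abbreviation: each prefixed name stands for a full IRI. The following Python sketch expands the triple above by hand and prints it in the style of N-Triples, where nothing is abbreviated. The expand function is a hypothetical helper, not part of any RDF toolkit.

```python
# Prefix table copied from the @prefix lines above.
PREFIXES = {
    "gh": "http://www.guggenheim.org/new-york/collections/collection-online/artwork/",
    "dc": "http://purl.org/dc/terms/",
    "viaf": "http://viaf.org/viaf/",
}

def expand(prefixed_name):
    """Turn a prefixed name such as dc:creator into a full IRI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

triple = ("gh:48", "dc:creator", "viaf:5035739")
full = tuple(expand(term) for term in triple)

# N-Triples style: one triple per line, every IRI in angle brackets.
print(" ".join("<{}>".format(iri) for iri in full) + " .")
```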

Turtle syntax

Multiple statements about the same subject can be written tersely: use a semicolon when the subject is repeated, and a comma when both the subject and the predicate are repeated

gh:48 dc:creator viaf:5035739;
dc:title "Puppy".
viaf:5035739 :influencedBy viaf:15873,
viaf:95794725.
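
The shorthand above stands for four separate triples. One way to picture this is as a nested structure (subject, then predicate, then a list of objects) that is flattened back into full statements. The Python sketch below does exactly that; it illustrates the grouping and is not a Turtle parser.

```python
# Nested form mirroring the Turtle shorthand:
# subject -> predicate -> list of objects.
SHORTHAND = {
    "gh:48": {
        "dc:creator": ["viaf:5035739"],
        "dc:title": ['"Puppy"'],
    },
    "viaf:5035739": {
        ":influencedBy": ["viaf:15873", "viaf:95794725"],
    },
}

def flatten(shorthand):
    """Expand the nested form into one (subject, predicate, object) per triple."""
    return [
        (subject, predicate, obj)
        for subject, predicates in shorthand.items()
        for predicate, objects in predicates.items()
        for obj in objects
    ]

triples = flatten(SHORTHAND)
for s, p, o in triples:
    print(s, p, o, ".")
```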

Turtle syntax

RDF includes literal values ("Puppy") in its model as well, since some properties do not point to another resource but to a non-decomposable value. Also, note how :influencedBy has an empty namespace prefix, which indicates that it is locally defined:

gh:48 dc:creator viaf:5035739;
dc:title "Puppy".
viaf:5035739 :influencedBy viaf:15873,
viaf:95794725.

Search and retrieval in RDF

SPARQL: recursive acronym for SPARQL Protocol and RDF Query Language
Example: let’s retrieve all triples which have Picasso as the subject:
- SELECT ?predicate ?object WHERE {
  <http://dbpedia.org/resource/Pablo_Picasso>
  ?predicate ?object.
  }
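
A SPARQL query is essentially a triple pattern with variables in some positions. The Python sketch below imitates this over a tiny hand-made list of triples, with None playing the role of a variable. It is not a SPARQL engine, and the ex: predicates and objects are invented for illustration rather than taken from DBpedia.

```python
PICASSO = "http://dbpedia.org/resource/Pablo_Picasso"

# A tiny hand-made graph of (subject, predicate, object) tuples.
GRAPH = [
    (PICASSO, "ex:name", "Pablo Picasso"),
    (PICASSO, "ex:created", "ex:Guernica"),
    ("ex:Guernica", "ex:title", "Guernica"),
]

def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None matches anything."""
    return [
        (s, p, o) for s, p, o in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# In the spirit of: SELECT ?predicate ?object WHERE { <Picasso> ?predicate ?object . }
results = [(p, o) for _s, p, o in match(GRAPH, subject=PICASSO)]
print(results)
```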

Implementation

Triplestore: database used to store and query RDF triples
Either natively built or on top of existing relational databases
Despite recent developments, performance remains an issue

Limits of RDF?

Inconsistencies: fantastic in theory, but often problematic in practice
Search and retrieval: tremendous possibilities, but complex to execute
Updates—change: new information can be added at any point
Distribution: where the model shines!

[Table] — ©2014 Seth van Hooland and Ruben Verborgh

Self-assessment 1: tabular data

Creating metadata as tabular data is a bad idea:

If you want to share your metadata.
- No, tabular data is platform-independent and very easy to exchange
If you want to avoid spelling mistakes.
- Yes, flat files do not offer the possibility to ensure consistency when encoding metadata
If you want to express hierarchy in your metadata.
- Yes, tabular data does not offer a way to express hierarchy. Relational databases or XML are a better fit in that case

Self-assessment 2: entities or attributes

How do you decide whether to represent something as an entity or as an attribute within a database schema?

Opt for an attribute if it is an important aspect of the reality you are modeling.
- No! In that case an entity would be a better choice as you can further document the entity with the help of attributes.
It depends on how important that part of reality is within the database.
- Yes! If you want to give additional information about something, it will be better to model it as an entity, as you can attach extra information to an entity with attributes.
It depends on the amount of data you want to store in the database.
- No, this is an irrelevant argument.

Self-assessment 3: XML and preservation

Why is XML interesting from a digital preservation point of view?

XML files are non-binary files.
- Yes, XML files are text files which can be opened and modified with a simple text editor and are independent of any particular software.
XML is self-describing.
- Yes and no: XML tags allow you to explicitly state the role of a specific part of a document, but the interpretation of the name of a tag might be problematic. The name of a tag might quickly lose its meaning after a couple of years, especially when acronyms are used.
XML files take up less space than JSON.
- No, JSON was actually developed as a reaction to the verbose nature of XML and drastically reduces the number of characters used to represent data.

Self-assessment 4: XML and HTML

How is XML related to HTML?

XML is a subset of HTML.
- No, XML is a meta-markup language whereas HTML is simply a markup language.
Both are examples of a markup language.
- The precise answer is no, as XML is a meta-markup language: you have the possibility to define your own tags. HTML is simply a markup language with pre-defined tags.
XML was developed as a reaction to the evolution of HTML from markup to makeup.
- Yes! Towards the end of the 1990s, browsers proposed tags such as <blink> which merely had an aesthetic role, undermining the potential for a smarter Web.

Linked Data for Librarians

Part 1 – Module 2: Understanding data models
