Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 1 – Module 5: Data profiling and cleaning

Institute of Museum and Library Services · Drexel University College of Computing & Informatics

OpenRefine is the Excel of large data, and well-suited for metadata records.

Installing OpenRefine should be easy,
but ask for assistance when in doubt.

Click the diamond icon to start,
and OpenRefine opens in your browser.

[the OpenRefine interface]

OpenRefine offers several options
to create a new project.

Your data never leaves your computer.

OpenRefine shows an import preview.
We need to specify the input format.

Choose a name for the project.

Get acquainted with this dataset
through the OpenRefine interface.

Get to know the data inside of columns
through filters and facets.

With a filter, the user is responsible
for coming up with relevant values.
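
To make the idea of a filter concrete, here is a minimal Python sketch of what a text filter does conceptually. It is not OpenRefine code: the sample rows, the column names and the helper `text_filter` are invented for illustration.

```python
# Hypothetical sample data; not the dataset used in this module.
rows = [
    {"Title": "Linked Data", "Author": "Smith, John"},
    {"Title": "Metadata Basics", "Author": "Doe, Jane"},
    {"Title": "Cataloguing", "Author": "SMITH, J."},
]


def text_filter(rows, column, query):
    """Keep the rows whose cell in `column` contains `query`, ignoring case."""
    needle = query.lower()
    return [row for row in rows if needle in (row.get(column) or "").lower()]


# The user has to supply the search term; the tool does not suggest it.
matches = text_filter(rows, "Author", "smith")
print(len(matches), "matching rows")
```

The point of the sketch is that the query string comes from the user: a filter only finds what you already know to look for.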

When a filter or facet is active,
actions apply only to matching rows.

With OpenRefine, you can never
destroy your original data.

  1. You are always working on a copy of the data.
    • Your original dataset remains unaltered.
  2. OpenRefine keeps an undo history of all actions.
    • Go to Undo / Redo and pick the step you want to revert to.
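
The two guarantees above can be illustrated with a small Python sketch. It is only a conceptual model, assuming a project that stores a full snapshot after every action; how OpenRefine implements its history internally is not covered here, and the class `Project` and its methods are hypothetical names.

```python
import copy


class Project:
    """Toy model: work on a copy and remember every intermediate state."""

    def __init__(self, source_rows):
        # Step 0 is a copy of the source, so the source itself is never touched.
        self._states = [copy.deepcopy(source_rows)]
        self._position = 0

    @property
    def rows(self):
        return self._states[self._position]

    def apply(self, operation):
        """Run `operation` on a copy of the current rows and record the result."""
        new_rows = operation(copy.deepcopy(self.rows))
        # Applying an action after an undo discards the steps that were undone.
        del self._states[self._position + 1:]
        self._states.append(new_rows)
        self._position += 1

    def revert_to(self, step):
        """Jump back (or forward) to any recorded step; step 0 is the import."""
        if not 0 <= step < len(self._states):
            raise IndexError("no such step in the history")
        self._position = step


original = [{"Author": " Smith, John "}]
project = Project(original)
project.apply(lambda rows: [{**r, "Author": r["Author"].strip()} for r in rows])
project.apply(lambda rows: [{**r, "Author": r["Author"].upper()} for r in rows])
project.revert_to(1)  # undo the upper-casing, keep the trimming
```

After `revert_to(1)` the project holds the trimmed value, while `original` still contains the untrimmed string it started with.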

Use OpenRefine effectively: first do, then think.
No need to think upfront: when wrong, you can undo.

With a facet, OpenRefine automatically
comes up with relevant values.
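
As a rough illustration, a text facet can be thought of as a count of every distinct value in a column. The Python sketch below shows that idea only; the sample rows, the `Category` column and the helper `text_facet` are assumptions made for this example, not part of OpenRefine.

```python
from collections import Counter

# Hypothetical sample data; not the dataset used in this module.
rows = [
    {"Title": "Book A", "Category": "History"},
    {"Title": "Book B", "Category": "Science"},
    {"Title": "Book C", "Category": "History"},
    {"Title": "Book D", "Category": ""},
]


def text_facet(rows, column):
    """Count how often each distinct value occurs in `column`."""
    return Counter(row.get(column) or "(blank)" for row in rows)


facet = text_facet(rows, "Category")
for value, count in facet.most_common():
    print(value, count)
```

Unlike a filter, no search term is needed: the list of values is derived from the data itself, which also makes blanks and unexpected variants visible.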

The Author column has multiple values
for some of its records.

Rows mode: all lines are individual.
Records mode: group values per record.
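
The difference between the two modes can be sketched in Python. The sketch assumes that a row with a blank cell in the first (key) column continues the record started by the closest row above it that has a value there; the sample rows and the helper `group_into_records` are invented for illustration.

```python
# Hypothetical sample data: the second row only adds a second author to "Book A".
rows = [
    {"Title": "Book A", "Author": "Smith, John"},
    {"Title": "", "Author": "Doe, Jane"},
    {"Title": "Book B", "Author": "Brown, Ann"},
]


def group_into_records(rows, key_column):
    """Group consecutive rows into records, starting a new record at each key value."""
    records = []
    for row in rows:
        if row.get(key_column) or not records:
            records.append([row])    # a filled key cell starts a new record
        else:
            records[-1].append(row)  # a blank key cell continues the previous one
    return records


records = group_into_records(rows, "Title")
print(len(rows), "rows,", len(records), "records")
```

In rows mode the three lines are handled one by one; in records mode the first two are treated as a single record with two authors.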

Fields with a categorizing function
facilitate partitioning of the data.

Clustering detects variants automatically
and helps you fix them.
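
One common way to detect variants is key collision: every value is reduced to a normalized key, and values that end up with the same key are proposed as a cluster. The Python sketch below is a simplified approximation of such a fingerprint key, not OpenRefine's actual implementation; the helpers `fingerprint` and `cluster` and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a key that ignores case, accents, punctuation and word order."""
    value = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then sort and de-duplicate the remaining words.
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))


def cluster(values):
    """Return the groups of distinct values that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample values.
values = ["Smith, John", "John Smith", "john smith ", "Doe, Jane"]
clusters = cluster(values)
print(clusters)
```

The method only compares spelling, so a proposed cluster still needs a human check that all variants refer to the same entity before they are merged.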

Try the learned techniques
on your own datasets.

Self-assessment 1: data location

Where does OpenRefine store your data?

  1. on my hard disk
    • Yes, your data never leaves your computer.
  2. in my browser
    • No, OpenRefine can access your data from different browsers on the same computer.
  3. in the cloud
    • No, your data never leaves your computer.

Self-assessment 2: filters and facets

What are filters and facets in OpenRefine?

  1. They are different names for the same thing.
    • No, they are different things.
  2. Facets can result in multiple values.
    • No, filters can also result in multiple values.
  3. Creating a filter requires user input.
    • Indeed, and facets are created automatically based on the data.

Self-assessment 3: rows and records

Which of the following situations could occur
in an OpenRefine project?

  1. There are fewer rows than records.
    • No, each record consists of at least one row.
  2. There is an equal number of rows and records.
    • Yes, each record could consist of a single row.
  3. There are more rows than records.
    • Yes, records can consist of multiple rows.

Self-assessment 4: clustering

To what do you need to pay attention when clustering?

  1. To not merge syntactically unrelated words.
    • No, clusters created by the clustering algorithms will always be syntactically related.
  2. To not create large clusters.
    • No, cluster size is irrelevant, as long as the words in the cluster point to a sufficiently similar entity.
  3. To not merge semantically different entities.
    • Yes: two words could be related syntactically or phonetically, but represent different entities (such as rockets and lockets).

Get in touch and reuse our materials!