Data profiling and cleaning

Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 1 – Module 5: Data profiling and cleaning

OpenRefine is the Excel of large data, and well-suited for metadata records.

free, open-source software available online
- formerly Google Refine
- now maintained by an open-source community
uses a table-oriented interface
- can import CSV, XML, Excel…
- can export CSV, RDF…
great for batch operations on large datasets

Installing OpenRefine should be easy,
but ask assistance when in doubt.

Follow the online installation instructions.
- OpenRefine works on Windows, macOS, and Linux.
- OpenRefine requires Java.
If it doesn’t work, ask for help.
- in the OpenRefine forums
- from your local IT expert

Click the diamond icon to start,
and OpenRefine open in your browser.

OpenRefine is a local application.
Its interface displays in your browser.
- If it doesn’t start right away,
  navigate to http://localhost:3333/.
Your data stays safely with you.
- Despite having an in-browser interface,
  OpenRefine never lets your data leave your computer.

Refine offers several options
to create a new project.

As an example, we will use metadata
from the Schoenberg Database of Manuscripts.
- We use an older dataset with several quality problems.
Download the dataset and save it locally.
Choose Create Project, and select the downloaded file schoenberg.zip using Choose Files.
- Then click “Next”.

Your data never leaves your computer.

OpenRefine says Uploading data…
- This means transferring your dataset
  into the OpenRefine software.

OpenRefine shows an import preview.
We need to detail the input format.

The preview lets us check the interpretation
of the input before we start working on it.
OpenRefine can guess some details.
- main file format (CSV, TSV, Excel…)
- format-specific properties (field separator…)
We will need to specify some details ourselves.
- With Schoenberg, everything is done automatically.

Choose a name for the project.

Type a name.
Click Create Project.
OpenRefine will now import all data
according to the import settings.

Get acquainted with this dataset
through the OpenRefine interface.

Scroll to the right to see all columns.
The Schoenberg website lists
definitions of all fields.

Get to know the data inside of columns
through filters and facets.

Filters and facets allow you to
only see a specific subset of all rows.
Filters are selections based on
user-created conditions.
Facets are selections based on
automatically created conditions.

With a filter, the user is responsible
for coming up with relevant values.

Click the Seller column triangle
and create a text filter.
- A filter box appears on the left.
Type into the filter box to only show
rows containing that text for Seller.
- Type us to only see rows with sellers that contain us.
Remove a filter by clicking the cross
in its top left corner.

When a filter or facet is active,
actions apply only to matching rows.

With the us text filter on Seller still active,
let’s delete “all” rows.
Click the special All column triangle,
Edit rows, Remove all matching rows.
All rows seem to have disappeared!
- It only seems so because the filter is still on.
- Remove it by clicking its cross.

With OpenRefine, you can never
destroy your original data.

You are always working on a copy of the data.
- Your original dataset remains unaltered.
OpenRefine keeps an undo history of all actions.
- Go to Undo / Redo and pick the step you want to revert to.

Use OpenRefine effectively: first do, then think.
No need to think upfront: when wrong, you can undo.

With a facet, OpenRefine automatically
comes up with relevant values.

Not all questions can be answered by filters.
- What if you don’t know values to look for?
- What other values besides us should we try?
- How do Primary and Secondary seller differ?
Create a facet: Primary seller, Facets, Text facet.
- OpenRefine lets you select possible values on the left.
You can combine multiple filters and facets.
- Remove all when confused.

The Author column has multiple values
for some of its records.

Multi-valued fields are harder to deal with.
- OpenRefine considers Guillaume de Lorris|Jean de Meun
  a single value, even though there are two names.
- We want facets to treat them individually!
We need to split the multiple field values
across multiple rows.
- Author, Edit cells, Split multi-valued cells.
- In this dataset, the separator is the pipe token |.

Rows mode: all lines are individual.
Records mode: group values per record.

If we switch to Records mode, actions apply
to all rows belonging to a single record.
- When split, multiple values of a multi-valued cell belong to different rows but the same record.
Filter for a certain author name.

rows mode

only rows with a matching author

records mode

all records where at least one row matches

Fields with a categorizing function
facilitate partitioning of the data.

For example, the Author field allows us to
count the number of manuscripts per author.
- OpenRefine does this automatically through facets.
Yet such fields are particularly prone to errors.
- A typo or spelling variant makes similar values different.
- In such cases, facets are ineffective.

Clustering detects variants automatically
and helps you fix them.

OpenRefine will group similar values together.
- You decide which ones to merge (and which ones not).
Different clustering algorithms possess
different levels of aggressiveness.
- Simple clustering finds few mistakes, and most correctly.
- Advanced clustering finds more, but fewer correctly.
Cluster responsibly.
- OpenRefine is just a tool—you are the operator.

Try the learned techniques
on your own datasets.

The Powerhouse Museum dataset
harbours some interesting challenges.
Don’t be afraid to experiment.
- Heavily rely on undo.
Definitely try clustering where applicable.
- new insights and improved quality in minutes

Self-assessment 1: data location

Where does OpenRefine store your data?

on my hard disk
- Yes, your data never leaves your computer.
in my browser
- No, OpenRefine can access your data in different browsers of the same computer.
in the cloud
- No, your data never leaves your computer.

Self-assessment 2: filters and facets

What are filters and facets in OpenRefine?

They are different names for the same thing.
- No, they are different things.
Facets can result in multiple values.
- No, filters can also result in multiple values.
Creating a filter requires user input.
- Indeed, and facets are created automatically based on the data.

Self-assessment 3: rows and records

Which of the following situations could occur
in an OpenRefine project?

There are fewer rows than records.
- No, each record consists of at least one row.
There is an equal number of rows and records.
- Yes, each record could consist of a single row.
There are more rows than records.
- Yes, records can consist of multiple rows.

Self-assessment 4: clustering

To what do you need to pay attention when clustering?

To not merge syntactically unrelated words.
- No, clusters created by the clustering algorithms will always be syntactically related.
To not create large clusters.
- No, cluster size is irrelevant, as long as the words in the cluster point to a sufficiently similar entity.
To not merge semantically different entities.
- Yes: two words could be related syntactically or phonetically, but represent different entities (such as rockets and lockets).

Get in touch and reuse our materials!

Interactive slides
- Comments or questions in regards to the content?
- On each slide, you will find the possibility to tweet to us and engage with other participants!
Reuse the content
- The content of these modules can be freely downloaded from GitHub. We would be happy to hear from you how you reuse the content in your own classes or workshops.

Linked Data for Librarians

Part 1 – Module 5: Data profiling and cleaning

OpenRefine is the Excel of large data, and well-suited for metadata records.

Installing OpenRefine should be easy, but ask assistance when in doubt.

Click the diamond icon to start, and OpenRefine open in your browser.

Refine offers several options to create a new project.

Your data never leaves your computer.

OpenRefine shows an import preview. We need to detail the input format.

Choose a name for the project.

Get acquainted with this dataset through the OpenRefine interface.

Get to know the data inside of columns through filters and facets.

With a filter, the user is responsible for coming up with relevant values.

When a filter or facet is active, actions apply only to matching rows.

With OpenRefine, you can never destroy your original data.

With a facet, OpenRefine automatically comes up with relevant values.

The Author column has multiple values for some of its records.

Rows mode: all lines are individual. Records mode: group values per record.

Fields with a categorizing function facilitate partitioning of the data.

Clustering detects variants automatically and helps you fix them.

Try the learned techniques on your own datasets.