OpenRefine is the Excel of large data,
and well-suited for metadata records.
free, open-source software available online
- formerly Google Refine
- now maintained by an open-source community
uses a table-oriented interface
- can import CSV, XML, Excel…
- can export CSV, RDF…
great for batch operations on large datasets
Installing OpenRefine should be easy,
but ask assistance when in doubt.
Follow the online installation instructions.
- OpenRefine works on Windows, macOS, and Linux.
- OpenRefine requires Java.
If it doesn’t work, ask for help.
Click the diamond icon to start,
and OpenRefine open in your browser.
- OpenRefine is a local application.
Its interface displays in your browser.
Your data stays safely with you.
Despite having an in-browser interface,
OpenRefine never lets your data leave your computer.
Refine offers several options
to create a new project.
Your data never leaves your computer.
OpenRefine says Uploading data…
This means transferring your dataset
into the OpenRefine software.
OpenRefine shows an import preview.
We need to detail the input format.
The preview lets us check the interpretation
of the input before we start working on it.
OpenRefine can guess some details.
- main file format (CSV, TSV, Excel…)
- format-specific properties (field separator…)
We will need to specify some details ourselves.
- With Schoenberg, everything is done automatically.
Choose a name for the project.
Type a name.
Click Create Project.
OpenRefine will now import all data
according to the import settings.
Get acquainted with this dataset
through the OpenRefine interface.
Scroll to the right to see all columns.
The Schoenberg website lists
definitions of all fields.
Get to know the data inside of columns
through filters and facets.
Filters and facets allow you to
only see a specific subset of all rows.
Filters are selections based on
Facets are selections based on
automatically created conditions.
With a filter, the user is responsible
for coming up with relevant values.
Click the Seller column triangle
and create a text filter.
- A filter box appears on the left.
Type into the filter box to only show
rows containing that text for Seller.
us to only see rows
with sellers that contain us.
Remove a filter by clicking the cross
in its top left corner.
When a filter or facet is active,
actions apply only to matching rows.
us text filter on Seller still active,
let’s delete “all” rows.
Click the special All column triangle,
Edit rows, Remove all matching rows.
All rows seem to have disappeared!
- It only seems so because the filter is still on.
- Remove it by clicking its cross.
With OpenRefine, you can never
destroy your original data.
You are always working
on a copy of the data.
- Your original dataset remains unaltered.
OpenRefine keeps an undo history of all actions.
Go to Undo / Redo
and pick the step you want to revert to.
Use OpenRefine effectively:
first do, then think.
No need to think upfront:
when wrong, you can undo.
With a facet, OpenRefine automatically
comes up with relevant values.
Not all questions can be answered by filters.
- What if you don’t know values to look for?
- What other values besides
us should we try?
- How do Primary and Secondary seller differ?
Create a facet:
Primary seller, Facets, Text facet.
OpenRefine lets you select possible values on the left.
You can combine multiple filters and facets.
- Remove all when confused.
The Author column has multiple values
for some of its records.
Multi-valued fields are harder to deal with.
OpenRefine considers Guillaume de Lorris|Jean de Meun
a single value, even though there are two names.
We want facets to treat them individually!
We need to split the multiple field values
across multiple rows.
Author, Edit cells,
Split multi-valued cells.
In this dataset, the separator is the pipe token
Rows mode: all lines are individual.
Records mode: group values per record.
If we switch to Records mode, actions apply
to all rows belonging to a single record.
When split, multiple values of a multi-valued cell
belong to different rows but the same record.
Filter for a certain author name.
- rows mode
- only rows with a matching author
- records mode
- all records where at least one row matches
Fields with a categorizing function
facilitate partitioning of the data.
For example, the Author field allows us to
count the number of manuscripts per author.
- OpenRefine does this automatically through facets.
Yet such fields are particularly prone to errors.
- A typo or spelling variant makes similar values different.
- In such cases, facets are ineffective.
Clustering detects variants automatically
and helps you fix them.
OpenRefine will group similar values together.
- You decide which ones to merge (and which ones not).
Different clustering algorithms possess
different levels of aggressiveness.
- Simple clustering finds few mistakes, and most correctly.
- Advanced clustering finds more, but fewer correctly.
- OpenRefine is just a tool—you are the operator.
Try the learned techniques
on your own datasets.
The Powerhouse Museum dataset
harbours some interesting challenges.
Don’t be afraid to experiment.
Definitely try clustering where applicable.
- new insights and improved quality in minutes
Self-assessment 1: data location
Where does OpenRefine store your data?
on my hard disk
- Yes, your data never leaves your computer.
in my browser
- No, OpenRefine can access your data in different browsers of the same computer.
in the cloud
- No, your data never leaves your computer.
Self-assessment 2: filters and facets
What are filters and facets in OpenRefine?
They are different names for the same thing.
- No, they are different things.
Facets can result in multiple values.
- No, filters can also result in multiple values.
Creating a filter requires user input.
- Indeed, and facets are created automatically based on the data.
Self-assessment 3: rows and records
Which of the following situations could occur
in an OpenRefine project?
There are fewer rows than records.
- No, each record consists of at least one row.
There is an equal number of rows and records.
- Yes, each record could consist of a single row.
There are more rows than records.
- Yes, records can consist of multiple rows.
Self-assessment 4: clustering
To what do you need to pay attention when clustering?
To not merge syntactically unrelated words.
- No, clusters created by the clustering algorithms will always be syntactically related.
To not create large clusters.
- No, cluster size is irrelevant, as long as the words in the cluster point to a sufficiently similar entity.
To not merge semantically different entities.
- Yes: two words could be related syntactically or phonetically, but represent different entities (such as rockets and lockets).
Get in touch and reuse our materials!
- Interactive slides
- Comments or questions in regards to the content?
- On each slide, you will find the possibility
to tweet to us and engage with other participants!
- Reuse the content
- The content of these modules can be freely downloaded from GitHub.
We would be happy to hear from you how you reuse the content in your
own classes or workshops.