Linked Data for Librarians

by Seth van Hooland and Ruben Verborgh

Part 2 – Module 2: Metadata enriching

Institute of Museum and Library Services
Drexel University College of Computing & Informatics

[Mining for semantics.]
©2015 Bureau of Land Management Oregon and Washington State

Strategies for metadata creation

The LIS community has to fundamentally rethink how to create added-value metadata in contexts where it is impossible to continue with a traditional manual cataloging approach

Other communities have illustrated over the last decade how algorithms, crowdsourcing or outsourcing via micro-payments can provide alternative strategies

Digital Humanities are embracing scale

DH community in particular can inspire libraries to embrace the avalanche of digital collections not as a problem, but as an opportunity

Methods and tools developed by the DH community to make sense out of very big volumes of non-structured text can also be reused for automated metadata creation

What can you experiment with as a librarian?

Two low-hanging fruits from the information extraction field, often used in DH research:

  • Named-Entity Recognition (NER)
  • Topic Modeling (TM)

Both methods can be used with a minimum of resources and technical experience. This module gives you the opportunity to experiment on your own!

[Mining for semantics.]
©2017 Robert K. Nelson
[Distant reading in a library catalog.]
©2017 DH at Notre Dame University

With great power comes great responsibility.

Spider-Man, 1962

The increased availability of data and text-mining technologies has given a tremendous boost to quantitative methods in the humanities. The ease with which certain technologies can be applied holds a danger, however: the tool increasingly defines the method, and even the research question itself. The DH community has been accused of being too tool-driven.

Librarians as data curators

[DH debate in the NYT.]
©2012 Stanley Fish

Named-Entity Recognition (NER)

Consider the sentence: On 25 September 2006, we visited Washington to see the White House.

  1. Identification of entities
    • 25 September 2006
    • Washington
    • White House
  2. Disambiguation
    • Washington: the state, the city, the jazz musician, etc?
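The two steps can be sketched in a few lines of Python. The gazetteer below is a hand-made toy and the disambiguation rule is deliberately naive (real NER services rely on statistical models trained on large corpora), but it makes the identification/disambiguation split concrete:

```python
# Step 1 (identification) and step 2 (disambiguation) of NER,
# using a toy gazetteer of candidate entities and their URLs.
GAZETTEER = {
    "Washington": [
        "https://en.wikipedia.org/wiki/Washington,_D.C.",
        "https://en.wikipedia.org/wiki/Washington_(state)",
        "https://en.wikipedia.org/wiki/Dinah_Washington",
    ],
    "White House": ["https://en.wikipedia.org/wiki/White_House"],
}

def identify(sentence):
    """Step 1: spot candidate entities by simple gazetteer lookup."""
    return [name for name in GAZETTEER if name in sentence]

def disambiguate(name, context):
    """Step 2: pick one URL per entity. A naive rule: if the
    'White House' is mentioned nearby, 'Washington' is the capital."""
    candidates = GAZETTEER[name]
    if name == "Washington" and "White House" in context:
        return candidates[0]  # Washington, D.C.
    return candidates[0]

sentence = "On 25 September 2006, we visited Washington to see the White House."
entities = identify(sentence)
links = {name: disambiguate(name, sentence) for name in entities}
```

A real service performs the same two steps, but uses the full context of the document rather than a hand-written rule.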

Disambiguation through URLs

What is actually identified by https://en.wikipedia.org/wiki/Jeff_Koons?

Do URLs identify only electronic resources or also real-world objects?

DBpedia takes a pragmatic approach to distinguish the two: http://dbpedia.org/resource/Jeff_Koons identifies the artist himself, while http://dbpedia.org/page/Jeff_Koons identifies the document about him.

When Jeff Koons is mentioned in a document, the resource URL is used. If you visit that URL in a browser, DBpedia cannot deliver Jeff Koons himself, so it redirects you to the page about him instead.
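DBpedia's convention can be sketched as two URL patterns. The helper functions below are illustrative, not part of any DBpedia API:

```python
# DBpedia separates the thing from the document about the thing:
# /resource/ identifies the real-world entity (non-information resource),
# /page/ identifies the HTML page describing it (information resource).
# Browsers requesting the former are redirected to the latter.
def resource_uri(name):
    return "http://dbpedia.org/resource/" + name.replace(" ", "_")

def page_url(name):
    return "http://dbpedia.org/page/" + name.replace(" ", "_")

thing = resource_uri("Jeff Koons")  # the artist himself
doc = page_url("Jeff Koons")        # the document about him
```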

[Magritte’s pipe]
©1948 René Magritte

NER services

[Black box.]
©1998 Ken Goldberg

Deciding which service to use—questions you should ask yourself:

How to represent the entity Library of Congress?

You can perform NER on your data directly from within OpenRefine

Configure the NER services

[the NER configuration dialog]

Start the NER process

[the NER menu]
[the NER dialog]
[results of NER]
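Behind the OpenRefine dialogs, a call to an NER service boils down to an HTTP request. A minimal sketch of how such a request could be built for DBpedia Spotlight's public annotation endpoint (the confidence threshold is an illustrative value; check the service's documentation for the full parameter list):

```python
# Build the URL for a DBpedia Spotlight annotation request.
# Sending it with an 'Accept: application/json' header returns
# the recognized entities, each linked to a DBpedia resource URI.
from urllib.parse import urlencode

def build_annotation_url(text, confidence=0.5):
    base = "https://api.dbpedia-spotlight.org/en/annotate"
    return base + "?" + urlencode({"text": text, "confidence": confidence})

url = build_annotation_url("We visited Washington to see the White House.")
```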

Topic Modeling (TM)

First step: extracting the clusters

Running a popular TM tool such as Mallet on a corpus of English historical newspapers from the period of the Second World War might result in the following clusters:

  1. potatoes, farmer, transport, hunger, corn
  2. alarm, bunker, explosion, siren, airplane
  3. doctor, nurse, bed, medicine, death

Second step: labeling the clusters

In order to identify an appropriate label for a cluster, a good understanding of the context of the corpus is often essential. For the three clusters from the previous slide, we could come up with these labels:

  1. food shortage
  2. airstrikes
  3. hospitalization

Does TM understand semantics?

The algorithm is constrained by the words used in the text; if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription of your dream of bear-fights and big caves, the algorithm will tell you nothing about your father and your mother; it’ll only tell you things about bears and caves. It’s all text and no subtext.

Scott B. Weingart, Topic Modeling for Humanists: A Guided Tour, 2012

Self-assessment 1: distant reading

The concept of distant reading

  1. …refers to the possibility to automatically create abstracts of books.
    • No, the concept refers to a generic approach of using quantitative methods to deal with large volumes of data and does not represent one specific method or technique.
  2. …implies we no longer need to actually read texts.
    • No, distant and close reading practices complement one another and are not mutually exclusive.
  3. …can be helpful to librarians.
    • Yes! Automated methods such as TM or NER can be applied to extract metadata from large corpora and create navigational paths for end-users.

Self-assessment 2: NER

What makes NER so interesting as an information extraction technique?

  1. Entities are identified by a URL.
    • Yes! The URL allows us to disambiguate the entity and to obtain information about it.
  2. It allows you to perform sentiment analysis on patron feedback.
    • No! Sentiment analysis uses other techniques.
  3. You can parse ambiguous dates into an ISO-standardized date format.
    • No! Certain scripts can do that, but they have nothing to do with NER.

Self-assessment 3: (non-)information

Why is it important to distinguish information and non-information resources on the Web?

  1. They have different metadata.
    • Yes! For example, a Web page about a person (information resource) has a different creation date than the date of creation of the actual individual (non-information resource), which would be his or her date of birth.
  2. It is actually not so important.
    • It is really important! Without this distinction, it becomes very difficult to refer to actual objects, people, ideas, etc. on the Web.
  3. It has an important impact on the design of URLs.
    • Yes, the URL needs to allow differentiating between the actual thing and documentation about the thing.

Self-assessment 4: Topic Modeling

When is Topic Modeling relevant for librarians?

  1. To extract key terms from metadata fields such as description.
    • No, TM requires a substantial volume of text and does not work for a couple of paragraphs.
  2. When you want to create an overview of recurring themes in a collection of scanned literature.
    • Yes, this would potentially be a good application for TM.
  3. When you want to identify all place names from a corpus.
    • No! NER is the application you need for this type of task.