This page contains all the necessary information to reproduce the results given in the ISWC’17 poster “Learning Semantic Relatedness from Human Feedback Using Relative Relatedness Learning” by Thomas Niebler, Martin Becker, Christian Pölitz and Andreas Hotho, all members of the DMIR group at the University of Würzburg.
Thomas maintains the code, while the DMIR group offers support.
In our work, we learned a semantic relatedness measure from human feedback using a metric learning approach. Human Intuition Datasets contain direct human judgments about the relatedness of words, i.e., human feedback. We exploit these datasets to learn a parameterization of the cosine measure with a metric learning approach based on relative distance comparisons. We validate our approach on several embedding datasets, which we either make public or link for download here.
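To illustrate what such a parameterization looks like, here is a minimal sketch (not the actual code from this repository) of a cosine measure parameterized by a positive semi-definite matrix W; the function name and the random example data are hypothetical:

```python
import numpy as np

def parameterized_cosine(x, y, W):
    """Cosine measure parameterized by a positive semi-definite matrix W:
    cos_W(x, y) = x^T W y / sqrt((x^T W x) * (y^T W y))."""
    num = x @ W @ y
    den = np.sqrt((x @ W @ x) * (y @ W @ y))
    return num / den

# hypothetical example: W = identity recovers the standard cosine measure
x, y = np.random.rand(100), np.random.rand(100)
W = np.eye(100)
print(parameterized_cosine(x, y, W))
```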
Furthermore, to the best of our knowledge, we were the first to explore the possibility of learning word embeddings from tagging data. We elaborated on this further in a different paper.
To calculate the tag co-occurrence graph as input for the GloVe algorithm, we applied the method presented in “Semantic Grounding of Tag Relatedness in Social Bookmarking Systems” by Cattuto et al.
More specifically, we used the post-based co-occurrence as described in Equation (1) of the linked paper: w(t1, t2) := |{(u, r) : t1, t2 ∈ T_ur}|. Here, T_ur := {t ∈ T : (u, t, r) ∈ Y}, i.e., all tags t which have been assigned to resource r by user u.
In src/embeddings/example_call.py, we provide an example of how to call the corresponding methods to construct the co-occurrence graph. The graph then needs to be saved to a file before the GloVe algorithm can be run on that file.
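For readers who just want the idea, here is a minimal sketch of post-based co-occurrence counting (not the API of example_call.py; the function name and toy input are hypothetical):

```python
from collections import Counter
from itertools import combinations

def post_cooccurrences(posts):
    """Count post-based tag co-occurrences: two tags co-occur once for
    every post (user, resource) whose tag set contains both of them.

    `posts` is an iterable of tag sets, one per (user, resource) pair."""
    counts = Counter()
    for tags in posts:
        for t1, t2 in combinations(sorted(tags), 2):
            counts[(t1, t2)] += 1
    return counts

# hypothetical toy input: three posts with their assigned tag sets
posts = [{"python", "code"}, {"python", "web", "code"}, {"web", "code"}]
for (t1, t2), w in post_cooccurrences(posts).items():
    print(t1, t2, w)
```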
RRL is inspired by the LSML metric learning algorithm. We built on the LSML implementation contained in the metric_learn python package.
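For reference, a minimal sketch of calling LSML via the metric_learn package (the quadruplet data here is random and purely illustrative, and the exact API may differ between metric-learn versions):

```python
import numpy as np
from metric_learn import LSML

# hypothetical random data: 50 quadruplets of 100-dimensional vectors;
# within each quadruplet, the first pair of points should end up closer
# under the learned metric than the second pair
rng = np.random.RandomState(42)
quadruplets = rng.rand(50, 4, 100)

lsml = LSML()
lsml.fit(quadruplets)

# the learned Mahalanobis matrix parameterizing the metric
M = lsml.get_mahalanobis_matrix()
print(M.shape)  # (100, 100)
```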
We used the published code of GloVe to create the tag embeddings of dimension 100, with the predefined parameter values alpha=0.75 and x_max=100.
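The following sketch shows one way to feed such a co-occurrence graph to the compiled glove binary with the parameters named above; the file names are hypothetical, and the binary record layout is GloVe's CREC format (note that GloVe's pipeline usually shuffles the co-occurrence file first):

```python
import struct
import subprocess

def write_glove_input(counts, vocab_path, cooc_path):
    """Write a co-occurrence Counter into GloVe's binary CREC format
    (int32 word1, int32 word2, float64 value; indices are 1-based and
    refer to line numbers in the vocab file)."""
    vocab = sorted({t for pair in counts for t in pair})
    index = {t: i + 1 for i, t in enumerate(vocab)}
    with open(vocab_path, "w") as f:
        for t in vocab:
            f.write(f"{t} 1\n")  # dummy frequency column
    with open(cooc_path, "wb") as f:
        for (t1, t2), w in counts.items():
            # store both directions, as GloVe expects a symmetric matrix
            f.write(struct.pack("iid", index[t1], index[t2], float(w)))
            f.write(struct.pack("iid", index[t2], index[t1], float(w)))

write_glove_input(post_cooccurrences(posts), "vocab.txt", "cooccurrences.bin")

# hypothetical call of the compiled glove binary with the paper's parameters
subprocess.run(["./glove", "-input-file", "cooccurrences.bin",
                "-vocab-file", "vocab.txt", "-save-file", "tag_vectors",
                "-vector-size", "100", "-alpha", "0.75", "-x-max", "100"],
               check=True)
```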
These are the datasets that we used for our experiments.
- **Delicious**: The Delicious tagging dataset is publicly available. The generated word embeddings are published in this repository.
- **BibSonomy**: The BibSonomy tagging data can be retrieved from the BibSonomy homepage. We also provide the generated word embeddings as a public download in this repository.
- **WikiGlove**: Pennington et al. made some of their vector collections publicly available. Specifically, we used the GloVe.6B vectors, which are generated from a 2014 Wikipedia dump and the Gigaword 5 corpus.
- **WikiNav**: The WikiNav vectors are publicly available at Wikimedia Research. Specifically, we used the 100-dimensional vectors from FigShare, created with data ranging from 2017-01-01 to 2017-01-31.
- **ConceptNet Numberbatch**: Finally, we applied our algorithm to the ConceptNet Numberbatch vectors, which currently yield state-of-the-art performance in a series of competitions.
The Human Intuition Datasets (HIDs) can be retrieved as preprocessed pandas-friendly csv files here or from the corresponding original locations.
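Since the files are pandas-friendly, loading one is a single call; the file name and column names below are hypothetical and may differ from the actual csv files:

```python
import pandas as pd

# hypothetical file name; expected columns: two words and a relatedness score
hid = pd.read_csv("wordsim353.csv")
print(hid.head())
```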