Learning Word Embeddings from Tagging Data: A Methodological Comparison

This repository contains the generated embeddings from the KDML submission “Learning Word Embeddings from Tagging Data: A Methodological Comparison” by Thomas Niebler, Luzian Hahn and Andreas Hotho.

Overview

In our work, we compared three embedding algorithms, Word2Vec, GloVe and LINE, with regard to their applicability to tagging data from folksonomies and the semantic quality of the embeddings they produce.

Reference Implementations

We used the following embedding algorithm implementations in our work:

Datasets

Generated Vector Embeddings

We published the generated embeddings for each tagging dataset in the embeddings directory of this repository.
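As a starting point for working with the published vectors, the following minimal sketch parses an embedding file and compares two tags by cosine similarity. The file layout is an assumption, not something this README specifies: one tag per line, followed by its whitespace-separated vector components, optionally preceded by a "count dimension" header line. The function names and the sample tags are hypothetical.

```python
import io
import math


def load_embeddings(lines):
    """Parse 'tag v1 v2 ... vn' lines into a dict mapping tag -> list of floats.

    ASSUMPTION: whitespace-separated plain text. Lines with fewer than three
    fields (blank lines, a 'count dimension' header) are skipped, so
    one-dimensional vectors are not supported by this sketch.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Tiny in-memory stand-in for a real file; the tags and values are made up.
sample = io.StringIO(
    "3 2\n"
    "python 1.0 0.0\n"
    "programming 0.8 0.6\n"
    "recipes 0.0 1.0\n"
)
emb = load_embeddings(sample)
print(cosine(emb["python"], emb["programming"]))
```

For a real file, pass an open file handle instead of the `io.StringIO` object. If the files turn out to be in word2vec text format, gensim's `KeyedVectors.load_word2vec_format` is an alternative to the hand-written parser.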

Original tagging data

Delicious

The Delicious tagging dataset is publicly available.

BibSonomy

The BibSonomy tagging data can be retrieved from the BibSonomy homepage.

CiteULike

The CiteULike tagging data can be retrieved from CiteULike.

Human Intuition Datasets

The Human Intuition Datasets (HIDs) can be retrieved as preprocessed, pandas-friendly CSV files here or from the corresponding original locations.
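Loading one of the preprocessed files with pandas might look like the sketch below. The column names `word1`, `word2` and `score` are assumptions for illustration; check the header of the actual CSV files. The rows are illustrative and not copied from the published datasets.

```python
import io

import pandas as pd

# In-memory stand-in for a HID file. ASSUMPTION: one word pair per row plus a
# human-assigned relatedness score; the real column names may differ.
sample_csv = io.StringIO(
    "word1,word2,score\n"
    "tiger,cat,7.4\n"
    "book,paper,7.5\n"
    "king,cabbage,0.2\n"
)

hid = pd.read_csv(sample_csv)

# Word pairs as tuples, e.g. for looking up the corresponding tag vectors.
pairs = list(zip(hid["word1"], hid["word2"]))

# The pair that the human raters judged most related.
top = hid.sort_values("score", ascending=False).iloc[0]
print(top["word1"], top["word2"], top["score"])
```

To read a downloaded file, replace `sample_csv` with its path.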

Results not included in the paper

The following sections present results for the experiments performed in the paper, evaluated here on WS353, MTurk and Bib100.
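A common way to score embeddings against a human intuition dataset is the Spearman rank correlation between the cosine similarity of each word pair and its human rating. The sketch below illustrates that protocol under the assumption that it matches the evaluation used here, which this README does not spell out. Skipping pairs that lack a vector is a further assumption, and all names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate(embeddings, rated_pairs):
    """Return (Spearman correlation, number of pairs actually evaluated)."""
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        # ASSUMPTION: pairs with an out-of-vocabulary word are skipped.
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)


# Made-up vectors and ratings, only to exercise the functions.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "missing", 4.0)]
rho, n_pairs = evaluate(toy_embeddings, toy_pairs)
print(rho, n_pairs)
```

If SciPy is available, `scipy.stats.spearmanr` can replace the hand-written `spearman` function.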

Word2Vec

WordSimilarity-353

MTurk

Bib100

GloVe

WordSimilarity-353

MTurk

Bib100

LINE

WordSimilarity-353

MTurk

Bib100