SSiW


Semantic Change Corpus Cleaning and Evaluation

Research on the automatic detection of semantic change has become a popular topic in computational linguistics, motivated either by expected performance improvements in practical NLP applications or by theoretical interest in language and cultural change. A major obstacle to the computational modeling of semantic change, however, is the lack of resources for evaluation. In addition to computational models of semantic change, we have developed a number of such evaluation resources.


German Diachronic Metaphor Annotation Dataset

Metaphoric change plays a fundamental role in semantic change. We introduce a resource for the evaluation of computational models of metaphoric change and propose a structured annotation process that is generalisable to the creation of gold standards for other types of semantic change.

For 560 context pairs, two annotators judged whether one of the contexts admitted the inference of a meaning of the target word that is metaphorically related to its meaning in the other context.

See here on how to obtain the data.

Reference:

Dominik Schlechtweg, Stefanie Eckmann, Enrico Santus, Sabine Schulte im Walde, Daniel Hole (2017)
German in Flux: Detecting Metaphoric Change via Word Entropy
In: Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL). Vancouver, Canada.


DURel - Annotation of Diachronic Usage Relatedness

We extend a framework of synchronic polysemy annotation to the annotation of Diachronic Usage Relatedness (DURel). DURel has a strong theoretical basis and at the same time makes use of established synchronic procedures that rely on the intuitive notion of semantic relatedness.

DURel distinguishes between innovative and reductive meaning changes with high interannotator agreement. The resulting test set for German comprises ratings from five annotators for the relatedness of 1,320 use pairs across 22 target words.
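
As an illustration of how agreement among several annotators on such relatedness ratings could be quantified, the following minimal sketch computes the mean pairwise Spearman correlation. This is one common choice and is not stated here to be the measure used for DURel. The function names, the toy ratings, and the 1-4 rating scale are assumptions made for the example; none of them are taken from the dataset.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(ratings):
    """ratings: dict mapping annotator -> list of ratings aligned by use pair."""
    pairs = combinations(sorted(ratings), 2)
    return mean(spearman(ratings[a], ratings[b]) for a, b in pairs)


# Hypothetical toy ratings on an assumed 1-4 scale (not real DURel data).
toy_ratings = {
    "annotator_1": [4, 3, 1, 2, 4, 1],
    "annotator_2": [4, 2, 1, 2, 3, 1],
    "annotator_3": [3, 3, 1, 1, 4, 2],
}
agreement = mean_pairwise_spearman(toy_ratings)
print(round(agreement, 3))
```

The sketch assumes that every annotator rated every use pair. Real annotation data may contain missing judgments, in which case each pairwise correlation would have to be restricted to the use pairs both annotators rated.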

See here on how to obtain the data.

Reference:

Dominik Schlechtweg, Sabine Schulte im Walde, Stefanie Eckmann (2018)
Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change
In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). New Orleans, LA.


DiaWUG: Diatopic Word Usage Graphs for Spanish

This data collection contains diatopic Word Usage Graphs (WUGs) for Spanish. Find a description of the data format, code to process the data and further datasets on the WUGsite.

See here on how to obtain the data.

Reference:

Gioia Baldissin, Dominik Schlechtweg, Sabine Schulte im Walde (2022)
DiaWUG: A Dataset for Diatopic Lexical Semantic Variation in Spanish
In: Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC). Marseille, France.


Dataset of Grammaticalisation for German Prepositions

We developed a test set containing 206 prepositions with four degrees of grammaticalisation (1: low -- 4: high). The test set distinguishes between
  1. prepositions with the form of a content word (e.g., trotz),
  2. prepositions with the form of a syntactic structure (e.g., am Rande), and
  3. prepositions with the form of a function word (e.g., vor).
Prepositions of types 1 and 2 show a low to medium degree of grammaticalisation, while those of type 3 show a high degree.

See here on how to obtain the data.

Reference:

Dominik Schlechtweg, Sabine Schulte im Walde (2018)
Distribution-based Prediction of the Degree of Grammaticalization for German Prepositions
In: Proceedings of the Evolution of Language International Conference (EvoLang XII). Torun, Poland.


CCOHA: Clean Corpus of Historical American English

The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies of English. We cleaned the corpus to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus, CCOHA, additionally contains a larger number of cleaned word tokens, which can offer better insights into language change and allow a larger variety of tasks to be performed.

See here on how to obtain the data.

Reference:

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde (2020)
CCOHA: Clean Corpus of Historical American English
In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC). Marseille, France.