Prof. Dr. Sabine Schulte im Walde

Terminology Extraction and Annotation

Terms are linguistic units which characterise a specific topic domain. Not only is the automatic extraction of terms a challenging task, but so are their manual definition and identification. This raises the question of whether there is a common, natural understanding of what constitutes a term, and to what extent a term is associated with a domain. We have developed a number of resources for terminology extraction, as presented below.

Laypeople Study on Term Annotation

In contrast to previous annotation studies, we collected judgements from laypeople rather than experts, and focus on analysing their (dis-)agreements on common assumptions and core issues in term identification: the word classes of terms, the identification of ambiguous terms, and the relations between complex terms and the subterms they may contain. To ensure a broad understanding of term identification, we designed four different tasks to address the granularities of term concepts, and we performed all annotations across four different domains in German: DIY, cooking, hunting and chess.

See here on how to obtain the data.

Reference:

Anna Hätty, Sabine Schulte im Walde (2018)
A Laypeople Study on Terminology Identification across Domains and Task Definitions
In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). New Orleans, LA.

SURel - Annotation of Synchronic Meaning Shifts and Incorporation into Term Extraction

We introduce SURel, a novel dataset for German with human-annotated meaning shifts between general-language and domain-specific contexts. We show that meaning shifts of term candidates cause errors in term extraction, and demonstrate that the SURel annotation reflects these errors. Furthermore, we illustrate that SURel enables us to assess optimisations of term extraction techniques when incorporating meaning shifts.

See here on how to obtain the data.

Reference:

Anna Hätty, Dominik Schlechtweg, Sabine Schulte im Walde (2019)
SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction
In: Proceedings of the 8th Joint Conference on Lexical and Computational Semantics (*SEM). Minneapolis, MN, USA.

Domain-Specific Cooking Compound Termhood Annotation

We consider term difficulty as part of a tier model for a term's strength of association with a domain. It should naturally align with the idea of a gradual increase of term specificity to the domain: the more difficult or specialised a term is, the more distinct it is from general language and the more strongly it is associated with a domain. If terms are both general and understandable, it is sometimes hard to distinguish them from general-language words. Thus, the more expert knowledge is needed to understand a term, the more strongly it should be associated with a domain. In this dataset, we distinguish four tiers, according to which five human judges annotated 396 German compounds from the cooking domain.
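As a rough illustration of how agreement in a setup like this (several judges, a small fixed set of tiers) could be quantified, the following Python sketch computes Fleiss' kappa. This is not necessarily the measure used in the referenced paper; the function name, the tier labels 1 to 4 and the toy ratings are hypothetical and do not come from the dataset.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each rated by the same number of judges.

    ratings    -- list of items; each item is a list of labels, one per judge
    categories -- all labels that judges could assign (here: the tiers)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # How often each category was chosen for each item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing judge pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    total = n_items * n_raters
    p_cats = [sum(c[cat] for c in counts) / total for cat in categories]
    p_expected = sum(p ** 2 for p in p_cats)

    if p_expected == 1.0:
        # All ratings fall into one category; kappa is undefined in this case,
        # and this sketch simply reports full agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: three compounds, five judges, tiers 1 to 4.
toy_ratings = [
    [1, 1, 1, 1, 2],
    [3, 3, 3, 4, 4],
    [2, 2, 2, 2, 2],
]
print(fleiss_kappa(toy_ratings, categories=[1, 2, 3, 4]))
```

Fleiss' kappa treats the tiers as unordered categories; since the tiers here are graded, an ordinal measure may be more appropriate in practice.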

See here on how to obtain the data.

Reference:

Anna Hätty, Sabine Schulte im Walde (2018)
Fine-grained Termhood Prediction for German Compound Terms using Neural Networks
In: Proceedings of the COLING Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG). Santa Fe, NM.

Domain-Specific Compounds Difficulty Ratings

The dataset contains difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts in the domains do-it-yourself (DIY), cooking and automotive. It includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive. The compounds were identified in text using the Simple Compound Splitter (Weller-Di Marco, 2017); a subset was filtered and balanced according to frequency and productivity criteria as a basis for manual annotation and fine-grained interpretation. The final dataset was annotated with ratings from 20 annotators.

See here on how to obtain the data.

Reference:

Julia Bettinger, Anna Hätty, Michael Dorna, Sabine Schulte im Walde (2020)
A Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds in the Domains DIY, Cooking and Automotive
In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC). Marseille, France.
