SSiW


Terminology Extraction

Terms are linguistic units that characterise a specific topic domain. Not only is the automatic extraction of terms a challenging task; defining and identifying them manually is difficult as well. This raises the question of whether there is a common, natural understanding of what constitutes a term, and to what extent a given term is associated with a domain. We have developed a number of resources for terminology extraction, as presented below.


Laypeople Study on Term Annotation

In contrast to previous annotation studies, we collected judgements from laypeople rather than experts, and we focus on analysing their (dis-)agreements regarding common assumptions and core issues in term identification: the word classes of terms, the identification of ambiguous terms, and the relations between complex terms and the subterms they may contain. To ensure a broad understanding of term identification, we designed four different tasks to address the granularities of term concepts, and we performed all annotations across four different domains in German: DIY, cooking, hunting, and chess.

See here for how to obtain the data.

Reference:

Anna Hätty, Sabine Schulte im Walde (2018a)
A Laypeople Study on Terminology Identification across Domains and Task Definitions
In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). New Orleans, LA.


Fine-grained Compound Termhood Annotation

We consider term difficulty as part of a tier model for a term's strength of association with a domain. This model should naturally align with the idea of a gradual increase in a term's domain specificity: the more difficult or specialised a term is, the more distinct it is from general language and the more strongly it is associated with a domain. Terms that are both general and easy to understand are sometimes hard to distinguish from general-language words. Thus, the more expert knowledge is needed to understand a term, the more strongly it should be associated with a domain. In this dataset, we distinguish four tiers, according to which five human judges annotated 396 German compounds from the cooking domain.

See here for how to obtain the data.
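For readers who want to explore tier annotations of this kind, the following is a minimal sketch of how agreement among a fixed number of judges could be measured and how a single tier per compound could be derived. It makes several assumptions that are not taken from the dataset or the paper: the tiers are encoded as the integers 1 to 4, each compound comes with one label per judge, Fleiss' kappa is used as the agreement measure, and ties in the majority vote are resolved towards the lower tier. The function names and the toy data are hypothetical.

```python
from collections import Counter

# Assumption: tiers are encoded as integers 1..4; the released data may use other labels.
TIERS = (1, 2, 3, 4)


def fleiss_kappa(ratings, categories=TIERS):
    """Fleiss' kappa for items that were each labelled by the same number of judges.

    ratings: list of items, each item a list with one label per judge.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of judgements")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing judge pairs, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall distribution of labels.
    proportions = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:  # all judgements fall into a single category
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def majority_tier(labels):
    """Most frequent tier; ties are resolved towards the lower tier (an assumption)."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(tier for tier, n in counts.items() if n == best)


if __name__ == "__main__":
    # Invented toy judgements from five judges; not taken from the dataset.
    toy = {
        "compound_a": [1, 1, 1, 2, 1],
        "compound_b": [3, 4, 4, 4, 4],
        "compound_c": [2, 2, 3, 3, 2],
    }
    print("kappa:", round(fleiss_kappa(list(toy.values())), 3))
    for name, labels in toy.items():
        print(name, "->", majority_tier(labels))
```

Because the tiers are ordered, a weighted measure that penalises distant disagreements less than adjacent ones may be a better fit than Fleiss' kappa, which treats the tiers as unordered categories.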

Reference:

Anna Hätty, Sabine Schulte im Walde (2018b)
Fine-grained Termhood Prediction for German Compound Terms using Neural Networks
In: Proceedings of the COLING Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG). Santa Fe, NM.