German Verb Subcategorisation Database extracted from MATE Dependency Parses

Using the SubCat-Extractor, we induced verb subcategorisation information from German MATE dependency parses. The subcategorisation database is represented in a compact but linguistically detailed and flexible format that records verb information, complement information and sentence information in a one-line-per-clause style. Examples of database entries are given in the paper referenced below.

As a natural next step, we induced a subcategorisation frame lexicon from the verb data. Taking voice into account, we summed over the various complement combinations each verb lemma appeared with. For example, among the most frequent subcategorisation frames for the verb glauben ‘believe’ are a subcategorised clause ‘believe that’ (freq: 52,710), a subcategorised prepositional phrase headed by the preposition an with accusative case ‘believe in’ (freq: 4,596) and an indirect object ‘trust s.o.’ (freq: 2,514). In addition, we took the actual complement heads into account. For example, among the most frequent combinations of heads that are subjects and indirect objects of glauben are ‘one, survey’ and ‘nobody, him’. By restricting attention to a specific complement type (e.g., the direct object within a transitive frame), we induced information that is relevant for collocation analyses. For example, among the most frequent indirect objects of glauben in a transitive frame are Wort ‘word’, Bericht ‘report’, and Aussage ‘statement’.
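The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tab-separated record layout, the frame labels (such as subj:pp-an-acc), the sample clauses and the function names are all hypothetical and do not reproduce the actual database format or the SubCat-Extractor output, which are documented in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical one-line-per-clause records (NOT the real database format).
# Assumed fields: verb lemma, voice, frame label, complement heads by function.
SAMPLE = """\
glauben\tactive\tsubj:clause\tsubj=man
glauben\tactive\tsubj:pp-an-acc\tsubj=er;pp=Zukunft
glauben\tactive\tsubj:iobj\tsubj=niemand;iobj=ihm
glauben\tactive\tsubj:clause\tsubj=sie
glauben\tactive\tsubj:iobj\tsubj=niemand;iobj=ihm
"""


def parse_line(line):
    """Split one assumed record into lemma, voice, frame and a head mapping."""
    lemma, voice, frame, heads = line.split("\t")
    head_map = dict(item.split("=", 1) for item in heads.split(";"))
    return lemma, voice, frame, head_map


def frame_lexicon(lines):
    """Sum over complement combinations per verb lemma, kept apart by voice."""
    lexicon = defaultdict(Counter)
    for line in lines:
        lemma, voice, frame, _ = parse_line(line)
        lexicon[lemma][(voice, frame)] += 1
    return lexicon


def head_combinations(lines, lemma, functions):
    """Count co-occurring complement heads for the given functions of a verb."""
    counts = Counter()
    for line in lines:
        verb, _, _, heads = parse_line(line)
        if verb == lemma and all(f in heads for f in functions):
            counts[tuple(heads[f] for f in functions)] += 1
    return counts


lines = SAMPLE.splitlines()
lexicon = frame_lexicon(lines)
subj_iobj = head_combinations(lines, "glauben", ("subj", "iobj"))

# Most frequent frames and subject/indirect-object head pairs for glauben.
print(lexicon["glauben"].most_common())
print(subj_iobj.most_common())
```

Passing a single function, for example ("iobj",), to head_combinations yields the kind of per-complement head counts that a collocation analysis would start from.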

So far, we have applied the SubCat-Extractor to the German web corpus SdeWaC (Faaß and Eckart, 2013), which contains approx. 880 million words, and a Wikipedia dump from April 10, 2011, containing approx. 430 million words.

See here on how to obtain the data.


Reference:

Silke Scheible, Sabine Schulte im Walde, Marion Weller, Max Kisselew (2013)
A Compact but Linguistically Detailed Database for German Verb Subcategorisation relying on Dependency Parses from a Web Corpus: Tool, Guidelines and Resource
In: Proceedings of the 8th Web as Corpus Workshop. Lancaster, UK.