German clause-embedding predicates: an extraction and classification approach

This thesis describes a semi-automatic approach to analysing the subcategorisation properties of verbal, nominal and multiword predicates in German. We extract predicates together with their complements from German corpora and classify them semi-automatically according to their subcategorisation properties. In this work, we concentrate exclusively on sentential complements, such as dass-, ob- and w-clauses, although our methods can also be applied to other complement types. Our aim is not only to extract and classify predicates but also to compare the subcategorisation properties of morphologically related predicates, such as verbs and their nominalisations. It is usually assumed that nominalisations inherit their subcategorisation properties from their underlying verbs. However, our tests show that different types of relations hold between them. We therefore review the subcategorisation properties of morphologically related words and analyse their correspondences and differences. For this purpose, we develop a set of semi-automatic procedures that allow us not only to classify the extracted units according to their subcategorisation properties, but also to compare the properties of verbs and their nominalisations, which occur both freely in corpora and within multiword expressions. The lexical data are created to serve symbolic NLP, especially large symbolic grammars for deep processing, such as HPSG or LFG, cf. work in the LinGO project (Copestake et al. 2004) and the Pargram project (Butt et al. 2002). HPSG and LFG need detailed linguistic knowledge. In addition, subcategorisation information can be used in information extraction (IE) applications, cf. (Surdeanu et al. 2003). Moreover, this information is necessary for linguistic, lexicographic, SLA and translation work. Our extraction and classification procedures are precision-oriented, which means that we focus on high accuracy of our extraction and classification results.
High precision comes at the cost of completeness, which we compensate for by applying the extraction procedures to larger corpora.

The present work describes an approach to the semi-automatic analysis of German predicates. Verbs, nouns and multiword expressions (MWEs) are automatically extracted from corpora and classified according to their valency properties. In this work we consider only sentential complements, although the method is also suitable for extracting other complement types. Besides the subcategorisation-based classification, we also want to compare the properties of morphologically related predicates (e.g. verbs and their nominalisations). Most approaches generally assume that nominalisations take over or inherit their valency properties from their base verbs. Nevertheless, our extraction experiments show that this assumption does not always hold. This work is therefore concerned with comparing the valency properties of verbs and nominalisations and with analysing their correspondences and differences. To this end, we design a semi-automatic procedure for extracting and classifying the valency properties of German predicates, as well as the relations between the valency properties of verbs and their nominalisations. The extracted data can be used in symbolic NLP systems, especially for the symbolic grammar theories LFG and HPSG. Detailed lexical information is very important for these grammars. In addition, information on subcategorisation is necessary for linguistics, lexicography and multilingual approaches, e.g. foreign language teaching or translation. Our aim is to achieve higher precision in the extraction and classification results. Their completeness is thus neglected, which we intend to compensate for through the number of corpora used.
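The classification and comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the input pairs, the function names `classify` and `compare`, and the frequency threshold `min_count` are all assumptions made for illustration. The sketch assigns each predicate the set of sentential complement types (dass, ob, w) observed at least `min_count` times, which mimics a precision-oriented filter, and then contrasts a verb with its nominalisation.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (predicate lemma, complement type) pairs, as an
# extraction step might produce them. These are not data from the thesis.
OBSERVATIONS = [
    ("fragen", "ob"), ("fragen", "ob"), ("fragen", "w"), ("fragen", "w"),
    ("Frage", "ob"), ("Frage", "ob"), ("Frage", "w"), ("Frage", "w"),
    ("behaupten", "dass"), ("behaupten", "dass"), ("behaupten", "ob"),
    ("Behauptung", "dass"), ("Behauptung", "dass"),
    ("wissen", "dass"), ("wissen", "dass"), ("wissen", "ob"),
    ("wissen", "ob"), ("wissen", "w"), ("wissen", "w"),
]


def classify(observations, min_count=2):
    """Map each predicate to the set of complement types seen >= min_count times.

    The threshold is an assumed stand-in for a precision-oriented filter:
    rare (possibly noisy) predicate/complement pairs are discarded.
    """
    counts = defaultdict(Counter)
    for predicate, complement in observations:
        counts[predicate][complement] += 1
    return {
        predicate: frozenset(c for c, n in by_type.items() if n >= min_count)
        for predicate, by_type in counts.items()
    }


def compare(classes, verb, noun):
    """Contrast the complement classes of a verb and its nominalisation."""
    v = classes.get(verb, frozenset())
    n = classes.get(noun, frozenset())
    return {"shared": v & n, "verb_only": v - n, "noun_only": n - v}


classes = classify(OBSERVATIONS)
print(sorted(classes["behaupten"]))
print(compare(classes, "fragen", "Frage"))
```

In this toy data the single occurrence of behaupten with an ob-clause falls below the threshold and is dropped, which shows how such a filter trades completeness for precision; running the same counts over a larger corpus is one way to recover rarer but genuine frames.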

[1]  Martin Forst Treebank Conversion - Establishing a testsuite for a broad-coverage LFG from the TIGER treebank , 2003, LINC@EACL.

[2]  C. Blanche-Benveniste,et al.  Syntaxe et Mécanismes Descriptifs: Présentation de l'approche pronominale , 1978 .

[3]  Sabine Bartsch Structural and functional properties of collocations in English : a corpus study of lexical and pragmatic constraints on lexical co-occurrence , 2004 .

[4]  Helmut Schumacher,et al.  Lucien Tesnière--syntaxe structurale et opérations mentales : Akten des deutsch-französischen Kolloquiums anläßlich der 100. Wiederkehr seines Geburtstages Strasbourg 1993 , 1996 .

[5]  Jan Hajic,et al.  The Prague Dependency Treebank , 2003 .

[6]  J. Mackenzie Nouns are avalent - and nominalizations too , 1997 .

[7]  Mary L. Nunes,et al.  Argument linking in English derived nominals , 1992 .

[8]  Anoop Sarkar,et al.  Automatic Extraction of Subcategorization Frames for Czech , 2000, COLING.

[9]  D. Bourigault,et al.  Syntex, analyseur syntaxique de corpus , 2005 .

[10]  Neal R. Norrick,et al.  Syntaktische Aspekte der Phraseologie I: Valenztheoretische Ansätze , 2007 .

[11]  Alexandra Kinyon,et al.  Identifying Verb Arguments and their Syntactic Function in the Penn Treebank , 2002, LREC.

[12]  Violeta Seretan,et al.  Collocation extraction based on syntactic parsing , 2008 .

[13]  Stefan J. Schierholz Präpositionalattribute : Syntaktische und semantische Analysen , 2001 .

[14]  Kerstin Schwabe,et al.  On the Semantics of German Declarative and Interrogative Root and Complement Clauses , 2006 .

[15]  Ted Briscoe,et al.  Automatic Extraction of Subcategorization from Corpora , 1997, ANLP.

[16]  Daniel Gildea,et al.  The Proposition Bank: An Annotated Corpus of Semantic Roles , 2005, CL.

[17]  Zygmunt Vetulani,et al.  Verb-Noun Collocation SyntLex Dictionary: Corpus-Based Approach , 2008, LREC.

[18]  Jeroen Groenendijk,et al.  On the semantics of questions and the pragmatics of answers , 1984 .

[19]  Ted Briscoe,et al.  Robust Accurate Statistical Annotation of General Text , 2002, LREC.

[20]  Manfred Sailer,et al.  Cranberry Expressions in English and in German , 2008 .

[21]  H. Schumacher,et al.  Kleines Valenzlexikon deutscher Verben , 1980 .

[22]  Ulrich Heid,et al.  Providing corpus data for a dictionary for German juridical phraseology , 2008, KONVENS.

[23]  Ralph Grishman,et al.  Designing a dictionary of derived nominals , 2000 .

[24]  Martha Palmer,et al.  Class-Based Construction of a Verb Lexicon , 2000, AAAI/IAAI.

[26]  Judith Eckle-Kohler,et al.  Methods for quality assurance in semi-automatic lexicon acquisition from corpora , 1998 .

[27]  Ulrich Heid,et al.  SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection , 2004, LREC.

[28]  Judith N. Levi,et al.  The syntax and semantics of complex nominals , 1978 .

[29]  Alfa-Informatica,et al.  Corpus-based acquisition of collocational prepositional phrases , 2022 .

[30]  Valeria de Paiva,et al.  Deverbal Nouns in Knowledge Representation , 2006, FLAIRS Conference.

[31]  Paula Chesley,et al.  Automatic extraction of subcategorization frames for French , 2006, LREC.

[32]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[33]  Martha Palmer,et al.  Consistent Criteria for Sense Distinctions , 2000, Comput. Humanit..

[34]  Gerhard Helbig,et al.  Deutsche Grammatik : ein Handbuch für den Ausländerunterricht , 2001 .

[35]  Laurie Bauer,et al.  Introducing Linguistic Morphology , 1988 .

[36]  Peter Matthews,et al.  The scope of valency in grammar , 2007 .

[37]  Ralph Grishman,et al.  NOMLEX: a lexicon of nominalizations , 1998 .

[38]  Michael R. Brent,et al.  From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax , 1993, Comput. Linguistics.

[39]  Katrin Götz-Votteler,et al.  Valency issues in FrameNet , 2007 .

[40]  Miriam Butt,et al.  The Parallel Grammar Project , 2002, COLING 2002.

[41]  Thomas Herbst,et al.  Valency data for Natural Language Processing: What can the Valency Dictionary of English provide? , 2007 .

[42]  Caroline Sporleder,et al.  Semantic Role Assignment for Event Nominalisations by Leveraging Verbal Data , 2008, COLING.

[43]  Sussi Olsen Some aspects of the syntactic encoding of nouns in a computational lexicon: the STO project , 2002 .

[44]  Stephan Oepen,et al.  A Lexicon Module for a Grammar Development Environment , 2004, LREC.

[45]  Josef Ruppenhofer,et al.  FrameNet: Theory and Practice , 2003 .

[46]  Richard Montague,et al.  The Proper Treatment of Quantification in Ordinary English , 1973 .

[47]  Charles J. Fillmore,et al.  THE CASE FOR CASE. , 1967 .

[48]  W. Busse,et al.  Französisches Verblexikon : die Konstruktion der Verben im Französischen , 1983 .

[49]  H. Burger Phraseologie : eine Einführung am Beispiel des Deutschen , 1998 .

[50]  Suzanne Stevenson,et al.  Automatic Verb Classification Based on Statistical Distributions of Argument Structure , 2001, CL.

[51]  Andy Way,et al.  Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks , 2005, Computational Linguistics.

[52]  Gerhard Helbig,et al.  Wörterbuch zur Valenz und Distribution deutscher Verben , 1975 .

[53]  Sabine Schulte im Walde Experiments on the Automatic Induction of German Semantic Verb Classes , 2006, CL.

[54]  Ralph Grishman,et al.  Standardization of the Complement/Adjunct Distinction , 1996 .

[55]  Sven Hartrumpf,et al.  The semantically based computer lexicon HaGenLex. Structure and technological environment , 2003 .

[56]  Hang Li,et al.  Generalizing Case Frames Using a Thesaurus and the MDL Principle , 1995, CL.

[58]  Ulrich Heid,et al.  Extraction tools for collocations and their morphosyntactic specificities , 2006, LREC.

[59]  Stefan J. Schierholz Valenzwörterbücher für Substantive , 2005 .

[60]  Fabienne Fritzinger,et al.  Pattern-Based Extraction of Negative Polarity Items from Dependency-Parsed Text , 2010, LREC.

[61]  Sabine Schulte im Walde The induction of verb frames and verb classes from corpora , 2009 .

[62]  Christiane Fellbaum,et al.  Corpus-based Studies of German Idioms and Light Verbs , 2006 .

[63]  John A. Carroll,et al.  The Automatic Acquisition of Verb Subcategorisations and Their Impact on the Performance of an HPSG Parser , 2004, IJCNLP.

[64]  María Begoña Villada Moirón,et al.  Data-driven identification of fixed expressions and their modifiability , 2005 .

[65]  Thomas Herbst,et al.  Valency: Theoretical, Descriptive and Cognitive Issues , 2007 .

[66]  Vito Pirrelli,et al.  Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora , 2008, LREC.

[67]  C. L. Hamblin QUESTIONS IN MONTAGUE ENGLISH , 1976 .

[68]  F. Grossmann,et al.  Les collocations: analyse et traitement. , 2003 .

[69]  Pierrette Bouillon,et al.  Compound Nouns in a Unification-Based MT System , 1992, ANLP.

[70]  Afsaneh Fazly,et al.  Automatically Constructing a Lexicon of Verb Phrase Idiomatic Combinations , 2006, EACL.

[71]  Zellig S. Harris,et al.  Mathematical structures of language , 1968, Interscience tracts in pure and applied mathematics.

[72]  Ulrich Heid,et al.  A Dutch Chunker as a Basis for the Extraction of Linguistic Knowledge , 2002, CLIN.

[73]  Kerstin Schwabe,et al.  Semantic Characterizations of German Question-Embedding Predicates , 2007, TbiLLC.

[74]  Ekaterina Lapshinova-Koltunski,et al.  Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds , 2008, LREC.

[75]  Thierry Poibeau LexSchem: A Large Subcategorization Lexicon for , 2008 .

[76]  Ralph Grishman,et al.  Comlex Syntax: Building a Computational Lexicon , 1994, COLING.

[77]  L. Karttunen Syntax and Semantics of Questions , 1977 .

[78]  Angelika Storrer,et al.  Corpus-based Investigations on German Support Verb Constructions , 2007 .

[79]  Serena Villata,et al.  Automatic extraction of subcategorization frames for Italian , 2008, LREC.

[80]  D. J. Allerton,et al.  Valency and the English verb , 1982 .

[81]  Gerhard Helbig Probleme der Valenz- und Kasustheorie , 1992 .

[82]  J. Bresnan Lexical-Functional Syntax , 2000 .

[83]  L. Eichinger,et al.  Valency and Semantic Roles: the Concept of Deep Structure Case , 2003 .

[84]  Timothy Baldwin,et al.  Multiword Expressions: A Pain in the Neck for NLP , 2002, CICLing.

[85]  Veronika Ehrich,et al.  Sortale Bedeutung und Argumentstruktur: ung-Nominalisierungen im Deutschen , 2000 .

[86]  Rudolf Emons Valenzen englischer Prädikatsverben , 1974 .

[87]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[88]  R. Camacho,et al.  The argument structure of deverbal nouns in Brazilian Portuguese , 2004 .

[89]  Lorelies Ortner,et al.  Substantivkomposita: (Komposita und kompositionsähnliche Strukturen 1) , 1991 .

[90]  Alex Waibel,et al.  The Automatic Acquisition of Frequencies of Verb Subcategorization Frames from Tagged Corpora , 2002 .

[91]  Zygmunt Vetulani,et al.  Towards a Lexicon-Grammar of Polish: Extraction of Verbo-Nominal Collocations from Corpora , 2007, FLAIRS Conference.

[92]  Thierry Poibeau,et al.  Do we Still Need Gold Standards for Evaluation? , 2008, LREC.

[93]  Gerhard Helbig,et al.  Wörterbuch zur Valenz und Distribution deutscher Verben , 1969 .

[94]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[95]  Michael R. Brent Automatic Semantic Classification Of Verbs From Their Syntactic Contexts: An Implemented Classifier For Stativity , 1991, EACL.

[96]  Khurshid Ahmad,et al.  The head-modifier principle and multilingual term extraction , 2005, Natural Language Engineering.

[97]  John B. Lowe,et al.  The Berkeley FrameNet Project , 1998, ACL.

[98]  Ivan A. Sag,et al.  Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar , 1996, CL.

[99]  Wolfgang Lezius,et al.  IMSLex – Representing Morphological and Syntactic Information in a Relational Database , 2000 .

[100]  Mats Rooth,et al.  Valence Induction with a Head-Lexicalized PCFG , 1998, EMNLP.

[101]  Sabine Schulte im Walde A Subcategorisation Lexicon for German Verbs induced from a Lexicalised PCFG , 2002, LREC.

[102]  M. Baltin,et al.  The Mental representation of grammatical relations , 1985 .

[103]  Charles J. Fillmore,et al.  The Structure of the Framenet Database , 2003 .

[104]  Sanda M. Harabagiu,et al.  Using Predicate-Argument Structures for Information Extraction , 2003, ACL.

[105]  Rainer Osswald,et al.  Die Verwendung von GermaNet zur Pflege und Erweiterung des Computerlexikons HaGenLex , 2004, LDV Forum.

[106]  Frank Richter,et al.  Cranberry Words in Formal Grammar , 2002 .

[107]  Christopher D. Manning Automatic Acquisition of a Large Sub Categorization Dictionary From Corpora , 1993, ACL.

[108]  David Heath,et al.  A valency dictionary of English: a corpus-based analysis of the complementation patterns of English verbs, nouns and adjectives , 2004 .

[109]  Helmut Schumacher,et al.  VALBU : Valenzwörterbuch deutscher Verben , 2004 .

[110]  Ulrich Heid,et al.  Tools for Collocation Extraction: Preferences for Active vs. Passive , 2008, LREC.

[111]  P. Hanks,et al.  German Light Verb Constructions in Corpora and Dictionaries , 2006 .

[112]  Stephen Shiffer,et al.  Language-Created Language-Independent Entities , 1996 .

[113]  Laurie Bauer,et al.  English Word-Formation: Frontmatter , 1983 .

[114]  Paul Grebe,et al.  Duden Grammatik der deutschen Gegenwartssprache , 1973 .

[115]  Zygmunt Vetulani,et al.  Syntactic Lexicon of Polish Predicative Nouns , 2006, LREC.

[116]  Beth Levin,et al.  English Verb Classes and Alternations: A Preliminary Investigation , 1993 .