Automatic Extraction of Subcategorization Frames from Corpora: Improving Filtering with Diathesis Alternations

Attempts to extract subcategorization information from textual corpora by shallow parsing followed by statistical filtering of alternatives proposed for specific predicates have met with some success (Briscoe & Carroll, 1997) but are not yet accurate enough. Examination of the errors suggests that the filtering of spurious hypotheses is the source of most errors in the system. This paper builds on the framework described in Briscoe & Carroll (1997) and proposes a knowledge-based approach for improving the filtering phase of the system.

1 Background

Manual development of large subcategorized lexicons has proved very difficult because predicates change behaviour between sublanguages, domains and across time. Yet current parsers depend crucially on such information, and probabilistic parsers would greatly benefit from accurate information concerning the relative likelihood of different subcategorization frames of a given predicate. This suggests that automatic construction of subcategorization dictionaries from textual corpora is a more promising method to apply.

Briscoe & Carroll (1997) propose a technique and implemented system for constructing a subcategorization dictionary from textual corpora. Their system is capable of distinguishing 160 subcategorization classes, and is able both to assign classes to individual verbal predicates and to rank them according to relative frequency. As described in Briscoe & Carroll (1997), the system consists of six overall components which are applied in sequence to sentences containing a specific predicate in order to retrieve a set of subcategorization classes for that predicate: a tagger, a lemmatizer, a probabilistic LR parser, a patternset extractor, a pattern classifier, and a patternsets evaluator. Even though the system has met with some success, it is not yet accurate enough.
The experimental evaluation performed by Briscoe & Carroll shows that the filtering of spurious hypotheses in the patternsets evaluator stage is the weak link of the system.

2 Filtering

In the current filter, the set of putative classes is filtered, following Brent (1993), by hypothesis testing on binomial frequency data. The system first records the total number of patternsets n for a given predicate, the number m of these patternsets containing a pattern supporting an entry for a given class, and estimates of the probability that a pattern for a class i will occur with a verb which is not a member of subcategorization class i. Briscoe & Carroll estimate this probability by first extracting the number of verbs which are members of each class in the ANLT dictionary (Boguraev et al., 1987), and converting …
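The binomial hypothesis test described above can be illustrated with a minimal sketch. It assumes the usual formulation of Brent's (1993) test: given n patternsets, m of which support class i, and an error probability p_error for that class, compute the probability of observing m or more supporting patternsets purely by error, and accept the class when that probability falls at or below a threshold. The function names, the 0.05 default threshold, and the example figures are assumptions made for illustration and are not taken from the system itself.

```python
from math import comb


def binomial_tail(m: int, n: int, p_error: float) -> float:
    """Probability of m or more supporting patternsets out of n,
    if each one arises by error with probability p_error."""
    return sum(
        comb(n, k) * p_error**k * (1.0 - p_error) ** (n - k)
        for k in range(m, n + 1)
    )


def accept_class(m: int, n: int, p_error: float, threshold: float = 0.05) -> bool:
    """Accept the class when the evidence is unlikely to be due to error alone.

    The threshold value is an assumed significance level, not a documented one.
    """
    return binomial_tail(m, n, p_error) <= threshold


if __name__ == "__main__":
    # Hypothetical counts: 20 patternsets for a verb, assumed error rate 0.05.
    print(accept_class(m=8, n=20, p_error=0.05))  # strong support for the class
    print(accept_class(m=1, n=20, p_error=0.05))  # a single hit is weak evidence
```

Under this formulation, the quality of the filter depends heavily on how well p_error is estimated for each class, which is why the estimation procedure matters for the accuracy of the system.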

[1]  Ralph Grishman,et al.  Comlex Syntax: Building a Computational Lexicon , 1994, COLING.

[2]  Frederick B. Thompson,et al.  English for the computer , 1966, AFIPS '66 (Fall).

[3]  David Elworthy,et al.  Does Baum-Welch Re-estimation Help Taggers? , 1994, ANLP.

[4]  Y. Wilks,et al.  A General Architecture for Text Engineering (GATE): A New Approach to Language Engineering R&D , 1995 .

[5]  C. Chapelle The Computational Analysis of English—A Corpus‐Based Approach , 1988 .

[6]  Michael R. Brent,et al.  From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax , 1993, Comput. Linguistics.

[7]  John A. Carroll Relating Complexity to Practical Performance in Parsing With Wide-Coverage Unification Grammars , 1994, ACL.

[8]  Ted Briscoe,et al.  The Derivation of a Grammatically Indexed Lexicon from the Longman Dictionary of Contemporary English , 1987, ACL.

[9]  Ted Briscoe,et al.  Apportioning Development Effort in a Probabilistic LR Parsing System Through Evaluation , 1996, EMNLP.

[10]  Yves Schabes,et al.  Stochastic Lexicalized Tree-adjoining Grammars , 1992, COLING.

[11]  D. Biber The computational analysis of English: A corpus-based approach: Roger Garside, Geoffrey Leech and Godfrey Sampson, eds., London: Longman, 1987. xii + p.£12.95. , 1991 .

[12]  Gregory P. Knowles,et al.  Manual of information to accompany the SEC corpus , 1988 .

[13]  Christopher D. Manning Automatic Acquisition of a Large Subcategorization Dictionary From Corpora , 1993, ACL.

[14]  John A. Carroll Practical unification-based parsing of Natural Language , 1993 .