A Corpus-based Probabilistic Grammar with Only Two Non-terminals

The availability of large, syntactically bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real non-terminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S -> NP VBX JJ CC VBX NP [1] (where NP is a non-terminal and the other symbols are terminals: part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence.

Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules (23,386 for S, and 8,910 for NP). We also built a smaller version of the grammar based on higher-frequency patterns for use as a back-up when the larger grammar is unable to produce a parse due to memory limitations. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of no-crossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72.61%.
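The flattening step described above, where every node other than S and NP is dissolved into the right-hand side of the enclosing rule, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names `parse_sexp` and `flatten_rhs` are our own, and the demo uses the skeleton tree [2] from the text (with NP left as a bare symbol) rather than a full treebank tree.

```python
def parse_sexp(s):
    # Parse a bracketed tree such as
    # "(S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP)))"
    # into nested lists: ['S', 'NP', ['VP', ...]].
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(i):
        if tokens[i] == '(':
            node, i = [], i + 1
            while tokens[i] != ')':
                child, i = read(i)
                node.append(child)
            return node, i + 1          # skip the closing ')'
        return tokens[i], i + 1         # bare symbol (tag or NP/S placeholder)

    tree, _ = read(0)
    return tree

def flatten_rhs(node):
    # Build the flattened right-hand side of the rule for this node:
    # S and NP subtrees stay as the non-terminals 'S'/'NP'; every other
    # node (VP, ADJP, ...) is expanded until only POS-tag terminals remain.
    rhs = []
    for child in node[1:]:
        if isinstance(child, str):      # already a bare symbol
            rhs.append(child)
        elif child[0] in ('S', 'NP'):   # real non-terminal: stop flattening
            rhs.append(child[0])
        else:                           # embedded grammatical node: recurse
            rhs.extend(flatten_rhs(child))
    return rhs

tree = parse_sexp("(S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP)))")
print(tree[0], '->', ' '.join(flatten_rhs(tree)))
# -> S -> NP VBX JJ CC VBX NP   (rule [1] recovered from structure [2])
```

In the paper's scheme, each extracted rule would also keep structure [2] attached, so that applying rule [1] during parsing lets the parser emit the full bracketing for that span.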