A Corpus-based Probabilistic Grammar with Only Two Non-terminals

The availability of large, syntactically bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real non-terminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S -> NP VBX JJ CC VBX NP [1] (where NP is a non-terminal and the other symbols are terminals: part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence.

Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules (23,386 for S, and 8,910 for NP). We also built a smaller version of the grammar based on higher-frequency patterns for use as a back-up when the larger grammar is unable to produce a parse due to memory limitations. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of no-crossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72.61%.
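The flattening step described above, where every node other than S and NP is dissolved into the right-hand side of the enclosing rule, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names `parse_sexp` and `flatten_rhs` are our own, and the demo uses the skeleton tree [2] from the text (with NP left as a bare symbol) rather than a full treebank tree.

```python
def parse_sexp(s):
    # Parse a bracketed tree such as
    # "(S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP)))"
    # into nested lists: ['S', 'NP', ['VP', ...]].
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(i):
        if tokens[i] == '(':
            node, i = [], i + 1
            while tokens[i] != ')':
                child, i = read(i)
                node.append(child)
            return node, i + 1          # skip the closing ')'
        return tokens[i], i + 1         # bare symbol (tag or NP/S placeholder)

    tree, _ = read(0)
    return tree

def flatten_rhs(node):
    # Build the flattened right-hand side of the rule for this node:
    # S and NP subtrees stay as the non-terminals 'S'/'NP'; every other
    # node (VP, ADJP, ...) is expanded until only POS-tag terminals remain.
    rhs = []
    for child in node[1:]:
        if isinstance(child, str):      # already a bare symbol
            rhs.append(child)
        elif child[0] in ('S', 'NP'):   # real non-terminal: stop flattening
            rhs.append(child[0])
        else:                           # embedded grammatical node: recurse
            rhs.extend(flatten_rhs(child))
    return rhs

tree = parse_sexp("(S NP (VP (VP VBX (ADJP JJ)) CC (VP VBX NP)))")
print(tree[0], '->', ' '.join(flatten_rhs(tree)))
# -> S -> NP VBX JJ CC VBX NP   (rule [1] recovered from structure [2])
```

In the paper's scheme, each extracted rule would also keep structure [2] attached, so that applying rule [1] during parsing lets the parser emit the full bracketing for that span.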