Paper Information - Using an Annotated Language Corpus as a Virtual Stochastic Grammar

Using an Annotated Language Corpus as a Virtual Stochastic Grammar

In Data Oriented Parsing (DOP), an annotated language corpus is used as a virtual stochastic grammar. An input string is parsed by combining subtrees from the corpus. As a consequence, one parse tree can usually be generated by several derivations that involve different subtrees. This leads to a statistical model in which the probability of a parse equals the sum of the probabilities of all its derivations. Scha (1990) gives an informal introduction to DOP, while Bod (1992) provides a formalization of the theory. In this paper we show that the maximum probability parse can be estimated in polynomial time by applying Monte Carlo techniques. The model was tested on a set of hand-parsed strings from the Air Travel Information System (ATIS) corpus. Preliminary experiments yield 96% test set parsing accuracy.
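The relationship between derivations and parses described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the `DERIVATIONS` table, the parse labels `T1` and `T2`, and all probabilities are hypothetical, and the toy enumerates every derivation up front, whereas a Monte Carlo parser would presumably sample derivations without listing them all. The sketch assumes that a derivation's probability is the product of the probabilities of its subtrees, sums derivation probabilities per parse, and then estimates the most probable parse by sampling derivations and counting which parse they produce most often.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy data: each entry is (parse label, subtree probabilities
# used by one derivation of that parse). None of this comes from the paper.
DERIVATIONS = [
    ("T1", [0.5, 0.4]),  # derivation probability 0.20
    ("T1", [0.3]),       # derivation probability 0.30
    ("T2", [0.35]),      # derivation probability 0.35 (the single best derivation)
]


def derivation_probability(subtree_probs):
    """Assumed model: a derivation's probability is the product of its subtree probabilities."""
    return math.prod(subtree_probs)


def parse_probabilities(derivations):
    """Sum the probabilities of all derivations that yield the same parse."""
    totals = defaultdict(float)
    for parse_label, subtree_probs in derivations:
        totals[parse_label] += derivation_probability(subtree_probs)
    return dict(totals)


def estimate_best_parse(derivations, n_samples, rng):
    """Sample derivations in proportion to their probability and return
    the parse that the sampled derivations produce most often."""
    weights = [derivation_probability(p) for _, p in derivations]
    sampled = rng.choices(derivations, weights=weights, k=n_samples)
    counts = Counter(parse_label for parse_label, _ in sampled)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(parse_probabilities(DERIVATIONS))
    print(estimate_best_parse(DERIVATIONS, 5000, random.Random(0)))
```

In this toy table the single most probable derivation belongs to `T2`, yet `T1` has the larger total once its two derivations are summed, which is why the most probable parse cannot be read off from the best derivation alone.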

Rens Bod

[1] Mitchell P. Marcus. Very Large Annotated Database of American English, 1990, HLT.

[2] Philip Resnik, et al. Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language Processing, 1992, COLING.

[3] J. Hammersley, et al. Monte Carlo Methods, 1965.

[4] George R. Doddington, et al. The ATIS Spoken Language Systems Pilot Corpus, 1990, HLT.

[5] Yves Schabes, et al. Stochastic Lexicalized Tree-adjoining Grammars, 1992, COLING.

[6] Rens Bod, et al. A Computational Model of Language Performance: Data Oriented Parsing, 1992, COLING.

[7] Frederick Jelinek, et al. Basic Methods of Probabilistic Context Free Grammars, 1992.