In Data Oriented Parsing (DOP), an annotated corpus is used as a stochastic grammar. An input string is parsed by combining subtrees from the corpus. As a consequence, one parse tree can usually be generated by several derivations that involve different subtrees. This yields a statistical model in which the probability of a parse is equal to the sum of the probabilities of all its derivations. An informal introduction to DOP is given in (Scha, 1990), while (Bod, 1992a) provides a formalization of the theory. In this paper we compare DOP with other stochastic grammars in the context of Formal Language Theory. It is proved that it is not possible to construct, for every DOP model, a strongly equivalent stochastic CFG that also assigns the same probabilities to the parses. We show that the maximum probability parse can be estimated in polynomial time by applying Monte Carlo techniques. The model was tested on a set of hand-parsed strings from the Air Travel Information System (ATIS) spoken language corpus. Preliminary experiments yield 96% test set parsing accuracy.
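To make the Monte Carlo disambiguation idea concrete: since the probability of a parse is the sum of the probabilities of its derivations, the relative frequency with which a parse shows up among randomly sampled derivations estimates its probability, so the most frequently sampled parse converges on the maximum probability parse. The following is a minimal Python sketch of that estimator, not the paper's actual procedure; the sample_derivation callable is a hypothetical stand-in for a routine that draws one random derivation according to the corpus subtree probabilities and returns the parse tree it yields.

    from collections import Counter

    def most_probable_parse(sample_derivation, n_samples=1000):
        """Monte Carlo estimate of the maximum probability parse.

        sample_derivation (hypothetical) draws one random derivation
        according to the subtree probabilities and returns the parse
        tree it yields, as a hashable value such as a bracketed string.
        The most frequent parse among the samples approximates the
        maximum probability parse as n_samples grows.
        """
        counts = Counter(sample_derivation() for _ in range(n_samples))
        parse, _ = counts.most_common(1)[0]
        return parse

The accuracy of the estimate improves with n_samples at the usual Monte Carlo rate, which is what keeps the overall procedure polynomial even though exact maximization over all derivations is not.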
[1] Rens Bod et al. A Computational Model of Language Performance: Data Oriented Parsing. COLING, 1992.
[2] George R. Doddington et al. The ATIS Spoken Language Systems Pilot Corpus. HLT, 1990.
[3] Yves Schabes et al. Stochastic Lexicalized Tree-adjoining Grammars. COLING, 1992.
[4] Frederick Jelinek et al. Basic Methods of Probabilistic Context Free Grammars. 1992.
[5] Fernando Pereira et al. Inside-Outside Reestimation From Partially Bracketed Corpora. HLT, 1992.
[6] Philip Resnik et al. Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language Processing. COLING, 1992.
[7] Mitchell P. Marcus. Very Large Annotated Database of American English. HLT, 1990.