Evaluating the Coverage of LTAGs on Annotated Corpora

Abstract

Lexicalized Tree Adjoining Grammars (LTAGs) have been applied to many NLP applications. Evaluating the coverage of an LTAG is important for both its developers and its users. In this paper, we describe a method that estimates a grammar's coverage on annotated corpora by first automatically extracting a Treebank grammar from the corpus and then calculating the overlap between the two grammars. We used the method to test the coverage of the XTAG grammar, a large-scale hand-crafted grammar for English, on the English Penn Treebank; the results show that the grammar covers at least 97.2% of the template tokens in the Treebank. The method has several advantages: first, the whole process is semi-automatic and requires little human effort; second, coverage can be calculated at the sentence level or at more fine-grained levels; third, the method provides a set of new templates that can be added to the grammar to improve its coverage; fourth, there is no need to parse the corpus.
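To make the notion of overlap concrete, the following is a minimal sketch of a token-level coverage computation. It assumes that the templates of both grammars have already been mapped to comparable identifiers, which is a simplification: matching an extracted template against a hand-crafted one may require structural comparison. The function name `template_coverage`, the identifiers `t1`, `t2`, `t3`, `t9`, and the counts are hypothetical and chosen only for illustration; they are not data from the experiments.

```python
from collections import Counter


def template_coverage(treebank_counts, grammar_templates):
    """Estimate how much of a Treebank grammar a hand-crafted grammar covers.

    treebank_counts   -- Counter mapping template id -> frequency in the corpus
                         (hypothetical representation of the extracted grammar)
    grammar_templates -- set of template ids in the hand-crafted grammar

    Returns (type_coverage, token_coverage, missing), where `missing` lists the
    uncovered templates, most frequent first.
    """
    total_types = len(treebank_counts)
    total_tokens = sum(treebank_counts.values())
    if total_tokens == 0:
        raise ValueError("the Treebank grammar contains no template tokens")

    # Templates that occur in the corpus and also exist in the grammar.
    matched = {t: n for t, n in treebank_counts.items() if t in grammar_templates}

    type_coverage = len(matched) / total_types
    token_coverage = sum(matched.values()) / total_tokens

    # Uncovered templates are candidates for addition to the grammar.
    missing = sorted(
        (t for t in treebank_counts if t not in grammar_templates),
        key=lambda t: -treebank_counts[t],
    )
    return type_coverage, token_coverage, missing


# Toy input, invented for illustration only.
counts = Counter({"t1": 6, "t2": 3, "t3": 1})
grammar = {"t1", "t2", "t9"}
type_cov, token_cov, missing = template_coverage(counts, grammar)
```

Because frequent templates dominate the token count, token coverage is typically higher than type coverage. The `missing` list corresponds to the set of new templates that could be added to the grammar.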