Using Foreign Inclusion Detection to Improve Parsing Performance

Inclusions from other languages can be a significant source of errors for monolingual parsers. We show this for English inclusions, which are sufficiently frequent to present a problem when parsing German. We describe an annotation-free approach for accurately detecting such inclusions, and develop two methods for interfacing this approach with a state-of-the-art parser for German. An evaluation on the TIGER corpus shows that our inclusion entity model achieves a performance gain of 4.3 points in F-score over a baseline of no inclusion detection, and even outperforms a parser with access to gold standard part-of-speech tags.
