Document Parsing: Towards Realistic Syntactic Analysis

In this work we treat syntactic analysis as the processing of ‘raw’, running text rather than idealised, pre-segmented inputs, a task we dub document parsing. We review the state of the art in sentence boundary detection and tokenisation and their effects on syntactic parsing (for English), and observe that common evaluation metrics are ill-suited to comparing ‘end-to-end’ syntactic analysis pipelines. To provide a more informative assessment of performance levels and error propagation throughout the full pipeline, we propose a unified evaluation framework and gauge document parsing accuracies for common processors and data sets.
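
To illustrate why metrics indexed by token position become hard to apply once segmentation is predicted rather than given, the following is a minimal sketch and not the evaluation framework proposed in this work. It assumes labelled constituents are anchored in character offsets of the raw text, so that gold and system analyses stay comparable even when their tokenisation differs. The function name `span_prf`, the `Span` type, and the toy spans are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A labelled constituent anchored in the raw text: (label, start_char, end_char).
# Character offsets are an assumption of this sketch; they do not depend on
# how either the gold standard or the system tokenised the text.
Span = Tuple[str, int, int]


def span_prf(gold: Iterable[Span], system: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching labelled spans."""
    gold_set: Set[Span] = set(gold)
    sys_set: Set[Span] = set(system)
    matched = len(gold_set & sys_set)
    precision = matched / len(sys_set) if sys_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy raw text: "I can't go." (11 characters).
# Gold tokenisation:   I | ca | n't | go | .
# System tokenisation: I | can't | go | .
# Token indices differ between the two, but character offsets do not.
gold_spans = [("S", 0, 11), ("NP", 0, 1), ("VP", 2, 10)]
system_spans = [("S", 0, 11), ("NP", 0, 1), ("VP", 2, 7)]  # VP attached too low

precision, recall, f1 = span_prf(gold_spans, system_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the two analyses share the S and NP spans despite splitting "can't" differently, while the mismatched VP span counts against both precision and recall. A token-indexed comparison would first need the two token sequences to be aligned.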
