Parsing engineering and empirical robustness

Robustness has traditionally been stressed as a desirable property of any computational model or system. The human natural language (NL) interpretation device exhibits this property as the ability to deal with odd sentences. However, the difficulty of explaining robustness theoretically within linguistic modelling suggests adopting an empirical notion instead. In this paper, we propose an empirical definition of robustness based on the notion of performance. We also present a framework for controlling parser robustness in the design phase. Control is achieved through two principles: modularisation, typical of software engineering practice, and the availability of domain-adaptable components. The methodology has been adopted for the production of CHAOS, a pool of syntactic modules that has been used in real applications. This pool of modules enables a large-scale validation of both the notion of empirical robustness and the design methodology, over different corpora and two languages (English and Italian).
