An Integrated Statistical Model for Tagging and Chunking Unrestricted Text

In this paper, we present a corpus-based approach to tagging and chunking. The formalism is based on stochastic finite-state automata, so it can incorporate n-gram models or any stochastic finite-state automaton learnt using grammatical inference techniques. Because the models involved are learnt automatically, the system is flexible and portable across languages and chunk definitions. To show the viability of the approach, we present tagging and chunking results obtained with different combinations of bigrams and more complex automata learnt by means of the Error Correcting Grammatical Inference (ECGI) algorithm. The experiments were carried out on the Wall Street Journal corpus for English and on the Lexesp corpus for Spanish.
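As a rough illustration of the simplest model family mentioned above (bigrams), the following sketch trains a tag-bigram model from a tagged corpus and decodes the most likely tag sequence with the Viterbi algorithm. It is not the system described in this paper: the function names, the add-one smoothing, the `<s>`/`</s>` boundary markers, and the toy corpus are all assumptions made for the example, and the integrated chunking model and the ECGI-learnt automata are not reproduced.

```python
import math
from collections import defaultdict


def train_bigram_tagger(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev]["</s>"] += 1  # hypothetical sentence-end marker
    return trans, emit


def log_prob(counts, context, event, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log P(event | context); smoothing is an assumption."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words` under the bigram model."""
    if not words:
        return []
    tags = sorted(emit)
    vocab = {w for t in emit for w in emit[t]}
    n_tags = len(tags) + 1    # all tags plus the end marker
    n_words = len(vocab) + 1  # known words plus one slot for unknown words

    # best[i][t] = (best log score of a path ending in tag t at position i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans, "<s>", t, n_tags) + log_prob(emit, t, words[0], n_words)
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit, t, words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans, p, t, n_tags) + e, p)
                for p in tags
            )

    # Close the path with the end-of-sentence transition, then follow backpointers.
    _, last = max(
        (best[-1][t][0] + log_prob(trans, t, "</s>", n_tags), t) for t in tags
    )
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy corpus, invented for the example (not taken from WSJ or Lexesp).
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
trans, emit = train_bigram_tagger(toy_corpus)
print(viterbi(["the", "cat", "barks"], trans, emit))
```

Since a bigram model is itself a stochastic finite-state automaton whose states are tags, the same decoder could in principle run over a richer label set, for instance labels that pair a tag with a chunk boundary marker; that extension is only a suggestion here and is not shown.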
