Blueprint for a High Performance NLP Infrastructure

Natural Language Processing (NLP) system developers face a number of new challenges. Interest in real-world systems that use NLP tools and techniques is increasing. The quantity of text available for training and processing is growing dramatically, and the range of languages and tasks being researched continues to expand rapidly. It is therefore an ideal time to consider developing new experimental frameworks. We describe the requirements, initial design, and exploratory implementation of a high performance NLP infrastructure.
