Evaluating natural language processing systems

A variety of parties may be interested in evaluating natural language processing (NLP) systems, ranging from funding authorities, who must choose between competing research projects and justify their choices through the results subsequently obtained, to end users, who need to choose between competing products. If the product is expensive, or if it implies a major reorganization of workflow, the user may also need to provide post hoc justification for the choice.

Perhaps surprisingly, the literature on evaluation is relatively sparse, for several reasons. First, evaluations are often carried out under consultancy arrangements for a particular customer. Not only is such an evaluation tailor-made to suit that customer, and therefore not considered to be of general interest, but the customer may be reluctant to have the results made public, either because of an agreement with the manufacturer whose product has been evaluated or out of an unwillingness to reveal the results to competitors. Second, evaluation of research proposals and of projects in the academic arena has traditionally been carried out by peer review. This has changed somewhat in particular areas, under the influence of the DARPA/ARPA series of evaluations, but peer review remains the most common pattern, and its only result is a report that is often confidential. Third, evaluation acquired a bad name as a result of the Automatic Language Processing Advisory Committee's (ALPAC) report.
