Evaluation of NLP Systems

• Evaluation is itself a first class research activity: creation of effective evaluation methods drives more rapid progress and better communication within a research community. (Hirschman, 1998:302f)
• [Before] there were no common measures and no shared data. As a consequence, systems and approaches could not be precisely compared and results could not be replicated. (Gaizauskas, 1998:249)

Lack of evaluation history