A model for quantitative evaluation of an end-to-end question-answering system

We present a model for the quantitative evaluation of interactive question-answering systems and illustrate it with an application to the High-Quality Interactive Question-Answering (HITIQA) system. Our objectives were (a) to design a method to realistically and reliably assess interactive question-answering systems by comparing the quality of reports produced using different systems, (b) to conduct a pilot test of this method, and (c) to perform a formative evaluation of the HITIQA system. Far more important than the specific information gathered from this pilot evaluation is the development of (a) a protocol for evaluating an emerging technology, (b) reusable assessment instruments, and (c) the knowledge gained in conducting the evaluation. We conclude that this method, which uses a surprisingly small number of subjects and does not rely on predetermined relevance judgments, measures the impact of system change on the work produced by users. This method can therefore be used to compare the products of interactive systems built on different underlying technologies.
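The abstract describes the evaluation method only at this high level. Purely as an illustration of the underlying idea of comparing systems by the quality of the reports their users produce rather than by predetermined relevance judgments, the sketch below aggregates hypothetical judge ratings of reports by system and applies a permutation test to the difference in mean report quality. All system names, report identifiers, judges, scores, and the choice of a permutation test are assumptions for illustration and are not taken from the paper.

    import random
    import statistics
    from collections import defaultdict

    # Hypothetical data: each judge rates each report on a 1-7 quality scale.
    # Tuples are (system, report_id, judge_id, score); values are illustrative only.
    ratings = [
        ("system_A", "r1", "j1", 6), ("system_A", "r1", "j2", 5),
        ("system_A", "r2", "j1", 4), ("system_A", "r2", "j2", 5),
        ("system_B", "r3", "j1", 3), ("system_B", "r3", "j2", 4),
        ("system_B", "r4", "j1", 5), ("system_B", "r4", "j2", 4),
    ]

    def report_scores(ratings):
        """Average the judges' scores for each report, then group reports by system."""
        per_report = defaultdict(list)
        for system, report, _judge, score in ratings:
            per_report[(system, report)].append(score)
        by_system = defaultdict(list)
        for (system, _report), scores in per_report.items():
            by_system[system].append(statistics.mean(scores))
        return by_system

    def permutation_test(a, b, trials=10000, seed=0):
        """Two-sided permutation test on the difference of mean report scores."""
        rng = random.Random(seed)
        observed = abs(statistics.mean(a) - statistics.mean(b))
        pooled = list(a) + list(b)
        hits = 0
        for _ in range(trials):
            rng.shuffle(pooled)
            diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
            if diff >= observed:
                hits += 1
        return hits / trials

    by_system = report_scores(ratings)
    for system, scores in sorted(by_system.items()):
        print(f"{system}: mean report quality = {statistics.mean(scores):.2f} (n={len(scores)})")
    print("p-value (A vs B):", permutation_test(by_system["system_A"], by_system["system_B"]))

With very few subjects, as in the pilot described here, a resampling test of this kind is one plausible way to gauge whether an observed difference in report quality between systems is larger than chance; the paper itself does not specify this particular analysis.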
