Quantifying Human-Perceived Answer Utility in Non-factoid Question Answering

Taking a user-centric approach, we study the features that render an answer to a non-factoid question useful in the eyes of the person who asked that question. An editorial study, in which participants assess the usefulness of the answers they received in response to their questions, as well as 12 different aspects associated with those answers, indicates a considerable correlation between certain aspects, such as relevance, correctness, and completeness, and the user-perceived usefulness of answers. Moreover, we investigate the effectiveness of commonly used answer quality measures, such as ROUGE, BLEU, METEOR, and BERTScore, demonstrating that these measures are limited in their ability to capture the aspects of usefulness and have room for improvement. The question answering dataset created in this work has been made publicly available.
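As a rough illustration of the kind of analysis described above (not the paper's exact pipeline), the sketch below computes two of the mentioned overlap metrics, BLEU and ROUGE-L, for a few candidate answers and correlates the scores with human usefulness ratings using Spearman's rho. The packages (nltk, rouge-score, scipy) and the toy answers and ratings are illustrative assumptions; METEOR and BERTScore would slot in analogously but require additional resources (WordNet data and a pretrained BERT model, respectively).

```python
# Minimal sketch: correlating automatic answer-quality metrics with
# human-perceived usefulness. All data below is invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical reference answers, system answers, and 1-5 human
# usefulness ratings for each question.
references = [
    "Reboot the router and check that the firmware is up to date.",
    "Soak the beans overnight, then simmer them for about an hour.",
    "Compound interest grows faster because interest is earned on interest.",
]
answers = [
    "Try rebooting your router and updating its firmware.",
    "You should fry the beans immediately without soaking.",
    "Interest compounds, so you earn interest on previously earned interest.",
]
human_usefulness = [5, 1, 4]

smooth = SmoothingFunction().method1
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

bleu_scores, rouge_scores = [], []
for ref, ans in zip(references, answers):
    # Sentence-level BLEU with smoothing (short texts otherwise yield
    # zero n-gram overlap); tokenization is a naive whitespace split.
    bleu_scores.append(
        sentence_bleu([ref.split()], ans.split(), smoothing_function=smooth)
    )
    # ROUGE-L F-measure between the reference and the candidate answer.
    rouge_scores.append(rouge.score(ref, ans)["rougeL"].fmeasure)

# Rank correlation between each metric and the human usefulness ratings.
for name, scores in [("BLEU", bleu_scores), ("ROUGE-L", rouge_scores)]:
    rho, p = spearmanr(scores, human_usefulness)
    print(f"{name}: Spearman's rho = {rho:.3f} (p = {p:.3f})")
```

With a realistic number of questions, a low rank correlation for a metric would indicate, as the study reports, that surface-overlap scores capture human-perceived usefulness only partially.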
