COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval

We present COUGH, a large and challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, COUGH consists of three parts: an FAQ Bank, a User Query Bank, and an Annotated Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). The User Query Bank and the Annotated Relevance Set are introduced for evaluation: the former contains 1201 human-paraphrased queries, and the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing several FAQ retrieval models built on top of BM25 and BERT. The best of these models achieves a P@5 of only 0.29, indicating that the dataset poses a substantial challenge for future research. Our dataset is freely available.
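For readers unfamiliar with the metric, P@5 (precision at 5) is the fraction of the top five retrieved FAQ items that are annotated as relevant, averaged over all queries. The following is a minimal sketch of that computation, not the authors' evaluation script. The function names, the item identifiers, and the choice to divide by k even when fewer than k items are returned are assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked items that appear in the relevant set.

    Assumption: the denominator is always k, the common convention,
    even if the ranking contains fewer than k items.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item_id in ranked_ids[:k] if item_id in relevant_ids)
    return hits / k


def mean_precision_at_k(
    rankings: Dict[str, Sequence[str]],
    relevance: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Average P@k over all queries that have a ranking.

    `rankings` maps a query id to its ranked FAQ item ids;
    `relevance` maps a query id to its set of relevant FAQ item ids.
    A query with no annotated relevant items contributes 0.
    """
    if not rankings:
        raise ValueError("rankings must contain at least one query")
    scores = [
        precision_at_k(ranked, relevance.get(query_id, set()), k)
        for query_id, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two queries, five retrieved items each.
    toy_rankings = {
        "q1": ["f1", "f2", "f3", "f4", "f5"],
        "q2": ["f9", "f8", "f7", "f6", "f5"],
    }
    toy_relevance = {"q1": {"f1", "f3"}, "q2": {"f7"}}
    print(mean_precision_at_k(toy_rankings, toy_relevance, k=5))
```

Under this definition, a score of 0.29 means that, on average, between one and two of the five top-ranked FAQ items are relevant to the query.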
