Paper Information

COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval

We present COUGH, a large and challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, COUGH consists of three parts: an FAQ Bank, a User Query Bank, and an Annotated Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce the User Query Bank, which contains 1,201 human-paraphrased queries, and the Annotated Relevance Set, which contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing different FAQ retrieval models built on top of BM25 and BERT; the best model achieves a P@5 of only 0.29, indicating that the dataset presents a great challenge for future research. Our dataset is freely available at this https URL.
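The P@5 figure quoted in the abstract can be read as follows. The snippet is a minimal sketch that assumes the common definition of precision at k (the fraction of the top k retrieved items that are annotated as relevant, averaged over queries). The function names, the FAQ identifiers, and the toy rankings are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Dict, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved FAQ items that are annotated as relevant.

    Assumption: the denominator is always k, even if fewer than k items
    were retrieved. Other conventions exist.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def mean_precision_at_k(runs: Dict[str, Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Average P@k over all queries; `runs` maps query -> (ranking, relevant set)."""
    if not runs:
        raise ValueError("no queries to evaluate")
    scores = [precision_at_k(ranking, relevant, k) for ranking, relevant in runs.values()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two user queries, each with a system ranking of FAQ ids
# and the set of FAQ ids a human annotator marked as relevant.
toy_runs = {
    "can pets catch covid": (["f1", "f2", "f3", "f4", "f5"], {"f1", "f4"}),
    "how long to quarantine": (["f9", "f7", "f8", "f6", "f2"], {"f7", "f3"}),
}

per_query = {q: precision_at_k(r, rel) for q, (r, rel) in toy_runs.items()}
overall = mean_precision_at_k(toy_runs)
print(per_query, overall)
```

Under this definition, a score of 0.29 means that on average fewer than two of the five top-ranked FAQ items per query were judged relevant.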

Heming Sun | Xiang Yue | Huan Sun | Xinliang Frederick Zhang | Emmett Jesrani | Simon Lin
