CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT encoder that is linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module that, together with a multi-paragraph abstractive summarizer, adjusts retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.
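The reciprocal-rank fusion step named above combines the rankings produced by the different retrievers. A minimal sketch of the standard RRF formula (with the conventional smoothing constant k = 60) is shown below; the function and variable names are illustrative, not taken from the CO-Search implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal-rank fusion.

    rankings: list of ranked lists, each ordering doc ids best-first.
    Each document's fused score is the sum over lists of 1 / (k + rank),
    so documents ranked highly by multiple retrievers rise to the top.
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: two retrievers disagree on the top document; RRF rewards the
# document that both rank near the top.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

With these two input rankings, "b" wins because it is ranked first by one retriever and second by the other, while "a" is first in one list but last in the other.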
