Can questions summarize a corpus? Using question generation for characterizing COVID-19 research

What are the latent questions in a collection of texts? In this work, we investigate using question generation models to explore a collection of documents. Our method, dubbed corpus2question, consists of applying a pre-trained question generation model over a corpus and aggregating the resulting questions by frequency and time. This technique is an alternative to methods such as topic modeling and word clouds for summarizing large amounts of textual data. Results show that applying corpus2question to a corpus of scientific articles related to COVID-19 yields relevant questions about the topic. The most frequent questions are "what is covid 19" and "what is the treatment for covid". Among the 1000 most frequent questions are "what is the threshold for herd immunity" and "what is the role of ace2 in viral entry". We show that the proposed method generated questions similar to 13 of the 27 expert-made questions from the CovidQA question answering dataset. The code to reproduce our experiments and the generated questions are available at: this https URL
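To make the pipeline concrete, the following is a minimal sketch of the corpus2question loop, not the authors' released code. It assumes the Hugging Face transformers library and a doc2query-style T5 question generation model; the checkpoint name castorini/doc2query-t5-base-msmarco is an illustrative assumption. Questions are sampled per passage and aggregated by frequency; the temporal dimension would be added by grouping counts per publication date.

```python
# Minimal sketch of a corpus2question-style pipeline (illustrative, not the
# authors' code). Assumes: pip install transformers sentencepiece torch.
from collections import Counter
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical choice of checkpoint: a doc2query-style T5 model trained to
# generate questions/queries from passages.
MODEL_NAME = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_questions(passage: str, num_questions: int = 3) -> list[str]:
    """Sample candidate questions for a single passage."""
    inputs = tokenizer(passage, return_tensors="pt",
                       truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,   # top-k sampling yields diverse questions
        top_k=10,
        num_return_sequences=num_questions,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def corpus2question(passages: list[str]) -> list[tuple[str, int]]:
    """Generate questions over the whole corpus and rank them by frequency."""
    counts: Counter = Counter()
    for passage in passages:
        counts.update(generate_questions(passage))
    return counts.most_common()

# Usage: rank questions generated from a toy two-passage corpus.
ranked = corpus2question([
    "COVID-19 is caused by the SARS-CoV-2 virus.",
    "SARS-CoV-2 uses the ACE2 receptor for viral entry.",
])
print(ranked[:10])
```

Counting exact string matches is the simplest aggregation; near-duplicate questions could instead be merged with an embedding-based similarity measure before ranking.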
