COVIDSeer: Filling missing pieces in the CORD-19 dataset

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools from our SeerSuite project are used to extract information that is present in the original articles but not directly provided in CORD-19. We add 728 new abstracts, 70,102 figures, and 31,446 tables with captions that are not provided in the current data release. We also build COVIDSeer, a vertical search engine based on the new dataset. COVIDSeer has a relatively simple architecture with features such as keyword filtering and similar-paper recommendation. The goal is to provide a system and dataset that help scientists better navigate the literature around COVID-19. The enriched dataset serves as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, is intended to help biomedical and life science researchers, medical students, and the general public explore coronavirus-related literature more effectively. The entire dataset and system will be made open source.
