COVIDSeer: Extending the CORD-19 Dataset

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools from the SeerSuite project are used to extract information from the original articles that is not directly provided in CORD-19. We add 728 new abstracts, 70,102 figures, and 31,446 tables with captions that are not provided in the current data release. We also build COVIDSeer, a vertical search engine based on the new dataset. COVIDSeer has a relatively simple architecture with features such as keyword filtering and similar-paper recommendation. The goal is to provide a system and dataset that help scientists better navigate the literature concerning COVID-19. The enriched dataset can serve as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, will hopefully help biomedical and life science researchers, medical students, and the general public explore coronavirus-related literature more effectively. The entire dataset and the system will be made open source.
