Paper information - COVIDSeer: Extending the CORD-19 Dataset

COVIDSeer: Extending the CORD-19 Dataset

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools from the SeerSuite project are used to extract information from the original articles that is not directly provided in the CORD-19 dataset. We add 728 new abstracts, 70,102 figures, and 31,446 tables with captions that are not provided in the current data release. We also build a vertical search engine, COVIDSeer, on top of the new dataset. COVIDSeer has a relatively simple architecture with features such as keyword filtering and similar-paper recommendation. The goal is to provide a system and dataset that help scientists better navigate the literature concerning COVID-19. The enriched dataset can serve as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, will hopefully help biomedical and life science researchers, medical students, and the general public explore coronavirus-related literature more effectively. The entire dataset and the system will be made open source.
