PMC text mining subset in BioC: about three million full-text articles and growing

Summary: Interest in text mining full-text biomedical research articles is growing. To facilitate automated processing of nearly 3 million full-text articles (in the PMC Open Access and Author Manuscript subsets) and to improve interoperability, we convert these articles to BioC, a community-driven, simple data structure in either XML or JSON format for conveniently sharing text and annotations.

Results: The resulting articles can be downloaded via FTP for bulk access, or via a Web API for updates or a more focused collection. Since the Web API became available in 2017, our BioC collection has been widely used by the research community.

Availability: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/
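As a rough illustration of what "text and annotations" in BioC JSON looks like to a consumer, the sketch below walks a small hand-written collection and lists its passages. The sample document, its identifier, and the `section_type` infon key are assumptions made for this example, and the helper `iter_passages` is hypothetical; the files actually distributed may carry different or additional fields.

```python
import json

# Hypothetical, hand-written BioC JSON collection (not taken from PMC).
SAMPLE_BIOC_JSON = json.dumps({
    "source": "PMC",
    "date": "20190101",
    "key": "example.key",
    "infons": {},
    "documents": [
        {
            "id": "PMC0000000",  # placeholder identifier
            "infons": {},
            "passages": [
                {
                    "offset": 0,
                    "infons": {"section_type": "TITLE"},  # assumed infon key
                    "text": "An example title",
                    "annotations": [],
                    "relations": [],
                },
                {
                    "offset": 17,
                    "infons": {"section_type": "ABSTRACT"},
                    "text": "An example abstract sentence.",
                    "annotations": [],
                    "relations": [],
                },
            ],
            "relations": [],
        }
    ],
})


def iter_passages(collection):
    """Yield (document id, section type, offset, text) for every passage."""
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            section = passage.get("infons", {}).get("section_type", "UNKNOWN")
            yield document["id"], section, passage["offset"], passage["text"]


collection = json.loads(SAMPLE_BIOC_JSON)
passages = list(iter_passages(collection))
for doc_id, section, offset, text in passages:
    print(f"{doc_id}\t{section}\t{offset}\t{text}")
```

Because each passage carries its own offset and infons, downstream tools can select sections (for example, only abstracts) without re-parsing the original article markup.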
