A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Because abstracts are more readily available than full texts, text mining of the scientific literature has mostly been carried out on collections of abstracts. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe how article length and publication sub-topics developed over these nearly 200 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and we quantitatively report their accuracy using gold standard benchmark data sets. We then compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms text mining of abstracts alone.
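
As a rough illustration of what benchmarking extracted associations against a gold standard can look like, the sketch below scores two hypothetical sets of extracted protein–protein pairs (one from abstracts, one from full text) against a hypothetical gold standard set. The pair names, the helpers `make_pair` and `score_pairs`, and the use of precision and recall as metrics are assumptions made for illustration; the abstract does not specify the evaluation procedure.

```python
from typing import FrozenSet, Set, Tuple

Pair = FrozenSet[str]


def make_pair(a: str, b: str) -> Pair:
    """Build an unordered association, so (A, B) and (B, A) compare equal."""
    return frozenset((a, b))


def score_pairs(extracted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of extracted pairs against a gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
gold = {
    make_pair("TP53", "MDM2"),
    make_pair("BRCA1", "BARD1"),
    make_pair("CDK2", "CCNE1"),
}
from_abstracts = {make_pair("MDM2", "TP53"), make_pair("TP53", "EGFR")}
from_full_text = {
    make_pair("MDM2", "TP53"),
    make_pair("BARD1", "BRCA1"),
    make_pair("TP53", "EGFR"),
}

abstract_scores = score_pairs(from_abstracts, gold)
full_text_scores = score_pairs(from_full_text, gold)
print("abstracts (precision, recall):", abstract_scores)
print("full text (precision, recall):", full_text_scores)
```

In this toy setup the full-text set recovers more gold standard pairs than the abstract set, which is the kind of difference such a comparison is meant to surface; it says nothing about the magnitudes reported in the study.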
