Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids

Background: Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.

Methodology: A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.

Results: Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.

Conclusions: We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.
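
The abstract does not give the exact Z-score formula, so the following is only a minimal sketch of how tagged entities might be scored and ranked by comparing their frequency in the positive and negative abstract sets. It assumes a pooled two-proportion z-test, which may differ from the statistic the authors used. The function names and the entity counts (`MARKER_A`, `MARKER_B`, `MARKER_C`) are hypothetical; only the two set sizes are taken from the breast cancer figures above.

```python
import math


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """Pooled two-proportion z-score (an assumed statistic, not the paper's).

    Compares the fraction of positive-set abstracts mentioning an entity
    with the fraction of negative-set abstracts mentioning it.
    """
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        # The entity appears in no abstracts (or in all of them): no signal.
        return 0.0
    return (p_pos - p_neg) / se


def rank_entities(counts, pos_total, neg_total):
    """Return (entity, z) pairs sorted from highest to lowest z-score."""
    scored = [
        (name, two_proportion_z(pos, pos_total, neg, neg_total))
        for name, (pos, neg) in counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Set sizes reported for breast cancer in the abstract.
POS_TOTAL = 34_296
NEG_TOTAL = 2_653_396

# Hypothetical per-entity counts: (positive-set hits, negative-set hits).
counts = {
    "MARKER_A": (500, 1_000),   # enriched in the positive set
    "MARKER_B": (50, 4_000),    # about as common in both sets
    "MARKER_C": (0, 0),         # never tagged
}

ranked = rank_entities(counts, POS_TOTAL, NEG_TOTAL)
for name, z in ranked:
    print(f"{name}\t{z:.2f}")
```

Under this assumption, an entity with a large positive score is mentioned disproportionately often in disease-associated abstracts and would be a candidate for manual verification; a significance cutoff would still need to be chosen separately.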
