Biomarker Discovery with Text Mining and Literature Based Discovery

The vast number of biomedical publications provides valuable data for research. However, extracting usable information from this integrated but unstructured biomedical literature is a difficult problem, and it calls for biomedical text mining techniques that aim to extract novel knowledge from scientific texts. In this chapter, we introduce the basics of text mining and examine some frequently used algorithms, tools, and data sets. With the development of systems biology, researchers increasingly seek to understand complex biomedical systems from a systems-level viewpoint, so making full use of text mining to facilitate systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in systems biology and each phase of that workflow. Finally, we discuss text mining technology for biomarker research.
