Graph pattern mining techniques to identify potential model organisms

Recent advances in high-throughput technologies have produced a growing body of rich and diverse biological data and related literature. Model organisms are classically selected as subjects for studying human disease based on their genotypic and phenotypic features. A significant challenge in identifying model organisms is determining the characteristic features of biological processes that can provide insight into the mechanisms underlying disease. Such insight could improve the diagnosis and management of diseases and the development of therapeutic drugs. The increased availability of biological data presents an opportunity to develop data mining methods that address these challenges and help scientists formulate and test data-driven hypotheses. In this dissertation, data mining methods were developed to provide a quantitative approach for identifying potential model organisms based on underlying features that may be correlated with disease manifestation in humans. The work encompassed three major types of contributions, each aimed at challenges in inferring information from biological data available from a range of sources. First, new statistical models and algorithms for graph pattern mining were developed and tested on diverse types of data (biological networks, drug chemical compounds, and text documents). Second, data mining techniques were developed and shown to identify characteristic disease patterns (disease fingerprints), predict potentially new genetic pathways, and facilitate the assessment of organisms as potential disease models. Third, a methodology was developed that combined graph-based models with information derived from natural language processing methods to identify statistically significant patterns in biomedical text.
Together, the approaches developed for this dissertation show promise for summarizing information about the biological processes and phenomena associated with organisms broadly, and for assessing the suitability of those organisms for studying human diseases.
