A Survey of Graph Mining Techniques for Biological Datasets

Mining structured information has been the subject of much research in the data mining community over the last decade, and bioinformatics has emerged as an important application area in this context. Examples abound, ranging from the analysis of protein interaction networks to the analysis of phylogenetic data. In this article we survey the principal results in the field, examining both their algorithmic contributions and their applicability to the domain in question. We conclude with a discussion of the key results and identify some interesting directions for future research.
