Weighted graph matching approaches to structure comparison and alignment and their application to biological problems

KWANGBOM CHOI: Weighted Graph Matching Approaches to Structure Comparison and Alignment and their Application to Biological Problems. (Under the direction of Shawn M. Gomez.)

In pattern recognition and machine learning, comparing and contrasting are the most fundamental operations: from similarities we derive the common rules encoded in systems, while from differences we infer what makes each system unique. The biological sciences are no exception and, in fact, rely heavily on these operations. More recently, the emergence of high-throughput measurement technologies has highlighted the need for novel approaches that enhance our ability to understand complex relationships in the resulting data sets. Often, these relationships are best represented as graphs (or networks), where nodes are biochemical components such as genes, RNAs, proteins, or metabolites, and edges indicate the type (and often the quality) of the relationship. Comparison of relationships is generally performed by aligning the networks of interest. For example, for protein-protein interaction (PPI) networks, the goal of network alignment is to find mappings between nodes (proteins); such mappings are highly useful for identifying signaling pathways or protein complexes and for annotating genes of unknown function from subnetworks conserved across multiple species. Phylogenetic trees are also graph structures; they describe evolutionary relationships among groups of organisms and their hypothetical ancestors. As a large volume of previous work has shown, comparing trees also makes it possible to support or build new evolutionary hypotheses: for example, the detection of host-parasite symbiosis, gene coevolution as a signal of physical interactions among genes, or nonstandard events such as horizontal gene transfer.
The goal of this thesis is to develop and implement a flexible set of algorithms and methodologies for aligning trees and/or networks of various sizes and properties. We first define a new relaxed model of graph isomorphism in which shortest path lengths are preserved between corresponding pairs of nodes within each graph. Then, based on Google’s PageRank model, we present a new tree matching approach, phyloAligner, which resolves several weaknesses of previous approaches. We further generalize this tree matching algorithm to a broader, more flexible framework, MCS-Finder, as a scalable and error-tolerant approximation for identifying the maximum common substructure between
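The relaxed isomorphism idea mentioned above (a node mapping under which shortest path lengths between corresponding node pairs are preserved) can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the thesis's own algorithm: it uses unweighted graphs given as adjacency lists, takes the node mapping as an input rather than searching for one, and the names `bfs_distances` and `path_length_violations` are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_violations(adj_a, adj_b, mapping):
    """List mapped node pairs whose shortest-path lengths differ.

    mapping sends nodes of graph A to nodes of graph B (an assumed input).
    An empty result means the mapping preserves all pairwise path lengths.
    """
    dist_a = {u: bfs_distances(adj_a, u) for u in mapping}
    dist_b = {mapping[u]: bfs_distances(adj_b, mapping[u]) for u in mapping}
    violations = []
    for u, v in combinations(sorted(mapping), 2):
        d_a = dist_a[u].get(v)  # None if v is unreachable from u
        d_b = dist_b[mapping[u]].get(mapping[v])
        if d_a != d_b:
            violations.append((u, v, d_a, d_b))
    return violations


# Graph A: the path a - b - c.
adj_a = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
# Graph B1: the path 1 - 2 - 3 with an extra leaf 4 attached to node 2.
adj_b_ok = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
# Graph B2: a triangle on 1, 2, 3 (the edge 1 - 3 shortens one path).
adj_b_bad = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
mapping = {"a": 1, "b": 2, "c": 3}

ok = path_length_violations(adj_a, adj_b_ok, mapping)
bad = path_length_violations(adj_a, adj_b_bad, mapping)
```

In this toy case the mapping into the first target graph preserves every pairwise path length even though the two graphs are not isomorphic (the target has an unmatched extra node), while the triangle breaks the distance between `a` and `c`. A weighted variant would replace the breadth-first search with a weighted shortest-path routine.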
