A tree kernel to analyze phylog enetic profi les

Motivation: The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every fully sequenced genome. Because proteins that participate in a common structural complex or metabolic pathway are likely to evolve in a correlated fashion, the phylogenetic profiles of such proteins are often “similar” or at least “related” to each other. The question we address in this paper is the following: how to measure the “similarity” between two profiles, in an evolutionarily relevant way, in order to develop efficient function prediction methods? Results: We show how the profiles can be mapped to a high-dimensional vector space which incorporates evolutionarily relevant information, and we provide an algorithm to compute efficiently the inner product in that space, which we call the tree kernel. The tree kernel can be used by any kernel-based analysis method for classification or data mining of phylogenetic profiles. As an application a Support Vector Machine (SVM) trained to predict the functional class of a gene from its phylogenetic profile is shown to perform better with the tree kernel than with a naive kernel that does not include any information about the phylogenetic relationships among species. Moreover a kernel principal component analysis (KPCA) of the phylogenetic profiles illustrates the sensitivity of the tree kernel to evolutionarily relevant variations. Availability: All data and softwares used are freely and publicly available upon request. Contact: Jean-Philippe.Vert@mines.org

[1]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[2]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[3]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[4]  Bernhard Schölkopf,et al.  Kernel Principal Component Analysis , 1997, ICANN.

[5]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[6]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[7]  Michael I. Jordan Graphical Models , 2003 .

[8]  Nello Cristianini,et al.  Advances in Kernel Methods - Support Vector Learning , 1999 .

[9]  B. Scholkopf,et al.  Fisher discriminant analysis with kernels , 1999, Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468).

[10]  David Haussler,et al.  Convolution kernels on discrete structures , 1999 .

[11]  D. Eisenberg,et al.  A combined algorithm for genome-wide prediction of protein function , 1999, Nature.

[12]  C. Watkins Dynamic Alignment Kernels , 1999 .

[13]  D. Eisenberg,et al.  Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kerneis That Recognize Translation Initialion Sites , 2000, German Conference on Bioinformatics.

[15]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[16]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[17]  H. Wesseling An introduction , 2000, European Review.

[18]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[19]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[21]  Jason Weston,et al.  Gene functional classification from heterogeneous data , 2001, RECOMB.

[22]  Hava T. Siegelmann,et al.  Support Vector Clustering , 2002, J. Mach. Learn. Res..

[23]  Arne Elofsson,et al.  The Use of Phylogenetic Profiles for Gene Predictions , 2002 .

[24]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[25]  Jean-Philippe Vert,et al.  Support Vector Machine Prediction of Signal Peptide Cleavage Site Using a New Class of Kernels for Strings , 2001, Pacific Symposium on Biocomputing.

[26]  Anthony Widjaja,et al.  Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2003, IEEE Transactions on Neural Networks.