Framework for kernel regularization with application to protein clustering.

We develop and apply a previously undescribed framework for extracting information, in the form of a positive definite kernel matrix, from possibly crude, noisy, incomplete, and inconsistent dissimilarity information between pairs of objects, as can be obtained in a variety of contexts. Any positive definite kernel defines a consistent set of distances, and the fitted kernel provides a set of coordinates in Euclidean space that attempts to respect the available information while controlling the complexity of the kernel. The resulting coordinates are well suited to visualization and as input to classification and clustering algorithms. The framework is formulated as a class of optimization problems that can be solved efficiently with modern convex cone programming software. The power of the method is illustrated in the context of protein clustering based on primary sequence data. An application to the globin family of proteins yielded a readily visualizable 3D sequence space of globins in which several subfamilies and subgroupings consistent with the literature were easily identifiable.
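The relationship between a positive definite kernel, its induced distances, and Euclidean coordinates can be illustrated with a minimal sketch. It assumes a kernel matrix K is already available and does not reproduce the regularized fitting step, which the abstract describes only as a convex cone program. The function names `kernel_to_sq_distances` and `kernel_to_coordinates` and the toy data are hypothetical, chosen for illustration. The sketch uses the standard identity d_ij^2 = K_ii + K_jj - 2 K_ij and an eigendecomposition of K to obtain coordinates.

```python
import numpy as np

def kernel_to_sq_distances(K):
    """Squared distances induced by a kernel: d_ij^2 = K_ii + K_jj - 2*K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def kernel_to_coordinates(K, dim=3):
    """Coordinates from the top `dim` eigenpairs of a symmetric PSD matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]      # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, top] * scale             # one row of coordinates per object

# Toy example (assumption): a rank-3 kernel built from random 3D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T

D2 = kernel_to_sq_distances(K)
Y = kernel_to_coordinates(K, dim=3)

# Squared distances recomputed from the recovered coordinates.
D2_from_Y = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
```

Because the toy kernel has rank 3, the three recovered coordinates reproduce its distances exactly (up to floating-point error). For a fitted kernel of higher rank, truncating to three dimensions would give an approximation suitable for plotting.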
