Paper information - On the Role of Local Matching for Efficient Semi-supervised Protein Sequence Classification

Recent studies in protein sequence analysis have leveraged the power of unlabeled data. For example, the profile and mismatch neighborhood kernels have shown significant improvements over classifiers estimated under the fully supervised setting. In this study, we present a principled and biologically motivated framework that exploits the unlabeled data more effectively by using only those regions that are more likely to be biologically relevant, which improves prediction accuracy. Because over-represented sequences in large uncurated databases may bias kernel estimates that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers. Combined with a computationally efficient sparse family of string kernels, our proposed framework achieves state-of-the-art accuracy in semi-supervised protein remote homology detection on three large unlabeled databases.

Vladimir Pavlovic | Pavel P. Kuksa | Pai-Hsi Huang
