Sequence analysis Application of compression-based distance measures to protein sequence classification : a methodological study

MOTIVATION Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS We constructed compression-based distance measures (CBMs) using the Lempel-Zlv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, a 3-phosphoglycerate-kinase sequences selected from archaean, bacterial and eukaryotic species as well as low and high-complexity sequence segments of the human proteome, CBMs values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.

[1]  Jodie J. Yin,et al.  A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes , 2004, Genome Biology.

[2]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[3]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[4]  Li Liao,et al.  Combining pairwise sequence similarity and support vector machines for remote protein homology detection , 2002, RECOMB '02.

[5]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[6]  William Stafiord Noble,et al.  Support vector machine applications in computational biology , 2004 .

[7]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[8]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[9]  Ute St. Clair,et al.  Fuzzy Set Theory: Foundations and Applications , 1997 .

[10]  Bernhard Schölkopf,et al.  Support vector learning , 1997 .

[11]  Dustin Boswell,et al.  Introduction to Support Vector Machines , 2002 .

[12]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[13]  Li Liao,et al.  Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships , 2003, J. Comput. Biol..

[14]  I. Longden,et al.  EMBOSS: the European Molecular Biology Open Software Suite. , 2000, Trends in genetics : TIG.

[15]  David Haussler,et al.  Using the Fisher Kernel Method to Detect Remote Protein Homologies , 1999, ISMB.

[16]  Khalid Sayood,et al.  A new sequence distance measure for phylogenetic tree construction , 2003, Bioinform..

[17]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[18]  Sándor Pongor,et al.  The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines , 2004, Nucleic Acids Res..

[19]  Paul M. B. Vitányi,et al.  Clustering by compression , 2003, IEEE Transactions on Information Theory.

[20]  F T de Dombal,et al.  Use of receiver operating characteristic (ROC) curves to evaluate computer confidence threshold and clinical performance in the diagnosis appendicitis. , 1978, Methods of information in medicine.

[21]  David G. Stork,et al.  Pattern Classification , 1973 .

[22]  Tatsuya Akutsu,et al.  Image Compression-based Approach to Measuring the Similarity of Protein Structures , 2008, APBC.

[23]  Tim J. P. Hubbard,et al.  SCOP database in 2004: refinements integrate structure and sequence family data , 2004, Nucleic Acids Res..

[24]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[25]  Alessandro Bogliolo,et al.  Using sequence compression to speedup probabilistic profile matching , 2005, Bioinform..

[26]  Shmuel Pietrokovski,et al.  Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations , 1999, Bioinform..

[27]  Pavel Pudil,et al.  Introduction to Statistical Pattern Recognition , 2006 .

[28]  Qianqiu Li,et al.  Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses. , 2005, Molecular phylogenetics and evolution.

[29]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[30]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[31]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[32]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[33]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[34]  Jonas S. Almeida,et al.  Comparative evaluation of word composition distances for the recognition of SCOP relationships , 2004, Bioinform..

[35]  Zheng Rong Yang,et al.  Biological applications of support vector machines , 2004, Briefings Bioinform..

[36]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[37]  Natalio Krasnogor,et al.  Measuring the similarity of protein structures by means of the universal similarity metric , 2004, Bioinform..

[38]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[39]  John C. Wootton,et al.  Non-globular Domains in Protein Sequences: Automated Segmentation Using Complexity Measures , 1994, Comput. Chem..

[40]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[41]  Bernhard Schölkopf,et al.  Support Vector Machine Applications in Computational Biology , 2004 .