Alignment-free sequence comparison: a review

MOTIVATION: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes that contiguity is conserved between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. This review covers the formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations.

RESULTS: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence into segments of fixed word length. The first category is based on the statistics of word frequency, on distances defined in a Cartesian space spanned by the frequency vectors, and on the information content of the frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment.

AVAILABILITY: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available at http://bioinformatics.musc.edu/resources.html
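
To make the first category concrete, the following is a minimal Python sketch of a word-frequency comparison, not the MATLAB code referred to above. The function names, the word length k = 3, the DNA alphabet, and the toy sequences are all assumptions chosen for illustration. Each sequence is mapped to a vector of overlapping-word frequencies, and dissimilarity is taken as the Euclidean distance between the two vectors.

```python
from collections import Counter
from itertools import product
import math


def word_frequencies(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every overlapping k-letter word in seq.

    The vector has one entry per possible word, in lexicographic order
    of the alphabet. Assumes seq contains only letters from alphabet.
    """
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[w] / total for w in words]


def euclidean_distance(f, g):
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


# Hypothetical toy data: b is a with its two segments swapped,
# c is an unrelated low-complexity sequence.
left, right = "ACGTACGTACGT", "GGGGCCCCGGGG"
a = left + right
b = right + left
c = "T" * 24

fa, fb, fc = (word_frequencies(s) for s in (a, b, c))
d_shuffled = euclidean_distance(fa, fb)
d_unrelated = euclidean_distance(fa, fc)
print(d_shuffled, d_unrelated)
```

In this toy case the shuffled pair differs only in the few words that span the junction between the two segments, so its distance is far smaller than the distance to the unrelated sequence. This is the property that makes word statistics tolerant of loss of contiguity.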
