Pattern recognition and probabilistic measures in alignment-free sequence analysis

With the massive production of genomic and proteomic data, the number of biological sequences available in databases has reached a level at which exact alignment is no longer feasible, even when only a fraction of the sequences is used. To circumvent this prohibitive time complexity, ultrafast alignment-free methods are being studied. Within the past two decades, a broad variety of alignment-free methods have been proposed, including dissimilarity measures on classical representations of sequences such as k-words or Markov models. Furthermore, articles have been published that describe distance measures on alternative representations such as compression complexity, spectral time series, or chaos game representation. However, alignments are still the standard method for real-world applications in biological sequence analysis, and the time-efficient alignment-free approaches are usually applied only when the conventional algorithms fail or prove too cumbersome.
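To make the idea of a dissimilarity measure on k-words concrete, the following is a minimal sketch, not a method taken from any particular cited work. It assumes overlapping k-words, relative frequencies, and a plain Euclidean distance; the function names `kword_frequencies` and `kword_distance` and the default `k=3` are illustrative choices.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq, k):
    """Return the relative frequency of every overlapping k-word in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def kword_distance(a, b, k=3):
    """Euclidean distance between the k-word frequency vectors of a and b.

    Illustrative assumption: words absent from one sequence get frequency 0.
    """
    fa = kword_frequencies(a, k)
    fb = kword_frequencies(b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    print(kword_distance("ACGTACGTAC", "ACGTTCGTAC", k=2))
```

Each sequence is scanned once and no alignment matrix is built, so the cost grows linearly with sequence length rather than with the product of the two lengths, which is the efficiency argument made above.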
