Alignment-free comparison of genome sequences by a new numerical characterization.

To compare different genome sequences, an alignment-free method is proposed. First, we present a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on this representation is introduced to quantitatively depict the intrinsic nature of genome sequences; it is treated as a 10-dimensional vector. Alignment-free comparison of sequences is performed by computing the distances between the vectors of the corresponding numerical characterizations, and these distances define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance of the method on sequence comparison. The results illustrate the validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
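
The distance step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not say which distance is used or how the 10 components are computed, so the Euclidean distance, the function names `euclidean` and `distance_matrix`, and the placeholder vectors are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors.

    Assumption: the abstract only says "distances between vectors";
    Euclidean distance is used here as one common choice.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distances for a dict mapping sequence name -> feature vector."""
    names = list(vectors)
    return {(a, b): euclidean(vectors[a], vectors[b]) for a in names for b in names}


# Hypothetical 10-dimensional characterizations. The values are placeholders
# and are not derived from any real genome sequence.
vectors = {
    "seq_A": [0.0] * 10,
    "seq_B": [1.0] * 10,
    "seq_C": [0.5] * 10,
}

matrix = distance_matrix(vectors)
for (a, b), d in sorted(matrix.items()):
    if a < b:
        print(f"{a} vs {b}: {d:.4f}")
```

A matrix of this kind is the usual input to a distance-based tree-building procedure such as UPGMA or neighbor joining, which is one way pairwise distances can be turned into a picture of evolutionary relationships.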
