DNA sequence comparison by a novel probabilistic method

This paper proposes a novel method for comparing DNA sequences. Using a graphical representation, we construct a probability distribution for each DNA sequence. The similarity between two sequences is then measured by the symmetrised Kullback-Leibler divergence between their distributions. After presenting the method, we test it on six DNA sequences taken from the threonine operons of Escherichia coli K-12 and Shigella flexneri. We then apply it to mitochondrial DNA data to study the evolution of primates and reconstruct a phylogenetic tree for them. In addition, we use the technique to analyze the classification and phylogeny of the Tomato Yellow Leaf Curl Virus (TYLCV) based on its whole genome sequences. These examples show that our approach handles large volumes of DNA sequences more easily and more quickly than existing multiple alignment methods. Moreover, unlike other approaches, our method can be applied automatically and requires no human intervention.
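As a rough illustration of the comparison step, the sketch below computes a symmetrised Kullback-Leibler divergence between two sequences. The abstract does not specify how the graphical representation yields a distribution, so smoothed k-mer frequencies are used here purely as a stand-in; the function names, the choice k = 2, and the pseudocount are all assumptions made for this example. The symmetrised form is taken as the sum D(P||Q) + D(Q||P); some authors use half of that sum.

```python
import math
from collections import Counter
from itertools import product


def kmer_distribution(seq, k=2, pseudocount=1.0):
    """Smoothed k-mer frequency distribution over the alphabet ACGT.

    Stand-in for the paper's graphical-representation-based distribution
    (an assumption; the abstract does not give the construction).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # The pseudocount keeps every probability positive so the logs are defined.
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return [(counts[m] + pseudocount) / total for m in kmers]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrised_kl(p, q):
    """Symmetrised divergence, here defined as D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    a = kmer_distribution("ACGTACGTTGCA")
    b = kmer_distribution("ACGTTCGTAGCA")
    print(symmetrised_kl(a, b))
```

Computing this value for every pair of sequences gives a symmetric distance matrix, which is the kind of input a tree-building method such as neighbor-joining expects.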
