Large Local Analysis of the Unaligned Genome and Its Application

We describe a novel method for the local analysis of complete genomes. We propose a local distance measure, LODIST, based on the relationship between the longest common words and the shortest absent words of the two genomes being compared. LODIST can perform better than local alignment when the local region is large enough to cover some recombination genes. A distance measure called SILD.k.t, with resolution k and step t, is derived by integrating the LODISTs over whole genomes. We show that the algorithm for computing the LODISTs and SILD.k.t runs in linear time, which is fast enough for genome comparison. We verify the method by recognizing the subtypes of HIV-1 complete genomes and genome segments.
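To make the two ingredients named above concrete, the following minimal sketch computes a longest common word of two sequences and a shortest absent word of one sequence. It is an illustration only: the function names, the DNA alphabet default, and the brute-force strategy are assumptions made here, not the authors' algorithm. It does not implement LODIST or SILD.k.t, whose formulas are not given in this abstract, and it does not run in linear time.

```python
from itertools import product


def longest_common_word(a: str, b: str) -> str:
    """Return one longest substring shared by a and b.

    Hypothetical helper: a simple quadratic dynamic program,
    not the linear-time method the abstract refers to.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common word ending at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def shortest_absent_word(seq: str, alphabet: str = "ACGT") -> str:
    """Return the first shortest word over `alphabet` that is not in seq.

    Hypothetical helper: brute-force enumeration by increasing length k.
    """
    k = 1
    while True:
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if word not in present:
                return word
        k += 1


if __name__ == "__main__":
    x, y = "ACGTACGT", "TTACGTTT"
    print(longest_common_word(x, y))
    print(shortest_absent_word(x))
```

Intuitively, closely related sequences share long common words, while the absent words of a sequence characterize what it lacks. How LODIST combines the two quantities is defined in the full paper, not here.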
