Local Decoding of Sequences and Alignment-Free Comparison

Subword composition plays an important role in many sequence analyses. Here we define and study the "local decoding of order N" of sequences, an alternative that avoids some drawbacks of "subwords of length N" approaches while keeping information about environments of length N in the sequences ("decoding" is taken here in the sense of hidden Markov modeling, i.e., associating a state with every position of the sequence). We present an algorithm for computing the local decoding of order N of a given set of sequences. Its complexity is linear in the total length of the set, whatever the order N, in both time and memory. To illustrate a use of local decoding, we propose a very basic dissimilarity measure between sequences that can be computed both from the local decoding of order N and from the composition in subwords of length N. The accuracies of these two dissimilarities are evaluated over several datasets by computing their linear correlations with a reference alignment-based distance. These accuracies are also compared with the accuracy obtained from another recent alignment-free comparison method.
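As a rough illustration of the subword-composition side of this comparison, the Python sketch below counts overlapping subwords of length N, derives a simple dissimilarity from them, and computes a linear correlation against reference distances. The abstract does not specify the dissimilarity formula or the correlation procedure, so the Euclidean distance between relative frequencies and the plain Pearson coefficient used here are assumptions, and the function names `subword_counts`, `composition_dissimilarity`, and `pearson` are hypothetical. The sketch does not implement the local decoding algorithm itself.

```python
from collections import Counter
from math import sqrt


def subword_counts(seq: str, n: int) -> Counter:
    """Count every overlapping subword of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def composition_dissimilarity(a: str, b: str, n: int) -> float:
    """Euclidean distance between relative subword frequencies.

    Assumption: this formula is an illustrative choice, not necessarily
    the measure proposed in the paper.
    """
    ca, cb = subword_counts(a, n), subword_counts(b, n)
    ta, tb = sum(ca.values()), sum(cb.values())
    if ta == 0 or tb == 0:
        raise ValueError("both sequences must contain at least n symbols")
    words = set(ca) | set(cb)
    # A Counter returns 0 for subwords that do not occur in a sequence.
    return sqrt(sum((ca[w] / ta - cb[w] / tb) ** 2 for w in words))


def pearson(xs, ys) -> float:
    """Pearson linear correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

In use, one would compute `composition_dissimilarity` for every pair of sequences in a dataset and pass those values, together with the matching alignment-based distances, to `pearson`. A correlation close to 1 would suggest that the alignment-free measure tracks the reference distance well.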
