Local Similarities and Clustering of Biological Sequences: New Insights from N-local Decoding

The search for local similarities in sequences is a classical problem in biology, and several methods have been developed for this purpose. Here we investigate the N-local decoding method due to Gilles Didier ([2]) in order to classify sequences according to the local similarity segments of fixed length N that they share. Each site of a sequence is originally occupied by a nucleotide or an amino acid; after N-local decoding has been applied, it is occupied by a new symbol (which we call a GD-class) that classifies the site according to the composition of its environment in words of length N. This method has already been used successfully to construct trees for the subtyping of HIV/SIV variants ([3]). After recalling the definitions and the original N-local decoding method, we present new developments that aim, on the one hand, to exploit the information generated by the decoding and, on the other hand, to address the influence of the free parameter N.