Comparison of Genomic Sequences Clustering Using Normalized Compression Distance and Evolutionary Distance

Genomic sequences are usually compared using evolutionary distance, a procedure that requires aligning the sequences. Aligning long sequences is time-consuming, and the resulting dissimilarity measure is not a metric. The normalized compression distance was recently introduced as a method for calculating the distance between two generic digital objects, and it appears to be a suitable way to compare genomic strings. In this paper, the clusterings and mappings obtained with a SOM (self-organizing map) using the traditional evolutionary distance and the compression distance are compared in order to determine whether the two sets of distances are similar. The first results indicate that the two distances capture different aspects of the genomic sequences, and further investigation is needed to reach a definitive result.
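For readers unfamiliar with the normalized compression distance, the following is a minimal sketch of how it is commonly computed: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input. The choice of zlib as the compressor and the function names `compressed_size` and `ncd` are assumptions made for illustration; the abstract does not state which compressor the paper uses, and a DNA-specific compressor may behave differently.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the strings share most of their information;
    values near 1 suggest they share very little. With a real compressor
    the result is only an approximation and is not exactly symmetric.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    import random

    rng = random.Random(0)  # fixed seed so the toy data is reproducible
    seq_a = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
    seq_c = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

    print("NCD(a, a):", ncd(seq_a, seq_a))
    print("NCD(a, c):", ncd(seq_a, seq_c))
```

No alignment step is involved, which is what makes the measure attractive for long sequences. A pairwise matrix of such values could then be used as input to a clustering method that accepts dissimilarities.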
