Soft Topographic Maps for Clustering and Classifying Bacteria Using Housekeeping Genes

The Self-Organizing Map (SOM) algorithm is widely used for building topographic maps of data represented in a vector space, but it cannot operate on dissimilarity data. The Soft Topographic Map (STM) algorithm extends SOM to arbitrary distance measures; it builds a map from a set of units, organized in a rectangular lattice, that defines neighbourhood relationships among the data. In recent years, a new standard for identifying bacteria from genotypic information has begun to emerge. In this approach, phylogenetic relationships among bacteria can be determined by comparing a stable part of the bacterial genetic code, the so-called "housekeeping genes." The goal of this work is to build a topographic representation of bacterial clusters by means of self-organizing maps, starting from genotypic features of housekeeping genes.
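
As a rough illustration of how a map can be trained from pairwise dissimilarities alone, the following Python sketch uses a mean-field, deterministic-annealing update of the kind usually associated with soft topographic mapping of proximity data. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names, the Gaussian lattice neighbourhood, the annealing schedule, and all parameter defaults are illustrative choices, and the input is assumed to be a symmetric dissimilarity matrix such as pairwise evolutionary distances between gene sequences.

```python
import numpy as np


def lattice_neighbourhood(rows, cols, sigma=1.0):
    """Row-normalised Gaussian neighbourhood between units of a rows x cols lattice."""
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return h / h.sum(axis=1, keepdims=True)


def soft_topographic_map(D, rows, cols, sigma=1.0, beta0=0.1, beta_max=50.0,
                         growth=1.2, inner_iters=20, seed=0):
    """Soft assignment of n items to rows*cols lattice units from dissimilarities only.

    D is an (n, n) symmetric dissimilarity matrix. Returns P of shape
    (n, rows*cols), where P[i, r] is the assignment probability of item i
    to unit r. All defaults are illustrative assumptions.
    """
    D = np.asarray(D, dtype=float)
    n, m = D.shape[0], rows * cols
    h = lattice_neighbourhood(rows, cols, sigma)

    # Random soft assignments; each row is a probability distribution over units.
    rng = np.random.default_rng(seed)
    P = rng.random((n, m))
    P /= P.sum(axis=1, keepdims=True)

    beta = beta0  # inverse temperature, increased gradually (annealing)
    while beta <= beta_max:
        for _ in range(inner_iters):
            # Neighbourhood-smoothed assignments, normalised per unit.
            B = P @ h
            B /= B.sum(axis=0, keepdims=True)
            # Average dissimilarity of each item to each unit's soft members,
            # minus half the unit's internal average dissimilarity.
            DB = D @ B
            self_term = 0.5 * np.einsum("js,js->s", B, DB)
            partial = DB - self_term
            # Spread each unit's cost over its lattice neighbours.
            E = partial @ h.T
            # Softmax over units, shifted per row for numerical stability.
            E -= E.min(axis=1, keepdims=True)
            P = np.exp(-beta * E)
            P /= P.sum(axis=1, keepdims=True)
        beta *= growth
    return P
```

Calling `soft_topographic_map(D, rows, cols)` on a precomputed dissimilarity matrix returns the soft assignments, and `P.argmax(axis=1)` gives the best-matching lattice unit for each item, which can then be plotted on the grid to inspect cluster structure.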
