Clustering biological data with SOMs: On topology preservation in non-linear dimensional reduction

Dimensional reduction is a widely used technique for exploratory analysis of large volumes of data. In biological datasets, each object is described by a large number of variables (or dimensions), and analyzing them in a lower-dimensional space is crucial for extracting useful information. Kohonen self-organizing maps (SOMs) have recently been proposed in systems biology as a useful tool for exploratory analysis, data integration and discovery of new relationships in *omics datasets. SOMs have traditionally been used for clustering in several data mining problems, mainly because they can preserve the topology of the input data while reducing a high-dimensional input space to a 2-D map. Nevertheless, this dimensional reduction can lead to counterintuitive results. Maps of almost the same size, trained on the same dataset with identical learning algorithms and parameters, may sometimes find different clusters. One would expect, however, that small changes in map size or in other training conditions would not abruptly change the location of any of the grouped patterns. The aim of this work is to analyze and explain this issue through a real case study involving transcriptomic and metabolomic data, since it might have an important impact on the interpretation of clustering results for a biological dataset.
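
The sensitivity to map size described above can be probed with a small experiment. What follows is a minimal sketch under stated assumptions, not the method used in this work: it trains a plain online Kohonen SOM on synthetic data (a stand-in for the transcriptomic and metabolomic case study), treats each map unit as a cluster, and compares the groupings from a 4x4 and a 5x5 map with the Rand index. The agreement measure, the function names (train_som, assign_units, rand_index) and all parameter values are illustrative choices, not taken from the source.

```python
import numpy as np


def train_som(data, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Train a rectangular SOM with the classic online Kohonen update (sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, dim = data.shape
    # Initialise prototypes uniformly inside the bounding box of the data.
    weights = rng.uniform(data.min(axis=0), data.max(axis=0), size=(rows * cols, dim))
    # Grid position of every unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * n_samples
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5   # shrinking neighbourhood radius
            # Best-matching unit: prototype closest to the current pattern.
            bmu = np.argmin(((weights - data[i]) ** 2).sum(axis=1))
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (data[i] - weights)
            step += 1
    return weights


def assign_units(data, weights):
    """Return, for every pattern, the index of its best-matching unit."""
    dist2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


def rand_index(labels_a, labels_b):
    """Fraction of pattern pairs that both clusterings treat the same way."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)
    return float((same_a[upper] == same_b[upper]).mean())


# Synthetic stand-in data: 60 patterns described by 10 variables (assumption).
data = np.random.default_rng(42).normal(size=(60, 10))

# Same data, same algorithm and parameters; only the map size differs.
labels_small = assign_units(data, train_som(data, 4, 4))
labels_large = assign_units(data, train_som(data, 5, 5))

agreement = rand_index(labels_small, labels_large)
print(f"Pairwise agreement between the 4x4 and 5x5 maps: {agreement:.3f}")
```

An agreement value well below 1 would indicate that patterns grouped together on one map were separated on the other. A real analysis would also have to account for the fact that a larger map offers more units and can therefore split groups for that reason alone.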
