Patch Relational Neural Gas - Clustering of Huge Dissimilarity Datasets

Clustering is a ubiquitous problem when dealing with huge data sets, whether for data compression, visualization, or preprocessing. Prototype-based neural methods such as neural gas or the self-organizing map offer an intuitive and fast approach which represents data by means of typical representatives and runs in linear time. Recently, an extension of these methods towards relational clustering has been proposed which can handle general non-vectorial data characterized by dissimilarities only, such as alignment scores or general kernels. This extension, relational neural gas, is directly applicable in important domains such as bioinformatics or text clustering. However, it is quadratic in m both in memory and in time (m being the number of data points). Hence, it is infeasible for huge data sets. In this contribution we introduce an approximate patch version of relational neural gas which relies on the same cost function but dramatically reduces time and memory requirements. It offers a single-pass clustering algorithm for huge data sets, running in constant space and linear time only.
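
To illustrate where the quadratic cost comes from, the following is a minimal sketch of plain (non-patch) batch relational neural gas. It assumes the usual relational formulation in which each prototype is an implicit combination w_i = sum_j alpha_ij x_j with coefficients summing to 1, so that d(x_j, w_i) = (D alpha_i)_j - 1/2 alpha_i^T D alpha_i for a symmetric matrix D of squared dissimilarities. The function names, the annealing schedule, and the default parameters are illustrative assumptions and are not taken from the paper; the patch variant itself is not shown.

```python
import numpy as np

def relational_distances(D, alpha):
    """Distances between all data points and all implicit prototypes.

    D     : (m, m) symmetric matrix of squared dissimilarities
    alpha : (k, m) coefficient matrix, each row sums to 1
    Returns a (k, m) array whose entry [i, j] is d(x_j, w_i).
    """
    Da = alpha @ D  # row i holds (D alpha_i)^T because D is symmetric
    self_term = 0.5 * np.einsum("ij,ij->i", Da, alpha)  # 1/2 alpha_i^T D alpha_i
    return Da - self_term[:, None]

def relational_neural_gas(D, k, epochs=50, seed=0):
    """Batch relational neural gas on the full matrix D (hypothetical helper).

    Needs the whole (m, m) matrix in memory and touches it in every epoch,
    which is the quadratic cost the abstract refers to.
    """
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    alpha = rng.random((k, m))
    alpha /= alpha.sum(axis=1, keepdims=True)
    lam0 = k / 2.0  # assumed initial neighborhood range
    for t in range(epochs):
        # Assumed schedule: anneal the neighborhood range from lam0 to 0.01.
        lam = lam0 * (0.01 / lam0) ** (t / max(epochs - 1, 1))
        dist = relational_distances(D, alpha)
        # ranks[i, j] = number of prototypes closer to x_j than prototype i
        ranks = dist.argsort(axis=0).argsort(axis=0)
        h = np.exp(-ranks / lam)
        alpha = h / h.sum(axis=1, keepdims=True)  # keep rows summing to 1
    return alpha
```

For Euclidean data, `relational_distances` reproduces the squared distances to the explicit prototypes `alpha @ X`, which gives a simple sanity check of the formula. The `alpha @ D` product is the step that a patch scheme would have to avoid, since it requires all m × m dissimilarities at once.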