A quickly trainable hybrid SOM-based document organization system

The large volume of today's document collections has increased the need for document organization systems that can be trained quickly. This paper presents and evaluates a hybrid system for the self-organization of massive document collections based on the self-organizing map (SOM). The hybrid system uses prototypes generated by a clustering algorithm to train the document maps, thus reducing the training time of large maps. We test the system with the k-means and modified leader clustering algorithms. The experiments are carried out with the Reuters-21578 v1.0 and 20 Newsgroups collections. The performance of the system is measured in terms of text categorization effectiveness on the test set and in terms of training time. Experimental results show that the proposed system generates effective document maps in less time than SOM. However, the hybrid system using k-means generates better document maps than the one using modified leader, at the cost of a longer training time.
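
The idea of training the map on cluster prototypes rather than on every document can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `train_som`, `hybrid_som`), the Euclidean distance, the Gaussian neighborhood, and the linear decay schedules are all assumptions chosen for brevity, and the modified leader variant is not shown. Documents are assumed to be already encoded as rows of a numeric matrix (for example, term-weight vectors).

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means (assumed variant); returns k prototype vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def train_som(data, rows, cols, n_epochs=30, lr0=0.5, seed=0):
    """Online SOM training on `data`; returns a (rows*cols, dim) weight matrix."""
    rng = np.random.default_rng(seed)
    # Initialize each map unit with a randomly chosen training vector.
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    # Grid coordinates of each unit, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to 0
            sigma = sigma0 + (0.5 - sigma0) * frac   # radius decays to 0.5
            x = data[i]
            bmu = ((weights - x) ** 2).sum(axis=1).argmin()  # best-matching unit
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))        # Gaussian neighborhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def hybrid_som(X, rows, cols, n_prototypes, seed=0):
    """Cluster the documents first, then train the map on the prototypes only."""
    prototypes = kmeans(X, n_prototypes, seed=seed)
    return train_som(prototypes, rows, cols, seed=seed)

if __name__ == "__main__":
    # Synthetic stand-in for document vectors: 200 documents, 10 features.
    X = np.random.default_rng(1).random((200, 10))
    W = hybrid_som(X, rows=4, cols=4, n_prototypes=20)
    print(W.shape)
```

In this sketch the SOM loop visits 20 prototypes per epoch instead of 200 documents, which is where the reduction in map training time would come from; the clustering step adds its own cost, so the overall saving depends on the clustering algorithm chosen.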
