Tree view self-organisation of web content

When browsing a large set of unstructured documents, it is advantageous if the documents have been organised and presented in a way that makes navigation efficient, the underlying concepts easy to understand and related information quick to locate. This paper proposes a new method, termed Treeview self-organising maps (Treeview SOMs), for clustering and organising text documents by means of a series of independently and automatically created, hierarchical one-dimensional SOMs. For presentation and visualisation, the method generates a topological taxonomy tree for a set of unstructured text documents. The documents are organised in a hierarchy of dynamically generated and automatically validated topics extracted from the document corpus. The results, presented in a labelled tree view, clearly show the underlying contents of the documents and can help users browse the document set more efficiently than the results of previous work using SOMs or hierarchical clustering methods. A brief overview of general document clustering and a review of SOM-based document analysis methods are also provided, together with a comparison among them.
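The idea of building a tree from a series of one-dimensional SOMs can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`train_1d_som`, `best_node`, `build_tree`, `leaves`), the Gaussian neighbourhood, the linear decay schedules and the stopping rule (`max_depth`, `min_size`) are all assumptions made for illustration, and the topic labelling and validation steps mentioned in the abstract are omitted. Documents are assumed to be already encoded as dense term-weight vectors.

```python
import math
import random


def best_node(v, weights):
    """Index of the SOM node whose weight vector is closest to v."""
    return min(range(len(weights)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(v, weights[j])))


def train_1d_som(vectors, n_nodes=3, epochs=50, seed=0):
    """Train a one-dimensional SOM (a chain of nodes) on dense vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumption: initialise each node from a randomly chosen input vector.
    weights = [list(rng.choice(vectors)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                           # decaying learning rate
        radius = max(n_nodes / 2.0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        for v in rng.sample(vectors, len(vectors)):       # shuffled presentation order
            bmu = best_node(v, weights)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood measured along the 1-D chain.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights


def build_tree(doc_ids, vectors, max_depth=2, min_size=3, n_nodes=3, seed=0):
    """Recursively split documents with independently trained 1-D SOMs."""
    node = {"docs": list(doc_ids), "children": []}
    if max_depth == 0 or len(doc_ids) < min_size:
        return node
    weights = train_1d_som(vectors, n_nodes=n_nodes, seed=seed)
    groups = {}
    for doc_id, v in zip(doc_ids, vectors):
        groups.setdefault(best_node(v, weights), []).append((doc_id, v))
    if len(groups) < 2:
        return node  # the SOM did not separate the documents; keep as a leaf
    for j in sorted(groups):  # keep the chain order of the SOM nodes
        ids = [d for d, _ in groups[j]]
        vecs = [v for _, v in groups[j]]
        node["children"].append(
            build_tree(ids, vecs, max_depth - 1, min_size, n_nodes, seed + 1))
    return node


def leaves(node):
    """Document groups at the leaves of the tree, in left-to-right order."""
    if not node["children"]:
        return [node["docs"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

Calling `build_tree(ids, vectors)` returns nested dictionaries whose `children` follow the order of the SOM chain, so neighbouring branches tend to hold related documents. A real system would also need to choose the number of nodes per level and to label each branch, which this sketch leaves out.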
