A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation

The rapid proliferation of textual and multimedia online databases, digital libraries, Internet servers, and intranet services has turned researchers' and practitioners' dream of creating an information-rich society into a nightmare of information glut. Many researchers believe that turning such a glut into a useful digital library requires automated techniques for organizing and categorizing large-scale information. This paper presents research in which we sought to develop a scalable textual classification and categorization system based on Kohonen's self-organizing feature map (SOM) algorithm. We show how self-organization can be used for automatic thesaurus generation. Our proposed data structure and algorithm take advantage of the sparsity of coordinates in the document input vectors and reduce the SOM computational complexity by several orders of magnitude. The proposed Scalable SOM (SSOM) algorithm makes large-scale textual categorization tasks feasible. The algorithmic intuition and the mathematical foundation of our research are presented in detail. We also describe three benchmarking experiments that examine the algorithm's performance at various scales: classification of electronic meeting comments, Internet homepages, and the Compendex collection.
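
To illustrate how sparsity in the document input vectors can reduce the cost of SOM computation, the following minimal Python sketch performs only the winner (best-matching node) search. It is not the SSOM algorithm itself, whose data structure and update rule are given in the paper. It assumes the standard expansion of the squared Euclidean distance, ||x - w||^2 = ||w||^2 - 2 x.w + ||x||^2, so that the dot product touches only the nonzero coordinates of x. The names `SparseVec`, `squared_norm`, and `best_matching_unit` are hypothetical and chosen for this example.

```python
from typing import Dict, List

# Hypothetical sparse representation: coordinate index -> nonzero value.
SparseVec = Dict[int, float]


def squared_norm(w: List[float]) -> float:
    """Squared Euclidean norm of a dense weight vector."""
    return sum(v * v for v in w)


def best_matching_unit(x: SparseVec,
                       weights: List[List[float]],
                       norms: List[float]) -> int:
    """Return the index of the node whose weight vector is closest to x.

    `norms[j]` is assumed to hold the cached squared norm of `weights[j]`.
    Per node, the loop visits only the nonzero coordinates of x.
    """
    x_norm = sum(v * v for v in x.values())
    best, best_dist = -1, float("inf")
    for j, w in enumerate(weights):
        dot = sum(v * w[i] for i, v in x.items())
        dist = norms[j] - 2.0 * dot + x_norm  # equals ||x - w||^2
        if dist < best_dist:
            best, best_dist = j, dist
    return best


# Three map nodes over a four-term vocabulary (illustrative values only).
weights = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
]
norms = [squared_norm(w) for w in weights]

# A document containing only terms 0 and 3.
doc = {0: 1.0, 3: 1.0}
winner = best_matching_unit(doc, weights, norms)
```

In this sketch the cost of the search per node depends on the number of nonzero terms in the document rather than on the vocabulary size. The cached norms would have to be refreshed whenever the weights are updated; the weight-update step is not shown here.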
