Extracting meaningful labels for WEBSOM text archives

Self-Organizing Maps are used mainly with data that are not pre-labeled, so they need automatic procedures for extracting keywords as labels for each map unit. The WEBSOM methodology for building very large text archives extracts such unit labels with a very slow method: it computes the relative frequencies of all the words in all the documents associated with each unit and then compares these to the relative frequencies of all the words in all the other units of the map. Since maps may have more than 100,000 units and the archive may contain up to 7 million documents, the existing WEBSOM method is not practical at that scale. This paper describes how meaningful labels for each map unit can be deduced by analyzing the relative weight distribution of the SOM weight vectors and by taking advantage of some characteristics of the random projection method used for dimensionality reduction. The effectiveness of this technique is demonstrated on archives of the well-studied Reuters and CNN collections. Comparisons with the WEBSOM method are provided.
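To make the cost of the frequency-comparison approach concrete, the following is a minimal Python sketch of that kind of labeling. It is an illustration under stated assumptions, not the WEBSOM implementation: the function names, the `unit_docs` structure, the simple ratio score, and the toy documents are all hypothetical. Each unit's word frequencies are compared against the pooled frequencies of every other unit, which is the step that becomes expensive for large maps.

```python
from collections import Counter


def relative_frequencies(docs):
    """Relative frequency of each word over a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def label_units(unit_docs, top_k=2, eps=1e-9):
    """Pick the top_k words per unit by an assumed inside/outside ratio score.

    unit_docs maps a unit id to the list of tokenized documents assigned to it.
    """
    labels = {}
    for unit, docs in unit_docs.items():
        inside = relative_frequencies(docs)
        # Pool the documents of every other unit; repeating this for each
        # unit is what makes the approach slow on large maps.
        others = [d for u, ds in unit_docs.items() if u != unit for d in ds]
        outside = relative_frequencies(others)
        # Assumed score: frequency inside the unit relative to outside it.
        scores = {w: f / (outside.get(w, 0.0) + eps) for w, f in inside.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        labels[unit] = ranked[:top_k]
    return labels


# Hypothetical toy data: two map units with two short documents each.
toy = {
    0: [["oil", "price", "market"], ["oil", "export"]],
    1: [["election", "vote", "market"], ["vote", "poll"]],
}
labels = label_units(toy, top_k=1)
print(labels)
```

In this sketch the work per unit grows with the size of the whole archive, so the total cost grows roughly with the number of units times the archive size, which is consistent with the scaling concern raised above.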
