Chinese Document Clustering Using Self-Organizing Map-Based on Botanical Document Warehouse

The exponential growth of information has produced an overload that makes searching for information difficult. An efficient method for organizing information and assisting users’ navigation is therefore particularly important. In this paper, we apply the Self-Organizing Map (SOM) algorithm to cluster Chinese botanical documents onto a two-dimensional map. Each botanical document is treated as a bag of words and converted into plain text. We use term frequency and inverse term frequency to extract key terms from the documents as the input to the SOM. In total, 892 Chinese botanical documents were projected onto a 2D map to assist users’ navigation. In our experimental results, recall ranged from 0.71 (Polygonaceae documents) to 0.94 (Amaranthaceae documents). Precision ranged from 0.81 (Umbelliferae documents) to 1.00 (Convolvulaceae and Cruciferae documents).
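
The pipeline described above (bag-of-words documents, term weighting, projection onto a 2D SOM) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the documents have already been segmented into tokens, it uses the common TF-IDF weighting as a stand-in for the weighting named in the abstract, and the function names (`tfidf_vectors`, `train_som`, `best_matching_unit`), the grid size, and the learning-rate and neighborhood schedules are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF weight vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


def best_matching_unit(codebook, x):
    """Return the (row, col) of the map node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(codebook):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=100, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; all hyperparameters here are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(data[0])
    codebook = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
                for _ in range(rows)]
    for epoch in range(epochs):
        # Linearly shrink the learning rate and the neighborhood radius.
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        sigma = sigma0 * decay + 1e-3
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            br, bc = best_matching_unit(codebook, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood around the winning node on the grid.
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_dist2 / (2 * sigma ** 2))
                    w = codebook[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return codebook


# Toy, made-up token lists standing in for segmented botanical documents.
docs = [
    ["leaf", "stem", "flower"],
    ["leaf", "stem", "flower"],
    ["leaf", "root", "seed"],
]
vocab, vectors = tfidf_vectors(docs)
som = train_som(vectors, rows=2, cols=2)
positions = [best_matching_unit(som, v) for v in vectors]
```

Each entry of `positions` is the map cell a document lands on, so documents with similar key terms tend to fall on the same or neighboring cells. A term that occurs in every document (here `leaf`) receives zero weight under this weighting and does not influence the map.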
