LOCAL SEMANTIC KERNELS FOR TEXT DOCUMENT CLUSTERING

Document clustering is a fundamental text mining task that enables efficient organization, navigation, summarization, and retrieval of documents. Clustering documents is challenging because of the sparsity and high dimensionality of text data and the complex semantics of natural language. Subspace clustering extends traditional clustering to capture local feature relevance and to group documents with respect to the features (or words) that matter most. This paper presents a subspace clustering technique based on the Locally Adaptive Clustering (LAC) algorithm. To improve both the subspace clustering of documents and the identification of keywords achieved by LAC, kernel methods and semantic distances are deployed. The basic idea is to define a local kernel for each cluster, through which semantic distances between pairs of words are computed to derive the clustering and the local term weightings. The proposed approach, called Semantic LAC, is evaluated on benchmark datasets. Our experiments show that Semantic LAC can improve clustering quality.
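
The idea of a local kernel for each cluster can be illustrated with a minimal sketch. It is not the paper's implementation: the kernel form K = W P P^T W, the exponential weight update, and the names `local_semantic_distance`, `update_weights`, `proximity`, and `h` are all assumptions made here for illustration. `proximity` stands for a term-by-term semantic similarity matrix and `weights` for one cluster's local term weights.

```python
import numpy as np


def local_semantic_distance(x, centroid, weights, proximity):
    """Distance between a document and a centroid under one cluster's kernel.

    Assumed form: d^2 = (x - c) W P P^T W (x - c)^T, where W = diag(weights)
    holds the cluster's local term weights and P is a term-term proximity matrix.
    """
    diff = (x - centroid) * weights   # (x - c) W: apply local term weights
    projected = diff @ proximity      # (x - c) W P: spread mass to related terms
    return float(np.sqrt(projected @ projected))


def update_weights(points, centroid, h=1.0):
    """Assumed exponential weighting: terms with low dispersion get more weight.

    h is a hypothetical bandwidth parameter controlling how peaked the weights are.
    """
    dispersion = ((points - centroid) ** 2).mean(axis=0)
    # Subtracting the minimum keeps the exponentials numerically stable.
    scores = np.exp(-(dispersion - dispersion.min()) / h)
    return scores / scores.sum()


# Toy vocabulary of three terms; terms 0 and 1 are treated as semantically related.
proximity = np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
doc_a = np.array([1.0, 0.0, 0.0])
doc_b = np.array([0.0, 1.0, 0.0])
uniform = np.ones(3)

plain = local_semantic_distance(doc_a, doc_b, uniform, np.eye(3))
semantic = local_semantic_distance(doc_a, doc_b, uniform, proximity)
```

In this toy setting the two documents share no terms, yet the semantic distance is smaller than the plain Euclidean one because their terms are related through `proximity`. With an identity proximity matrix and uniform weights the function reduces to the ordinary Euclidean distance.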
