On ontology-driven document clustering using core semantic features

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. In this paper we show that an ontology can be used to greatly reduce the number of features needed for document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that restricting the features to nouns alone improves clustering. We then show the importance of polysemous and synonymous nouns in clustering and develop an approach for measuring the information gain from disambiguating these nouns in an unsupervised learning setting. In doing so, we identify a core subset of semantic features that represents a text corpus. Empirical results show that clustering on these core semantic features reduces the number of features by 90% or more while still producing clusters that capture the main themes of the corpus.
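As a rough illustration of the feature-selection idea, the sketch below keeps only nouns that are polysemous (more than one sense) or synonymous (sharing a sense with another noun in the vocabulary). It is a minimal sketch under stated assumptions, not the method of the paper: the sense inventory `TOY_SENSES`, the function `core_semantic_nouns`, and the sample vocabulary are hypothetical, an ontology such as WordNet would normally supply the senses, and the information-gain measurement described above is not implemented here.

```python
from collections import defaultdict

# Hypothetical toy sense inventory: noun -> set of sense identifiers.
# An ontology lookup (for example WordNet) would replace this in practice.
TOY_SENSES = {
    "bank": {"bank.financial", "bank.river"},
    "car": {"car.vehicle"},
    "automobile": {"car.vehicle"},
    "plant": {"plant.factory", "plant.organism"},
    "tuesday": {"tuesday.day"},
    "report": {"report.document"},
}


def core_semantic_nouns(vocabulary, senses):
    """Return (polysemous, synonymous) noun sets found in the vocabulary."""
    # Assumption: a word counts as a noun if the inventory has an entry for it.
    nouns = [word for word in vocabulary if word in senses]

    # Invert the inventory so each sense lists the nouns that express it.
    sense_to_nouns = defaultdict(set)
    for noun in nouns:
        for sense in senses[noun]:
            sense_to_nouns[sense].add(noun)

    # Polysemous: one noun, several senses.
    polysemous = {noun for noun in nouns if len(senses[noun]) > 1}
    # Synonymous: several nouns, one shared sense.
    synonymous = {
        noun
        for group in sense_to_nouns.values()
        if len(group) > 1
        for noun in group
    }
    return polysemous, synonymous


vocabulary = ["bank", "car", "automobile", "plant",
              "tuesday", "report", "quickly", "run"]
polysemous, synonymous = core_semantic_nouns(vocabulary, TOY_SENSES)
core = polysemous | synonymous

# Fraction of the original vocabulary dropped by keeping only core nouns.
reduction = 1 - len(core) / len(vocabulary)
print(sorted(core), f"feature reduction: {reduction:.0%}")
```

On this toy vocabulary the filter keeps four of eight terms. The reduction of 90% or more reported above refers to the paper's own experiments on real corpora, not to this example.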
