Ontology-Based Mapping for Automated Document Management: A Concept-Based Technique for Word Mismatch and Ambiguity Problems in Document Clustering

Document clustering is crucial to automated document management, especially given the fast-growing volume of textual documents available digitally. Traditional lexicon-based approaches analyze document content and measure the overlap of the feature vectors representing different documents, so they cannot effectively address word mismatch or ambiguity problems. Alternative approaches based on query expansion and local context discovery have been developed but suffer from limited efficiency and effectiveness, because the large number of expanded terms creates noise and increases the dimensionality and complexity of the overall feature space. Several techniques extend lexicon-based analysis by incorporating latent semantic indexing, but they produce less comprehensible clustering results and show questionable performance. We instead propose a concept-based document representation and clustering (CDRC) technique and empirically examine its effectiveness using 433 articles concerning information systems and technology, randomly selected from a popular digital library. Our evaluation includes two widely used benchmark techniques and shows that CDRC outperforms both. Overall, our results reveal that clustering documents at an ontology-based conceptual level is more effective than clustering on lexicon-based document features and can generate more comprehensible clustering results.
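
The word mismatch problem described above can be illustrated with a minimal sketch. It is not the CDRC technique itself: the term-to-concept table, the function names, and the two sample documents are hypothetical, chosen only to show how two documents that share little vocabulary can look dissimilar as term vectors yet similar once terms are mapped to shared concepts. The sketch does not attempt to handle ambiguity, since each term maps to exactly one concept.

```python
import math
from collections import Counter

# Hypothetical toy "ontology": each term maps to one concept label.
# A real ontology would be far richer; this table is an assumption.
TERM_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "engine": "MOTOR",
    "motor": "MOTOR",
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def term_vector(text):
    """Lexicon-based representation: raw term counts."""
    return Counter(text.lower().split())


def concept_vector(text):
    """Concept-based representation: terms replaced by their concept
    label when one is known, otherwise kept as-is."""
    return Counter(TERM_TO_CONCEPT.get(t, t) for t in text.lower().split())


doc_a = "car engine repair"
doc_b = "automobile motor repair"

# Only "repair" overlaps at the term level.
lexical_sim = cosine(term_vector(doc_a), term_vector(doc_b))
# All three features overlap once synonyms share a concept.
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print(f"lexical similarity: {lexical_sim:.2f}")
print(f"concept similarity: {concept_sim:.2f}")
```

In this toy case the lexical similarity is about one third, because only one of three terms matches, while the concept-level similarity is about one, because every term in one document has a conceptual counterpart in the other.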

[29]  Paul Jen-Hwa Hu, et al.  Ontology-Based Mapping for Automated Document Management, 2015.
