Adaptive Concept Resolution for document representation and its applications in text mining

It is well known that synonymous and polysemous terms often introduce noise when the similarity between documents is calculated. Existing ontology-based document representation methods are static: the semantic concepts selected to represent a document have a fixed resolution. They therefore cannot adapt to the characteristics of the document collection or to the text mining problem at hand. We propose an Adaptive Concept Resolution (ACR) model to overcome this problem. ACR learns a concept border from an ontology, taking into consideration the characteristics of the particular document collection. This border then provides a tailor-made semantic concept representation for a document from the same domain. Another advantage of ACR is that it is applicable both to classification tasks, where the groups are given in the training document set, and to clustering tasks, where no group information is available. The experimental results show that ACR outperforms an existing static method in almost all cases. We also present a method for integrating Wikipedia entities into an expert-edited ontology, namely WordNet, to generate an enhanced ontology named WordNet-Plus, and we examine its performance under the ACR model. Owing to its high coverage, WordNet-Plus can outperform WordNet in classification on data sets that contain more recent documents.
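
As a rough illustration of what a concept border does, the sketch below maps the terms of a document onto a chosen set of concepts in a small hypernym tree, so that a coarser or finer border yields a coarser or finer representation. It does not show how ACR learns the border, which the abstract does not detail. The tree `PARENT`, the helper names `to_border` and `represent`, and the two example borders are all hypothetical and are not taken from WordNet or from the paper.

```python
from collections import Counter

# Toy hypernym tree (child -> parent). Hypothetical data, not from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "puppy": "dog",
    "car": "vehicle",
    "truck": "vehicle",
}


def to_border(concept, border):
    """Return the concept itself or its nearest ancestor that lies on the border."""
    node = concept
    while node is not None:
        if node in border:
            return node
        node = PARENT[node]
    # No border concept on the path to the root: keep the concept unchanged.
    return concept


def represent(terms, border):
    """Bag-of-concepts representation of a document under a given border."""
    # Terms that are missing from the ontology are skipped in this sketch.
    return Counter(to_border(t, border) for t in terms if t in PARENT)


doc = ["puppy", "dog", "cat", "car", "truck", "unknown"]

# A coarse border merges related terms into general concepts.
coarse = represent(doc, {"animal", "vehicle"})

# A finer border keeps more specific concepts apart.
fine = represent(doc, {"dog", "cat", "car", "truck"})

print(coarse)
print(fine)
```

Under the coarse border, "puppy", "dog" and "cat" all count towards "animal", which removes the distinction between them; under the finer border, only "puppy" and "dog" are merged. Choosing where the border sits for a given collection is the part that an adaptive method would have to learn from data.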
