An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification

The use of semantics in information retrieval tasks has become a vast field of research in recent years. In supervised text classification, the main focus of this work, semantics can be involved at different steps of text processing: the indexing step, the training step, and the class prediction step. At the class prediction step, new text-to-text semantic similarity measures can replace the classical similarity measures that some classification methods traditionally use for decision-making. In this paper we propose a new TF/IDF-based measure for assessing semantic similarity between texts; it uses a new function that aggregates the pairwise semantic similarities between the concepts representing the two compared documents. Experimental results demonstrate that our measure outperforms other semantic and classical measures with significant improvements.
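
To make the general idea concrete, the following is a minimal sketch of one way TF/IDF weights can be combined with pairwise concept similarities. It is not the aggregation function proposed in this paper, which the abstract does not specify; it uses the well-known soft-cosine form as a stand-in. The names `tfidf_weights`, `aggregate`, `semantic_similarity`, `toy_concept_sim`, and the `TOY_SIM` table are all hypothetical, and the smoothed IDF formula is an assumption chosen for illustration.

```python
import math
from collections import Counter


def tfidf_weights(doc, corpus):
    """Return {concept: tf * idf} for one document (a list of concepts).

    Assumption: a smoothed IDF, ln((1 + N) / (1 + df)) + 1, so that
    weights stay positive even for concepts present in every document.
    """
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for concept, count in tf.items():
        df = sum(1 for d in corpus if concept in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[concept] = (count / len(doc)) * idf
    return weights


def aggregate(w_a, w_b, concept_sim):
    """Sum weighted similarities over every pair of concepts (one from each side)."""
    return sum(
        wa * wb * concept_sim(ca, cb)
        for ca, wa in w_a.items()
        for cb, wb in w_b.items()
    )


def semantic_similarity(doc_a, doc_b, corpus, concept_sim):
    """Soft-cosine-style similarity between two concept lists."""
    w_a = tfidf_weights(doc_a, corpus)
    w_b = tfidf_weights(doc_b, corpus)
    denom = math.sqrt(
        aggregate(w_a, w_a, concept_sim) * aggregate(w_b, w_b, concept_sim)
    )
    if denom == 0:
        return 0.0
    return aggregate(w_a, w_b, concept_sim) / denom


# Hypothetical concept-to-concept similarity table; in practice this would
# come from an ontology-based or corpus-based measure.
TOY_SIM = {frozenset(("heart", "cardiac")): 0.9}


def toy_concept_sim(a, b):
    """Return 1.0 for identical concepts, a table value otherwise, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def exact_match(a, b):
    """Classical baseline: concepts are similar only if identical."""
    return 1.0 if a == b else 0.0
```

With `exact_match` as the concept similarity, the function reduces to the classical cosine over TF/IDF vectors, so two documents with no shared concepts score zero. With `toy_concept_sim`, documents that use related but different concepts (for example `["heart"]` and `["cardiac"]`) receive a non-zero score, which is the kind of behavior a semantic measure is meant to provide at the class prediction step.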
