Paper Information - An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification

In recent years, the use of semantics in information retrieval tasks has become a vast field of research. In supervised text classification, which is the main interest of this work, semantics can be involved at different steps of text processing: the indexing step, the training step, and the class prediction step. At the class prediction step, new text-to-text semantic similarity measures can replace the classical similarity measures that some classification methods traditionally use for decision-making. In this paper we propose a new TF/IDF-based measure for assessing semantic similarity between texts, with a new function that aggregates the pairwise semantic similarities between the concepts representing the compared text documents. Experimental results demonstrate that our measure outperforms other semantic and classical measures with significant improvements.
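The abstract does not give the aggregation function itself, so the following is only a minimal sketch of the general idea: two documents represented as concepts with TF/IDF weights, and a score built from all pairwise concept similarities. It uses a soft-cosine-style aggregation as a stand-in, not the function proposed in the paper. The names `semantic_similarity` and `toy_concept_sim`, the toy concepts, and the 0.8 relatedness value are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def toy_concept_sim(c1, c2):
    """Hypothetical concept-to-concept similarity in [0, 1].

    A real system would derive this from an ontology or thesaurus;
    the single related pair below is invented for the example.
    """
    if c1 == c2:
        return 1.0
    related = {frozenset(("heart", "cardiac")): 0.8}
    return related.get(frozenset((c1, c2)), 0.0)


def semantic_similarity(doc_a, doc_b, concept_sim):
    """Aggregate pairwise concept similarities, weighted by TF/IDF.

    doc_a, doc_b: dicts mapping concept -> TF/IDF weight.
    concept_sim:  function (concept, concept) -> similarity in [0, 1].

    Assumption: soft-cosine-style normalization, chosen for illustration.
    """
    def weighted_sum(x, y):
        # Sum over every concept pair of weight * weight * similarity.
        return sum(
            wx * wy * concept_sim(cx, cy)
            for (cx, wx), (cy, wy) in product(x.items(), y.items())
        )

    norm = sqrt(weighted_sum(doc_a, doc_a) * weighted_sum(doc_b, doc_b))
    return weighted_sum(doc_a, doc_b) / norm if norm else 0.0


def exact_match(c1, c2):
    """Similarity that ignores semantics; reduces the score to plain cosine."""
    return 1.0 if c1 == c2 else 0.0


doc_a = {"heart": 1.0}
doc_b = {"cardiac": 1.0}
semantic_score = semantic_similarity(doc_a, doc_b, toy_concept_sim)
classical_score = semantic_similarity(doc_a, doc_b, exact_match)
```

With `exact_match`, the two documents share no concept and the score is zero, which is how a classical cosine measure behaves. With a semantic concept similarity, the related concepts contribute to the score, which is the kind of replacement at the class prediction step that the abstract describes.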
