Term Similarity and Weighting Framework for Text Representation

The expressiveness of natural language poses a challenge for text representation, since the same idea can be expressed in many different ways. Terms in a document should therefore not be treated independently of one another, because together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms. A second consideration for representation is the importance of terms. Feature selection techniques address this by using statistical measures to quantify how useful each term is for retrieval. In this paper we present a framework that combines term similarity and term weighting for text representation. This allows us to study comparatively the impact of term similarity, of term weighting, and of any synergistic effect that may exist between them. Our study of term similarity is based on approaches that exploit term co-occurrences within document and sentence contexts, whilst term weighting uses the popular chi-squared test. Our results on text classification tasks show that the combined effect of similarity and weighting is superior to that of either technique alone, and that this synergistic effect is obtained regardless of the granularity of the co-occurrence context.
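As a rough illustration of the kind of combination described above, the following Python sketch builds a document-level co-occurrence similarity matrix, computes a chi-squared weight per term, and applies both to a document-term matrix. It is a minimal sketch under stated assumptions and not the paper's implementation: the function names (`cooccurrence_similarity`, `chi_squared_weights`, `represent`), the use of cosine similarity between term columns, the choice of taking the maximum chi-squared score over classes, and the order of combination (similarity first, then weighting) are all assumptions made for illustration.

```python
import numpy as np

def cooccurrence_similarity(X):
    """Cosine similarity between the term columns of a document-term matrix X.

    Assumption: document-level co-occurrence is approximated by the cosine
    between term occurrence vectors; the abstract does not fix this measure.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unused terms
    Xn = X / norms
    return Xn.T @ Xn                 # shape: (n_terms, n_terms)

def chi_squared_weights(X, y):
    """Chi-squared score of each term, maximised over classes (an assumption)."""
    present = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n = present.shape[0]
    weights = np.zeros(present.shape[1])
    for label in np.unique(y):
        in_class = (y == label).astype(float)
        a = present.T @ in_class               # term present, in class
        b = present.T @ (1 - in_class)         # term present, other classes
        c = (1 - present).T @ in_class         # term absent, in class
        d = (1 - present).T @ (1 - in_class)   # term absent, other classes
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        safe = np.where(denom > 0, denom, 1.0)
        chi = np.where(denom > 0, n * (a * d - c * b) ** 2 / safe, 0.0)
        weights = np.maximum(weights, chi)
    return weights

def represent(X, T, w):
    """Spread term occurrences to similar terms via T, then scale by weights w."""
    return (np.asarray(X, dtype=float) @ T) * w

# Toy corpus: 4 documents, 3 terms, 2 classes (hypothetical data).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])

T = cooccurrence_similarity(X)
w = chi_squared_weights(X, y)
D = represent(X, T, w)
print(D.round(3))
```

In this sketch, setting `T` to the identity matrix gives a weighting-only representation and setting `w` to a vector of ones gives a similarity-only representation, which is one simple way to compare the two effects separately and together.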
