Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model

We show how to take similarity between features into account when calculating similarity between objects in the Vector Space Model (VSM), for machine learning algorithms and other classes of methods that rely on similarity between objects. Unlike LSA, we assume that the similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can differ somewhat (which makes them different features) and still have much in common. For example, the words “play” and “game” are different but related. To account for this, we generalize the well-known cosine similarity measure in VSM by introducing what we call the “soft cosine measure”. We propose several formulas for exact or approximate calculation of the soft cosine measure. In one of them, for example, we consider a new feature space for VSM that consists of pairs of the original features weighted by their similarity. For features that bear no similarity to each other, our formulas reduce to the standard cosine measure, so in that case soft similarity is equal to the standard similarity. Our experiments show that the soft cosine measure provides better performance in our case study: the entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the basis for similarity between n-grams, measured either in characters or in elements of n-grams.
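
As an illustration of the idea, the following minimal Python sketch computes a soft cosine between two feature vectors a and b as the bilinear form sum_ij s_ij a_i b_j, normalized by the square roots of the same form applied to (a, a) and (b, b), where s_ij is the known similarity between features i and j. The function names are hypothetical, and the conversion from Levenshtein distance d to a similarity, 1 / (1 + d), is an assumption made here for illustration; the abstract does not specify it.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences.

    Works on strings (distance in characters) and on lists or tuples
    (distance in elements of an n-gram).
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def feature_similarity(f, g):
    """Similarity in (0, 1] derived from edit distance.

    Assumption for illustration: s = 1 / (1 + d), so identical
    features get similarity 1.
    """
    return 1.0 / (1.0 + levenshtein(f, g))


def soft_cosine(a, b, sim):
    """Soft cosine of vectors a and b given a feature similarity matrix sim."""
    def bilinear(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x))
                   for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    features = ["play", "game", "table"]
    sim = [[feature_similarity(f, g) for g in features] for f in features]
    doc1 = [1.0, 0.0, 0.0]  # contains only "play"
    doc2 = [0.0, 1.0, 0.0]  # contains only "game"
    print(soft_cosine(doc1, doc2, sim))
```

When sim is the identity matrix (no similarity between distinct features), the bilinear form becomes the ordinary dot product and the function returns the standard cosine. Note that a character-level edit distance captures only surface similarity; for semantically related words such as “play” and “game”, a dictionary-based similarity would be a more natural source for sim.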
