Discovering interchangeable words from string databases

This paper presents a solution for the problem of finding interchangeable words in the context of an input collection of strings. Interchangeable words are words that can be replaced indistinctly in phrases or free text without deviating its actual meaning. Under restricted conditions, pairs of interchangeable might be useful for data deduplication, copy detection, software localization, among others. The calculation of the degree of interchangeability involves the accurate calculation of semantic similarity between pairs of words and the search for candidate pairs in the overall search space imposed by the input collection. The solution presented in this paper is composed by a search method for candidate pairs using the Levenshtein distance algorithm and a novel algorithm - SSA -for calculating the semantic similarity between words. The proposed solution was implemented and tested within a real world application related to a string message database from a software development company. The system was used to build an ontology with clusters of interchangeable words.

[1]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[2]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[3]  David McLean,et al.  An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources , 2003, IEEE Trans. Knowl. Data Eng..

[4]  Ted Pedersen,et al.  Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts , 2006 .

[5]  Stan Szpakowicz,et al.  Roget's thesaurus and semantic similarity , 2012, RANLP.

[6]  David M. W. Powers,et al.  Measuring Semantic Similarity in the Taxonomy of WordNet , 2005, ACSC.

[7]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[8]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[9]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[10]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[11]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[12]  John B. Lowe,et al.  The Berkeley FrameNet Project , 1998, ACL.

[13]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[14]  Christiane Fellbaum,et al.  Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms , 1998 .