Robust, Light-weight Approaches to compute Lexical Similarity

Most text processing systems need to compare lexical units – words, entities, semantic concepts – with each other as a basic processing step within large and complex systems. A significant amount of research has gone into formulating and evaluating similarity metrics, primarily between words. Often, such techniques are resource-intensive or are applicable only to specific use cases. In this technical report, we summarize some of our research work on finding robust, lightweight approaches to compute similarity between two spans of text. We describe two new similarity measures, WNSim for word similarity and NESim for named entity similarity, which in our experience have been more useful than more standard similarity metrics. We also present a technique, Lexical Level Matching (LLM), that combines such token-level similarity measures to compute phrase- and sentence-level similarity scores. We have found LLM to be useful in a number of NLP applications; it is easy to compute and surprisingly robust.
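As a rough illustration of how a token-level metric can be lifted to a span-level score, the following is a minimal sketch of a greedy max-alignment aggregation in the spirit of LLM; the function name `llm_score` and the pluggable `token_sim` callback are hypothetical stand-ins for illustration, not the report's implementation, into which a measure such as WNSim or NESim could be slotted.

```python
def llm_score(source_tokens, target_tokens, token_sim):
    """Sketch of Lexical Level Matching style aggregation:
    score each target token by its best similarity against any
    source token, then average over the target tokens.
    `token_sim(s, t)` is any token-level metric in [0, 1]."""
    if not source_tokens or not target_tokens:
        return 0.0
    total = 0.0
    for t in target_tokens:
        # Greedy step: keep only the best-matching source token.
        total += max(token_sim(s, t) for s in source_tokens)
    return total / len(target_tokens)


# Usage with a trivial exact-match metric as a stand-in for WNSim/NESim:
exact = lambda a, b: 1.0 if a == b else 0.0
print(llm_score("the cat sat".split(), "a cat sat down".split(), exact))  # 0.5
```

Because each token contributes only its single best match, the aggregate score degrades gracefully under reordering and partial overlap, which is consistent with the robustness the report attributes to LLM.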