Most text processing systems need to compare lexical units – words, entities, semantic concepts – as a basic processing step within larger, complex systems. A significant body of research has formulated and evaluated similarity metrics, primarily between words. Often, such techniques are resource-intensive or are applicable only to specific use cases. In this technical report, we summarize some of our research work in finding robust, lightweight approaches to compute similarity between two spans of text. We describe two new similarity measures, WNSim for word similarity and NESim for named entity similarity, which in our experience have been more useful than more standard similarity metrics. We also present a technique, Lexical Level Matching (LLM), that combines such token-level similarity measures into phrase- and sentence-level similarity scores. We have found LLM to be useful in a number of NLP applications; it is easy to compute, and surprisingly robust to
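To make the LLM idea concrete, the sketch below shows one plausible way to lift a token-level similarity to the span level: align each token of the shorter span to its best match in the longer span and average the resulting maxima. The aggregation scheme, the function name llm_score, and the exact-match token metric in the usage example are illustrative assumptions, not the report's definition; in practice a measure such as WNSim or NESim would be plugged in as the token-level similarity.

```python
from typing import Callable, Sequence


def llm_score(
    span_a: Sequence[str],
    span_b: Sequence[str],
    token_sim: Callable[[str, str], float],
) -> float:
    """Aggregate a token-level similarity into a span-level score.

    Assumed LLM-style scheme: for each token in the shorter span,
    take its best-matching similarity against the longer span, then
    average those maxima.
    """
    shorter, longer = (
        (span_a, span_b) if len(span_a) <= len(span_b) else (span_b, span_a)
    )
    if not shorter:
        return 0.0
    total = sum(max(token_sim(t, u) for u in longer) for t in shorter)
    return total / len(shorter)


if __name__ == "__main__":
    # Trivial token-level metric (case-insensitive exact match) used
    # only for illustration; WNSim / NESim would replace it.
    exact = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
    print(llm_score("the cat sat".split(), "a cat was sitting".split(), exact))
```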