Paraphrase Identification and Semantic Similarity in Twitter with Simple Features

Paraphrase identification and semantic similarity are two distinct but closely related tasks in NLP. Both have been studied extensively on structured text. With the rapid growth of social media data, however, studying them on unstructured text, particularly social text from Twitter, is of considerable interest, since such text can make both problems harder to solve. We identify a set of simple features that achieves very competitive performance on both tasks on Twitter data. We also confirm that word alignment techniques drawn from machine translation evaluation metrics contribute significantly to overall performance on these tasks.
