Multi-task Pairwise Neural Ranking for Hashtag Segmentation

Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer, because hashtags follow ad-hoc conventions that frequently join multiple words together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches achieve a 24.6% error reduction in hashtag segmentation compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieve a 2.6% increase in average recall on the SemEval 2017 sentiment analysis dataset.
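The pairwise ranking framing can be illustrated with a minimal sketch. It is not the neural model described above: the preference function here is a toy unigram scorer standing in for a learned pairwise ranker, and the counts in `COUNTS`, the unknown-word penalty, and all function names are assumptions made for illustration only. Each candidate segmentation is compared against every other one, and the candidate that wins the most comparisons is returned.

```python
import math
from itertools import combinations

# Hypothetical toy unigram counts; a real system would use a large corpus
# or a learned model instead.
COUNTS = {"ice": 30, "cream": 20, "i": 100, "ce": 1}
TOTAL = sum(COUNTS.values())


def candidates(tag):
    """Yield every way to split `tag` into contiguous non-empty pieces."""
    if not tag:
        yield []
        return
    for i in range(1, len(tag) + 1):
        for rest in candidates(tag[i:]):
            yield [tag[:i]] + rest


def score(segmentation):
    """Sum of unigram log-probabilities; unknown pieces are penalised by length."""
    total = 0.0
    for piece in segmentation:
        if piece in COUNTS:
            total += math.log(COUNTS[piece] / TOTAL)
        else:
            # Assumed penalty: unseen strings get rapidly less likely as they grow.
            total += math.log(1.0 / (TOTAL * 10 ** len(piece)))
    return total


def prefers(a, b):
    """Toy pairwise preference: True if segmentation `a` should rank above `b`."""
    return score(a) > score(b)


def segment(tag):
    """Return the candidate that wins the most pairwise comparisons."""
    cands = list(candidates(tag))
    wins = [0] * len(cands)
    for i, j in combinations(range(len(cands)), 2):
        if prefers(cands[i], cands[j]):
            wins[i] += 1
        elif prefers(cands[j], cands[i]):
            wins[j] += 1
    best = max(range(len(cands)), key=wins.__getitem__)
    return cands[best]


print(segment("icecream"))
```

Because this toy scorer assigns each candidate a single number, the pairwise tournament reduces to picking the highest-scoring candidate. A learned pairwise model need not be transitive, which is where aggregating wins over comparisons becomes meaningful. Enumerating all splits is exponential in the hashtag length, so a practical system would presumably restrict the comparison to a small set of candidates.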
