Exploiting hybrid contexts for Tweet segmentation

Twitter has attracted hundreds of millions of users who share and disseminate the most up-to-date information. However, the noisy and short nature of tweets makes many applications in information retrieval (IR) and natural language processing (NLP) challenging. Recently, segment-based tweet representation has proved effective in named entity recognition (NER) and event detection from tweet streams. To split tweets into meaningful phrases, or segments, previous work relies purely on external knowledge bases and ignores the rich local context information embedded in the tweets themselves. In this paper, we propose HybridSeg, a novel framework for tweet segmentation in batch mode. HybridSeg combines local context knowledge with global knowledge bases for better tweet segmentation. It consists of two steps: learning from off-the-shelf weak NERs and learning from pseudo feedback. In the first step, existing NER tools are applied to a batch of tweets, and the named entities they recognize are used to guide the tweet segmentation process. In the second step, HybridSeg iteratively adjusts the segmentation results by exploiting all segments in the batch of tweets collectively. Experiments on two tweet datasets show that HybridSeg significantly improves tweet segmentation quality compared with the state-of-the-art algorithm. We also conduct a case study that uses tweet segments for the task of named entity recognition from tweets. The experimental results demonstrate that HybridSeg significantly benefits this downstream application.
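
The idea of combining global and local evidence can be illustrated with a small sketch. The code below is not the HybridSeg scoring function, which the abstract does not specify. It assumes a hypothetical per-segment score that linearly mixes a probability from a global knowledge base (`global_prob`) with one estimated from the current batch of tweets, for example from the output of weak NERs (`local_prob`). The function names, the mixing weight `lam`, the single-word baseline, and all probabilities are illustrative assumptions. Segmentation itself is done with a standard dynamic program over split points.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[str, ...]


def hybrid_score(
    global_prob: Dict[Segment, float],
    local_prob: Dict[Segment, float],
    lam: float = 0.5,
) -> Callable[[Segment], float]:
    """Build a hypothetical segment scorer (an assumption, not the paper's).

    lam = 0 uses only the global knowledge base; lam = 1 uses only the
    local, batch-derived evidence.
    """

    def score(seg: Segment) -> float:
        if len(seg) == 1:
            # Small baseline so that every tweet can be fully segmented.
            return 0.1
        g = global_prob.get(seg, 0.0)
        loc = local_prob.get(seg, 0.0)
        # Longer segments are rewarded in proportion to their length.
        return len(seg) * ((1.0 - lam) * g + lam * loc)

    return score


def segment(
    words: Sequence[str],
    score: Callable[[Segment], float],
    max_len: int = 3,
) -> List[Segment]:
    """Split words into contiguous segments that maximize the total score."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n  # best[i]: best score of words[:i]
    back = [0] * (n + 1)                # back[i]: start of the last segment
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(tuple(words[start:end]))
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Walk the back-pointers to recover the segments in order.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(words[start:end]))
        end = start
    return segments[::-1]


# Illustrative (made-up) probabilities.
global_prob = {("new", "york"): 0.6, ("new", "york", "city"): 0.3}
local_prob = {("new", "york", "city"): 0.9}  # e.g. seen often in this batch
tweet = ["i", "love", "new", "york", "city"]

global_only = segment(tweet, hybrid_score(global_prob, local_prob, lam=0.0))
hybrid = segment(tweet, hybrid_score(global_prob, local_prob, lam=0.5))
```

With these toy numbers, the global-only scorer keeps "new york" and "city" as separate segments, while the mixed scorer merges them into "new york city" because the batch provides strong local evidence for the longer phrase. An iterative variant in the spirit of the pseudo-feedback step could re-estimate `local_prob` from the segments produced in one pass and segment again, but that loop is omitted here.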
