Segmentation of Tweets with URLs and its Applications to Sentiment Analysis

An important means of disseminating information on social media platforms is to include URLs that point to external sources in user posts. On Twitter, we estimate that about 21% of the daily stream of English-language tweets contain URLs. We observe that existing NLP tools make little attempt to understand the relationship between the content a URL points to and the text surrounding the URL in a tweet. In this work, we study the structure of tweets with URLs relative to the content of the Web documents the URLs point to. We identify several segment classes that may appear in a tweet with URLs, such as the title of a Web page and the user’s original content. Our goals in this paper are to introduce, define, and analyze the segmentation problem for tweets with URLs, to develop an effective algorithm to solve it, and to show that our solution can benefit sentiment analysis on Twitter. We also show that the problem is an instance of the block edit distance problem and is therefore NP-hard.
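To make the segment classes concrete, the following is a minimal sketch, not the algorithm developed in the paper. It assumes the page title has already been fetched, and it labels only the single longest run of tokens shared by the tweet and the title. The function name `split_tweet`, the labels `"title"` and `"user"`, and the example tweet and URL are all hypothetical. The full problem allows several blocks that may be moved or edited, which is where the block edit distance formulation comes in.

```python
from difflib import SequenceMatcher


def split_tweet(tweet: str, page_title: str):
    """Label the longest token run shared with the page title as 'title'.

    Everything before or after that run is labelled 'user'. This is a
    simplification: it finds one contiguous block only.
    """
    tweet_tokens = tweet.split()
    # Compare case-insensitively, but report the tweet's original tokens.
    lowered_tweet = [t.lower() for t in tweet_tokens]
    lowered_title = [t.lower() for t in page_title.split()]

    matcher = SequenceMatcher(None, lowered_tweet, lowered_title, autojunk=False)
    m = matcher.find_longest_match(0, len(lowered_tweet), 0, len(lowered_title))

    segments = []
    if m.a > 0:
        segments.append(("user", " ".join(tweet_tokens[:m.a])))
    if m.size > 0:
        segments.append(("title", " ".join(tweet_tokens[m.a:m.a + m.size])))
    if m.a + m.size < len(tweet_tokens):
        segments.append(("user", " ".join(tweet_tokens[m.a + m.size:])))
    return segments


# Hypothetical example tweet and page title.
example = split_tweet(
    "Finally some good news! Scientists discover new exoplanet https://example.com/x",
    "Scientists discover new exoplanet",
)
for label, text in example:
    print(label, "|", text)
```

A downstream sentiment classifier could then be applied to the `"user"` segments alone, so that the tone of the page title is not attributed to the person who posted the tweet.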
