Just the Facts: Winnowing Microblogs for Newsworthy Statements using Non-Lexical Features

Microblogging has become a popular way to disseminate information quickly, but it is also used for many other dialogue acts, such as expressing opinions and advertising. As message volumes have risen, the task of filtering messages for wanted information has become increasingly important. In this work we examine the potential of natural language processing and machine learning to filter short messages for those that state items of news. We propose an approach that makes use of information carried at a deeper level than a message’s lexical surface, and show that this can be used effectively to improve precision in filtering Twitter messages. Our method outperforms a baseline unigram “bag-of-words” approach to selecting news-event Tweets, yielding a 4.8% drop in false detection.
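To make the baseline concrete, the following is a minimal sketch of unigram "bag-of-words" feature extraction of the kind the abstract compares against. The tokenizer, the function names, and the example messages are illustrative assumptions; the abstract does not specify how the baseline tokenizes text or which classifier consumes the features.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep letters, digits, '#', '@' and apostrophes.
    return re.findall(r"[a-z0-9#@']+", text.lower())


def build_vocabulary(messages):
    # Assign each distinct unigram an index in order of first appearance.
    vocab = {}
    for message in messages:
        for token in tokenize(message):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(message, vocab):
    # Count vector over the vocabulary; unigrams outside the vocabulary are ignored.
    counts = Counter(tokenize(message))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        index = vocab.get(token)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical training messages, used only to build a vocabulary.
messages = ["Quake hits city", "I love this city city"]
vocab = build_vocabulary(messages)
features = bag_of_words("city city quake", vocab)
```

Such a representation records only which words occur and how often, which is the lexical surface that the proposed non-lexical features are meant to go beyond.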
