Information Filtering on Micro-blogging Services

Micro-blogging is an emerging form of communication that has become very popular in recent years. Micro-blogging services allow users to publish updates as short text messages that are broadcast to their followers in real time. Twitter is currently the most popular micro-blogging service. It is a rich, real-time information source and a good way to discover interesting content or to follow recent developments. However, the service is fairly simple and relies on the concept of following other users. Because it lacks classification or filtering tools, a user receives all messages posted by the users she follows, and in most cases this results in a noisy stream of updates. In this paper, we introduce an information filtering system for Twitter. The system focuses on one kind of feed on Twitter: lists, which are manually selected groups of Twitter users. List feeds tend to focus on specific topics, but they are still noisy due to irrelevant messages. Therefore, we propose an online filtering system that extracts the niche topics in a list and filters out irrelevant messages. To classify messages as relevant or irrelevant, we use text-based features together with the social network of Twitter and other aspects of messages, such as their temporal properties and the links included in the text. We evaluate our approach on a labeled dataset of lists and, with the help of these novel features, achieve accuracies between 85% and 95%. Finally, we present the online prototype of the system.
