ActNeT: Active Learning for Networked Texts in Microblogging

Supervised learning, e.g., classification, plays an important role in processing and organizing microblogging data. In microblogging, it is easy to amass vast quantities of unlabeled data, but costly to obtain the labels that supervised learning algorithms require. To reduce the labeling cost, active learning selects representative and informative instances to query for labels, so that the learned model improves with fewer labeled examples. Unlike traditional data, whose instances are assumed to be independent and identically distributed (i.i.d.), instances in microblogging are networked with each other. This presents both opportunities and challenges for applying active learning to microblogging data. Inspired by social correlation theories, we investigate whether social relations can help perform effective active learning on networked data. In this paper, we propose a novel Active learning framework for the classification of Networked Texts in microblogging (ActNeT). In particular, we study how to incorporate network information into text content modeling, and we design strategies that take advantage of social network structure to select the most representative and informative instances from microblogging for labeling. Experimental results on Twitter datasets show the benefit of incorporating network information in active learning and show that the proposed framework outperforms existing state-of-the-art methods.
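
As a rough illustration of the general idea of combining informativeness with representativeness when choosing instances to label, the following minimal sketch scores each unlabeled text by the entropy of its predicted class distribution and by its degree centrality in the network. This is not the ActNeT selection strategy itself; the function names, the linear weighting `alpha`, the choice of degree centrality, and the toy data are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is linked to."""
    denom = max(len(adjacency) - 1, 1)  # guard against a single-node graph
    return {node: len(neighbors) / denom for node, neighbors in adjacency.items()}

def select_queries(probabilities, adjacency, labeled, budget, alpha=0.5):
    """Return up to `budget` unlabeled nodes ranked by a combined score.

    probabilities: node -> predicted class probabilities (at least two classes)
    adjacency:     node -> set of neighboring nodes (hypothetical social links)
    labeled:       nodes that already have labels and are skipped
    alpha:         assumed trade-off; 1.0 uses uncertainty only, 0.0 centrality only
    """
    centrality = degree_centrality(adjacency)
    num_classes = len(next(iter(probabilities.values())))
    max_entropy = math.log(num_classes)  # normalizes entropy into [0, 1]
    scores = {}
    for node, probs in probabilities.items():
        if node in labeled:
            continue
        informativeness = entropy(probs) / max_entropy
        representativeness = centrality[node]
        scores[node] = alpha * informativeness + (1 - alpha) * representativeness
    # Highest score first; ties are broken by node name for determinism.
    return sorted(scores, key=lambda node: (-scores[node], node))[:budget]

# Toy network of five texts; "t1" is the best-connected one.
adjacency = {
    "t1": {"t2", "t3", "t4"},
    "t2": {"t1"},
    "t3": {"t1"},
    "t4": {"t1", "t5"},
    "t5": {"t4"},
}
# Hypothetical classifier outputs for a two-class problem.
probabilities = {
    "t1": [0.60, 0.40],
    "t2": [0.50, 0.50],
    "t3": [0.90, 0.10],
    "t4": [0.55, 0.45],
    "t5": [0.99, 0.01],
}

print(select_queries(probabilities, adjacency, labeled={"t3"}, budget=2))
# prints ['t1', 't4']
```

With `alpha=1.0` the same call ignores the network and returns the two most uncertain texts, `['t2', 't4']`; with the default weighting the well-connected `t1` is preferred over the isolated but maximally uncertain `t2`, which shows how network structure can change which instances are queried.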
