CDS: Collaborative distant supervision for Twitter account classification

Individuals use Twitter for personal communication, whereas businesses, politicians and celebrities use it for branding. Distinguishing Personal from Branding accounts is therefore important for Twitter analytics. Existing studies of Twitter account classification apply classical supervised learning, which requires intensive manual annotation of training data. In this paper, we propose CDS (Collaborative Distant Supervision), a novel learning scheme for Twitter account classification that does not require intensive manual labelling. Twitter accounts are labelled automatically by heuristics, and these labels serve as distant supervision. To learn effectively from the heuristic labels, active learning is applied to identify and correct false positive labels, and semi-supervised learning is applied to exploit the false negatives, that is, the accounts that the labelling heuristics missed. Extensive experiments on Twitter data showed that CDS achieved high classification accuracy.
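The three stages named in the abstract (heuristic labelling, active correction of false positives, semi-supervised use of missed accounts) can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the two numeric features, the keyword heuristic, the nearest-centroid classifier, the self-training step, and every name (`ACCOUNTS`, `heuristic`, `cds`, `budget`, `min_margin`) are hypothetical, and the abstract does not specify which learner or heuristics CDS uses.

```python
from statistics import mean

# Hypothetical toy data: (features, bio). The features (share of tweets with
# URLs, share of tweets that are replies) are assumptions for illustration.
ACCOUNTS = [
    ((0.90, 0.10), "Official account of Acme Corp"),
    ((0.80, 0.20), "Official news from the mayor's office"),
    ((0.10, 0.80), "unofficial fan page, just for fun"),  # heuristic false positive
    ((0.85, 0.15), "Deals and offers every day"),         # missed by the heuristic
    ((0.20, 0.70), "dad, runner, tea drinker"),
    ((0.15, 0.90), "just here to chat with friends"),
]
TRUE_LABELS = [1, 1, 0, 1, 0, 0]  # 1 = Branding, 0 = Personal (stand-in oracle)


def heuristic(bio):
    """Assumed labelling heuristic: 1, 0, or None when no rule fires."""
    text = bio.lower()
    if "official" in text:
        return 1
    if any(word in text for word in ("dad", "friends")):
        return 0
    return None


def centroids(pairs):
    """Mean feature vector per class from (features, label) pairs."""
    return {
        cls: tuple(mean(col) for col in zip(*[x for x, y in pairs if y == cls]))
        for cls in (0, 1)
    }


def predict(cents, x):
    """Return (label, margin); a small margin means an uncertain prediction."""
    dist = {c: sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in cents.items()}
    return min(dist, key=dist.get), abs(dist[0] - dist[1])


def cds(accounts, oracle, budget=1, min_margin=0.3):
    feats = [x for x, _ in accounts]
    # Stage 1: distant supervision from heuristic labels.
    labels = {i: heuristic(bio) for i, (_, bio) in enumerate(accounts)}
    labels = {i: y for i, y in labels.items() if y is not None}

    # Stage 2: active learning. Query the oracle for the heuristic labels
    # that the current model disputes most strongly, up to the budget.
    cents = centroids([(feats[i], y) for i, y in labels.items()])
    disputed = [i for i, y in labels.items() if predict(cents, feats[i])[0] != y]
    disputed.sort(key=lambda i: -predict(cents, feats[i])[1])
    for i in disputed[:budget]:
        labels[i] = oracle(i)

    # Stage 3: semi-supervised step (self-training here). Pseudo-label
    # accounts the heuristic missed when the retrained model is confident.
    cents = centroids([(feats[i], y) for i, y in labels.items()])
    for i, x in enumerate(feats):
        if i not in labels:
            y, margin = predict(cents, x)
            if margin >= min_margin:
                labels[i] = y
    return labels


labels = cds(ACCOUNTS, oracle=lambda i: TRUE_LABELS[i])
print(sorted(labels.items()))
```

In this toy run the "unofficial fan page" account is first mislabelled as Branding by the keyword rule and then corrected through the oracle query, while the "Deals and offers" account, which no rule matched, receives a Branding pseudo-label in the last stage.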
