Chalk and Cheese in Twitter: Discriminating Personal and Organization Accounts

Social media are popular not only with individuals sharing content, but also with organizations seeking to engage users and spread information. Because personal and organization accounts differ in their traits, the ability to distinguish between the two account types is important for developing better search and recommendation engines, marketing strategies, and information dissemination platforms. However, this task is non-trivial and has not been well studied thus far. In this paper, we present a new generic framework for classifying personal and organization accounts, which enables a comprehensive and systematic investigation of a rich variety of content, social, and temporal features. In addition to generic feature transformation pipelines, the framework features a gradient boosting classifier that is accurate and robust and that facilitates data understanding, for example by revealing the importance of different features. We demonstrate the efficacy of our approach through extensive experiments on Twitter data from Singapore, through which we discover several discriminative content, social, and temporal features.
