Simple Statistics Are Sometimes Too Simple: A Case Study in Social Media Data

In this work we ask to what extent simple statistics are useful for making sense of social media data. By simple statistics we mean counting and bookkeeping features such as the number of likes given to a user's post, a user's number of friends, etc. We find that relying solely on simple statistics is not always a good approach. Specifically, we develop a statistical framework, which we term semantic shattering, that detects semantic inconsistencies in the data that may arise from relying solely on simple statistics. We apply our framework to simple-statistics data collected from six online social media platforms and arrive at a surprising, counter-intuitive finding in three of them: Twitter, Instagram, and YouTube. We find that overall, a user's activity is not correlated with the feedback that the user receives on that activity. A hint toward understanding this phenomenon may lie in the fact that the activity-feedback shattering did not occur in LinkedIn, Steam, or Flickr. A possible explanation for this separation is the amount of effort required to produce content: the less effort required, the weaker the correlation between activity and feedback. The amount of effort may be a proxy for the level of commitment that users feel toward one another in the network, and indeed sociologists claim that commitment explains consistent human behavior, or the lack thereof. However, neither the amount of effort nor the level of commitment is by any means a simple statistic.
