The good, the bad, and the ugly: uncovering novel research opportunities in social media mining

Big data is ubiquitous and can only grow bigger, which challenges traditional data mining and machine learning methods. Social media is a new source of data that differs significantly from conventional sources. Social media data are mostly user-generated, and are big, linked, and heterogeneous. We present the good, the bad, and the ugly associated with multi-faceted social media data and illustrate the importance of some original problems with real-world examples. We discuss bias in social media data, the evaluation dilemma, data reduction, inferring invisible information, and the big-data paradox. We highlight new opportunities for developing novel algorithms and tools for data science. In our endeavor to employ the good to tame the bad with the help of the ugly, we deepen the understanding of ever-growing and continuously evolving data and create innovative solutions through interdisciplinary and collaborative research in data science.
