Big Data: Pitfalls, Methods and Concepts for an Emergent Field

Big Data, large-scale aggregate databases of imprints of online and social media activity, has captured scientific and policy attention. However, this emergent field is challenged by inadequate attention to methodological and conceptual issues. I review key methodological and conceptual challenges. Methodological issues include: 1) Inadequate attention to the implicit and explicit structural biases of the platform(s) most frequently used to generate datasets (the model organism problem). 2) The common practice of selecting on the dependent variable without corresponding attention to the complications of this path. 3) Lack of clarity with regard to sampling, universe and representativeness (the denominator problem). 4) Reliance on a single platform in most big data analyses (hence missing the ecology of information flows). Conceptual issues reviewed in this paper include: 1) More research is needed to interpret aggregated mediated interactions. Clicks, status updates, links, retweets, etc. are complex social interactions. 2) Network methods imported from other fields need to be carefully reconsidered to evaluate appropriateness for analyzing human social media imprints. 3) Most big datasets contain information only on “node-to-node” interaction. However, “field” effects – events that affect a society or a group in a wholesale fashion either through shared experience or through broadcast media – are an important part of human socio-cultural experience. 4) Human reflexivity – that humans will alter behaviors around metrics – needs to be assumed and built into the analysis. 5) Assuming additivity and counting interactions so that each new interaction is seen as (n+1) without regard to the semantics or context can be misleading. 6) The relationship between network structure and other attributes is complex and multi-faceted.
