Automating Open Science for Big Data

The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer, where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data, including large-scale streaming data, is starting to transform research and has the potential to influence policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges for executing the analysis, archiving and reusing the data, and reproducing the results. Downloading these datasets to a researcher’s computer is impractical, so analyses take place in the cloud and require unusual expertise, collaboration, and tool development. The greater amount of information in these large datasets is an advantage, but it also increases the risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.
