Elements and Principles for Characterizing Variation between Data Analyses

The data revolution has increased interest in the practice of data analysis. For a given problem, data analysts can differ, significantly or subtly, in how they construct a data analysis, including in their choice of methods, tooling, and workflow. In addition, data analysts can prioritize (or not) certain objective characteristics in a data analysis, leading to differences in its quality or in the experience of it, such as an analysis that is more or less reproducible, or more or less exhaustive. However, data analysts currently lack a formal mechanism to compare and contrast what makes analyses different from each other. To address this problem, we introduce a vocabulary to describe and characterize variation between data analyses. We call this vocabulary the elements and principles of data analysis, and we use them to describe the fundamental concepts for the practice and teaching of creating a data analysis. This leads to two insights: it suggests a formal mechanism to evaluate data analyses based on objective characteristics, and it provides a framework to teach students how to build data analyses.
