Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society

The Big Data Research and Development Initiative is now in its third year and is making great strides in addressing the challenges of Big Data. To advance the initiative further, we describe how statistical thinking can help tackle these challenges, emphasizing that the most productive approach will often involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise.
