Dynamic data science and official statistics

Many of the challenges and opportunities of data science have to do with dynamic factors: a growing volume of administrative and commercial data on individuals and establishments, continuous flows of data and the capacity to analyze and summarize them in real time, and the need for resources to maintain them. With its emphasis on data quality and supportable results, the practice of Official Statistics faces a variety of statistical and data science issues. This article discusses the importance of population frames and their maintenance; the potential for use of multi-frame methods and linkages; how the use of large-scale non-survey data may shape the objects of inference; the complexity of models for large data sets; the importance of recursive methods and regularization; and the benefits of sophisticated spatial visualization tools in capturing spatial variation and temporal change.

The Canadian Journal of Statistics xx: 1–14; 2017 © 2017 Statistical Society of Canada
