The potential and perils of preprocessing: Building new foundations

Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. However, it is rife with subtleties and pitfalls. Decisions made in preprocessing constrain all later analyses and are typically irreversible. Hence, data analysis becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Even if each party has done its best given the information and resources available to them, the final result may still fall short of the best possible in the traditional single-phase inference framework. This is particularly relevant as we enter the era of “big data”. The technologies driving this data explosion are subject to complex new forms of measurement error. Simultaneously, we are accumulating increasingly massive databases of scientific analyses. As a result, preprocessing has become more vital (and potentially more dangerous) than ever before. We propose a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We motivate this foundation with two problems from biology and astrophysics, illustrating multiphase pitfalls and potential solutions. These examples also emphasize the motivations behind multiphase analyses—both practical and theoretical. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several rich paths for further research into the statistical principles underlying preprocessing. To tackle our increasingly complex and massive data, we must ensure that our inferences are built upon solid inputs and sound principles. Principled investigation of preprocessing is thus a vital direction for statistical research.
