Nonstandard conditionally specified models for nonignorable missing data

Data analyses typically rely on assumptions about the missingness mechanism that determines which data are observed and which are missing. When the data are missing not at random, direct assumptions about the missingness mechanism, and indirect assumptions about the distributions of observed and missing data, are typically untestable. We explore an approach in which the joint distribution of observed and missing data is specified through nonstandard conditional distributions. This formulation traces back to a factorization of the joint distribution apparently proposed by J.W. Tukey. Under it, the modeling assumptions about the conditional factors are either testable or designed to incorporate substantive knowledge about the problem at hand, which offers a possibly realistic portrayal of both the missing and the observed data. We apply Tukey's conditional representation to exponential family models and propose a computationally tractable inferential strategy for this class of models. We illustrate the utility of the approach using high-throughput biological data in which the missing values are not missing at random.
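
As a rough sketch of the kind of factorization the abstract refers to, consider a single outcome Y with a response indicator R, where R = 1 means Y is observed. This notation is an assumption made here for illustration and does not appear in the abstract. Applying Bayes' rule to p(y) gives a joint distribution written in terms of the observed-data density p(y | R = 1), which can be checked against data, and the missingness mechanism Pr(R = 1 | y), which is where substantive knowledge would enter:

```latex
\begin{align*}
p(y, R = r) &= p(y \mid R = 1)\,\Pr(R = 1)\,
               \frac{\Pr(R = r \mid y)}{\Pr(R = 1 \mid y)},
               \qquad r \in \{0, 1\}, \\
\Pr(R = 1)  &= \left[ \int \frac{p(y \mid R = 1)}{\Pr(R = 1 \mid y)}\, dy \right]^{-1}.
\end{align*}
```

The second line follows from requiring p(y) to integrate to one, so the representation is well defined only when that integral is finite. How the paper itself parameterizes these factors for exponential family models is not stated in the abstract.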
