Nonstandard conditionally specified models for nonignorable missing data

Significance: We consider data-analysis settings where data are missing not at random. In these cases, the two basic modeling approaches are 1) pattern-mixture models, with separate distributions for missing data and observed data, and 2) selection models, with a distribution for the data preobservation and a missing-data mechanism that selects which data are observed. These two modeling approaches lead to distinct factorizations of the joint distribution of the observed-data and missing-data indicators. In this paper, we explore a third approach, apparently originally proposed by J. W. Tukey as a remark in a discussion between Rubin and Hartigan and reported by Holland in a two-page note, which has so far been neglected.

Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, and these assumptions are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey’s representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
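To make the third factorization concrete, here is a minimal numerical sketch. It is not the paper's model or code. It assumes a single outcome y with response indicator r (r = 1 when y is observed) and uses the identity p(y | r = 0) ∝ p(y | r = 1) · P(r = 0 | y) / P(r = 1 | y), which follows from Bayes' rule. The normal observed-data density, the logistic missingness mechanism, the parameter values, and the function names are all illustrative assumptions. Under these choices the missing-data density is again normal with mean mu_obs - b1 * sd_obs**2, which the code checks on a grid.

```python
import numpy as np


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def missing_data_density(y, mu_obs, sd_obs, b0, b1):
    """Density of the missing values on an evenly spaced grid y.

    Assumed (hypothetical) ingredients:
      p(y | r=1)      = Normal(mu_obs, sd_obs^2)   -- assessable from observed data
      logit P(r=1|y)  = b0 + b1 * y                -- substantive assumption
    so the odds of being missing are exp(-(b0 + b1 * y)).
    """
    odds_missing = np.exp(-(b0 + b1 * y))
    unnormalized = normal_pdf(y, mu_obs, sd_obs) * odds_missing
    dy = y[1] - y[0]
    return unnormalized / (unnormalized.sum() * dy)  # normalize numerically


# Illustrative parameter values, chosen only for this sketch.
mu_obs, sd_obs, b0, b1 = 1.0, 2.0, 0.5, 0.8
y = np.linspace(-15.0, 15.0, 30001)
dy = y[1] - y[0]

p_mis = missing_data_density(y, mu_obs, sd_obs, b0, b1)
mean_mis = (y * p_mis).sum() * dy
var_mis = ((y - mean_mis) ** 2 * p_mis).sum() * dy

# Closed form under the normal/logistic assumption: the mean shifts, the variance does not.
closed_form_mean = mu_obs - b1 * sd_obs ** 2

# Marginal response probability implied by the same two ingredients:
# P(r=0) / P(r=1) equals the observed-data expectation of the missingness odds.
expected_odds = (normal_pdf(y, mu_obs, sd_obs) * np.exp(-(b0 + b1 * y))).sum() * dy
prob_observed = 1.0 / (1.0 + expected_odds)

print(mean_mis, closed_form_mean, var_mis, prob_observed)
```

With b1 > 0, larger values are more likely to be observed, so the implied missing values are shifted downward relative to the observed ones. The only inputs are a model for the observed data and a model for the missingness mechanism, which is the pairing the abstract describes.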
