Modern Multiple Imputation with Functional Data

This work considers the problem of fitting functional models to sparsely and irregularly sampled functional data. It overcomes limitations of state-of-the-art methods, which face major challenges when fitting more complex nonlinear models. Currently, many of these models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size. In contrast, we show numerically that a modified approach using more modern multiple imputation methods can produce better estimates in general. We also propose a new imputation approach that combines the ideas of MissForest with Local Linear Forest, and we compare the performance of these approaches with that of PACE and several other multivariate multiple imputation methods. This work is motivated by a longitudinal study on smoking cessation, in which Electronic Health Records (EHR) from Penn State PaTH to Health provide a great deal of data with highly variable sampling. To illustrate our approach, we explore the relation between relapse and diastolic blood pressure. We also consider a variety of simulation schemes with varying levels of sparsity to validate our methods.
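The abstract does not state how estimates from the individual imputed datasets are combined. A common choice in multiple imputation is Rubin's rules, which average the per-imputation estimates and add the within-imputation and between-imputation variances. The sketch below illustrates that pooling step only; the function name `pool_rubin` and the input values are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and their squared standard errors.

    Assumption: a scalar parameter (e.g. one regression coefficient) has been
    estimated once on each of the M imputed datasets.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between  # Rubin's total variance
    return q_bar, total


# Hypothetical estimates and squared standard errors from M = 3 imputed datasets.
q, t = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q, t)
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it carries the uncertainty due to the missing observations into the final standard error.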
