Accounting for Dependence Induced by Weighted KNN Imputation in Paired Samples, Motivated by a Colorectal Cancer Study

Missing data arise in bioinformatics applications for a variety of reasons, and imputation methods are frequently applied to such data. We are motivated by a colorectal cancer study in which miRNA expression was measured in paired tumor-normal samples from hundreds of patients, but data for many normal samples were missing because tissue was unavailable. We compare the precision and power of several imputation methods, and draw attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. This imputation-induced dependence has not previously been addressed in the literature. We demonstrate how to account for this dependence, and show through simulation how the choice to ignore or account for it affects both power and type I error rate control.
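To illustrate why KNN imputation induces dependence, the following is a minimal sketch and not the procedure used in the study. It assumes a hypothetical setup in which a patient's missing normal value is imputed as an inverse-distance-weighted average of the normal values of the K patients with the closest tumor profiles. The function name `knn_weight_matrix`, the distance measure, the weighting scheme, and the simulated data are all assumptions made for illustration. Because each imputed value is a linear combination of observed values, the completed data can be written as `W @ normal_observed`, and if the observed values were independent with unit variance (and the weights treated as fixed), the completed values would have covariance `W @ W.T`, whose off-diagonal entries are nonzero wherever donors are shared.

```python
import numpy as np

def knn_weight_matrix(tumor, observed_mask, k=3):
    """Hypothetical helper: build W so that completed = W @ normal_observed.

    Rows for patients with an observed normal sample are indicator rows;
    rows for patients with a missing normal sample hold inverse-distance
    weights on the k nearest donors (distance measured on tumor profiles).
    """
    n = tumor.shape[0]
    obs_idx = np.flatnonzero(observed_mask)  # sorted indices of donors
    W = np.zeros((n, obs_idx.size))
    for i in range(n):
        if observed_mask[i]:
            # Observed value is kept as-is.
            W[i, np.searchsorted(obs_idx, i)] = 1.0
            continue
        # Euclidean distance from patient i to every donor (assumed metric).
        d = np.linalg.norm(tumor[obs_idx] - tumor[i], axis=1)
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-12)  # inverse-distance weights
        W[i, nearest] = w / w.sum()     # normalize to sum to one
    return W

# Simulated toy data: 8 patients, 5 features, last 3 normals missing.
rng = np.random.default_rng(0)
n, p = 8, 5
tumor = rng.normal(size=(n, p))
observed_mask = np.array([True] * 5 + [False] * 3)

W = knn_weight_matrix(tumor, observed_mask, k=3)
normal_observed = rng.normal(size=5)   # observed normal values, one feature
completed = W @ normal_observed        # observed plus imputed values

# Covariance of the completed values, up to the common variance, under the
# assumption of independent observed values and fixed weights.
cov_unit = W @ W.T
```

In this sketch the imputed rows of `cov_unit` have variance below one and nonzero covariance with their donors and with each other, so a paired test that treats all completed pairs as independent would misstate the variance of its test statistic.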
