Structured Matrix Completion with Applications to Genomic Data Integration

ABSTRACT Matrix completion has attracted significant recent attention in many fields, including statistics, applied mathematics, and electrical engineering. The current literature on matrix completion focuses primarily on independent sampling models, under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix is observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation error, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival. Supplementary materials for this article are available online.
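
To make the structured-missingness setting concrete, the sketch below illustrates the idealized, noise-free case only. It is not the SMC estimator proposed in the article, which targets approximately low-rank matrices. It relies on a standard linear algebra fact: if the matrix A is partitioned into blocks A11, A12, A21, A22, with only A22 unobserved, and rank(A11) = rank(A), then A22 = A21 A11^+ A12, where A11^+ is the Moore-Penrose pseudoinverse. The function name `complete_block`, the block names, the dimensions, and the `rcond` cutoff are all assumptions chosen for illustration.

```python
import numpy as np


def complete_block(a11, a12, a21, rcond=1e-10):
    """Recover the unobserved block A22 of an exactly low-rank matrix.

    Assumes rank(A11) equals rank(A); this is the idealized identity
    A22 = A21 @ pinv(A11) @ A12, not the article's SMC estimator.
    """
    # rcond discards singular values of A11 that are numerically zero.
    return a21 @ np.linalg.pinv(a11, rcond=rcond) @ a12


# Hypothetical example: a 30 x 40 matrix of exact rank 3.
rng = np.random.default_rng(0)
p1, p2, m1, m2, r = 30, 40, 8, 10, 3
A = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))

# Observed: the first m1 rows and the first m2 columns.
a11, a12 = A[:m1, :m2], A[:m1, m2:]
a21, a22 = A[m1:, :m2], A[m1:, m2:]  # a22 is held out as ground truth

a22_hat = complete_block(a11, a12, a21)
rel_err = np.linalg.norm(a22_hat - a22) / np.linalg.norm(a22)
print(f"relative recovery error: {rel_err:.2e}")
```

With noisy or only approximately low-rank data, a plain pseudoinverse of A11 can amplify small singular values, which is one reason a more careful estimator is needed in that setting.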
