Gene expression prediction using low-rank matrix completion

Background: High-throughput biological data have grown exponentially over the past decade, supported by technologies such as microarrays and RNA-Seq. Most data generated using such methods encode large amounts of rich information and are used to determine diagnostic and prognostic biomarkers. Although data storage costs have fallen, the process of capturing data with these technologies is still expensive. Moreover, the time required for the assay, from sample preparation to raw value measurement, is long (on the order of days). There is an opportunity to reduce both the cost and the time of generating such expression datasets.

Results: We propose a framework in which complete gene expression values can be reliably predicted in silico from partial measurements. This is achieved by modelling expression data as a low-rank matrix and then applying recently developed matrix completion techniques based on nonlinear convex optimisation. We evaluated the prediction of gene expression data on 133 studies, comprising a combined total of 10,921 samples. We show that such datasets can be reconstructed with a low relative error even at high missing-value rates (>50 %), and that the predicted datasets can be reliably used as surrogates for further analysis.

Conclusion: This method has potentially far-reaching applications, including how biomedical data are sourced and generated, and transcriptomic prediction by optimisation. We show that gene expression data can be computationally constructed, thereby potentially reducing the costs of gene expression profiling. In conclusion, this method shows great promise for opening new avenues of research on low-rank matrix completion in the biological sciences.
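
The abstract does not specify the solver, so the following is only a minimal sketch of the general idea, assuming an iterative soft-thresholded SVD (a common heuristic for nuclear-norm regularised matrix completion) rather than the authors' actual method. The function name `soft_impute`, the threshold `lam`, and the synthetic rank-3 "expression" matrix are all hypothetical choices made for illustration; no real expression data are used.

```python
import numpy as np

def soft_impute(X, mask, lam=1.0, n_iter=500, tol=1e-6):
    """Estimate unobserved entries of X (where mask is False) with a
    low-rank fit obtained by repeated soft-thresholding of singular values.
    Illustrative sketch only; not the solver used in the paper."""
    Z = np.where(mask, X, 0.0)  # start with zeros in the missing entries
    for _ in range(n_iter):
        # Keep observed values, use the current estimate elsewhere.
        filled = np.where(mask, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # shrink singular values towards zero
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only gaps are predicted.
    return np.where(mask, X, Z)

# Synthetic stand-in for a genes x samples matrix with exact rank 3.
rng = np.random.default_rng(0)
n_genes, n_samples, rank = 200, 60, 3
X_true = rng.normal(size=(n_genes, rank)) @ rng.normal(size=(rank, n_samples))

# Hide roughly half of the entries to mimic partial measurement.
mask = rng.random(X_true.shape) < 0.5
X_obs = np.where(mask, X_true, np.nan)

X_hat = soft_impute(X_obs, mask)

# Relative error measured on the hidden entries only.
hidden = ~mask
rel_err = np.linalg.norm((X_hat - X_true)[hidden]) / np.linalg.norm(X_true[hidden])
print(f"relative error on hidden entries: {rel_err:.3f}")
```

The relative error here is computed only over the hidden entries, which mirrors the evaluation described in the abstract in spirit. Real expression matrices are only approximately low-rank and are noisy, so the choice of `lam` and the achievable error would differ from this synthetic case.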