Learning and Imputation for Mass-spec Bias Reduction (LIMBR)

Motivation: Decreasing costs are making it feasible to perform time series proteomics and genomics experiments with more replicates and higher resolution than ever before. With more replicates and time points, proteome- and genome-wide patterns of expression are more readily discernible. These larger experiments require more batches, which exacerbates batch effects and increases the number of bias trends. In proteomics, where methods frequently result in missing data, this increasing scale also decreases the number of peptides observed in all samples. The sources of batch effects and missing data are incompletely understood, necessitating novel techniques.

Results: Here we show that, by exploiting the structure of time series experiments, it is possible to accurately and reproducibly model and remove batch effects. We implement the Learning and Imputation for Mass-spec Bias Reduction (LIMBR) software, which builds on previous block-based models of batch effects and includes features specific to time series and circadian studies. To aid in the analysis of time series proteomics experiments, which are often plagued by missing data points, we also integrate an imputation system. By combining imputation and time-series-tailored bias modeling in one straightforward software package, we expect LIMBR to significantly improve the quality and ease of large-scale proteomics and genomics time series experiments.
