A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

Abstract

Background: Data errors, including sample swapping and mislabeling, are inevitable in large-scale omics data generation. They need to be identified and corrected before integrative data analyses, in which different types of data are merged on the basis of their annotated labels. Data with labeling errors dampen true biological signals. More importantly, analyzing data that contain sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data and to identify and correct data annotation errors in large databases.

Results: Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Applying proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and the International Cancer Genome Consortium identified sample errors in multiple cancer datasets. Our procedure was able not only to identify sample-labeling errors but also to unambiguously identify the source of the errors. Our results demonstrate that these errors should be identified and corrected before integrative analysis.

Conclusions: Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis.
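The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea of matching samples across two omics data types through cis-associated feature pairs. It is not proMODMatcher itself: the similarity score (mean product of z-scores), the reciprocal-best-match rule, the function names, and the simulated data are all assumptions made for illustration, and the sketch has no probabilistic component.

```python
import random
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Standardize each feature (column) across samples."""
    out_cols = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        out_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]


def similarity(a_row, b_row):
    """Mean product of z-scores over the cis-paired features."""
    return mean(x * y for x, y in zip(a_row, b_row))


def flag_mismatches(omics_a, omics_b):
    """Map each suspect label to the label it matches best.

    Row i of both matrices carries the same annotated label, and
    column k of both matrices is one (hypothetical) cis-associated pair.
    A label is flagged when its best match in the other data type is a
    different label and that match is reciprocal.
    """
    za, zb = zscore_columns(omics_a), zscore_columns(omics_b)
    n = len(za)
    sim = [[similarity(za[i], zb[j]) for j in range(n)] for i in range(n)]
    best_b_for_a = [max(range(n), key=lambda j: sim[i][j]) for i in range(n)]
    best_a_for_b = [max(range(n), key=lambda i: sim[i][j]) for j in range(n)]
    flagged = {}
    for i in range(n):
        j = best_b_for_a[i]
        if j != i and best_a_for_b[j] == i:
            flagged[i] = j
    return flagged


# Simulated example (illustrative values only): 20 samples, 200 cis pairs.
rng = random.Random(0)
n_samples, n_pairs = 20, 200
omics_a = [[rng.gauss(0, 1) for _ in range(n_pairs)] for _ in range(n_samples)]
# The second data type tracks the first through the cis pairs, plus noise.
omics_b = [[v + rng.gauss(0, 0.3) for v in row] for row in omics_a]
# Introduce one labeling error: swap the profiles labeled 3 and 7.
omics_b[3], omics_b[7] = omics_b[7], omics_b[3]

flagged = flag_mismatches(omics_a, omics_b)
print(flagged)
```

In this simulation the two swapped labels should be flagged, each pointing at the other. A reciprocal rule is used here because it guards against a single noisy profile that merely resembles many others; a real procedure would also need a significance threshold for accepting a match.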
