A pipeline for RNA-seq based eQTL analysis with automated quality control procedures

Background: Advances in expression quantitative trait loci (eQTL) studies have provided valuable insights into the mechanisms of disease- and trait-associated genetic variants. However, evaluating and controlling the quality of heterogeneous, multi-source raw eQTL data remains challenging for researchers with limited computational background. There is an urgent need for a powerful, user-friendly tool that automatically processes raw datasets in various formats and then performs eQTL mapping.

Results: In this work, we present eQTLQC, a pipeline for eQTL analysis featuring automated preprocessing of both genotype data and gene expression data. The pipeline provides a set of quality control and normalization approaches and uses automated techniques to reduce manual intervention. We demonstrate its utility and robustness through eQTL case studies on multiple independent real-world datasets with RNA-seq data and whole genome sequencing (WGS) based genotype data.

Conclusions: eQTLQC provides a reliable computational workflow for eQTL analysis. It offers standard quality control, normalization, and eQTL mapping procedures for raw eQTL data in multiple formats. The source code, demo data, and instructions are freely available at https://github.com/stormlovetao/eQTLQC.
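To make the stages named in the abstract concrete (expression quality control, normalization, then eQTL mapping), the following is a minimal sketch on toy data. It is not the eQTLQC implementation: the function names, thresholds, and data are hypothetical, the normalization is a plain log2 transform, and the mapping step is an ordinary least-squares fit of expression on genotype dosage without covariates or multiple-testing correction.

```python
import math

def filter_low_expression(expr, min_value=1.0, min_fraction=0.5):
    """Keep genes expressed at >= min_value in at least min_fraction of samples.

    Thresholds are illustrative assumptions, not values taken from eQTLQC.
    """
    kept = {}
    for gene, values in expr.items():
        n_pass = sum(1 for v in values if v >= min_value)
        if n_pass / len(values) >= min_fraction:
            kept[gene] = values
    return kept

def log2_transform(expr, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every expression value."""
    return {
        gene: [math.log2(v + pseudocount) for v in values]
        for gene, values in expr.items()
    }

def linear_association(dosage, values):
    """Return (slope, Pearson r) for expression regressed on genotype dosage."""
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, values))
    if sxx == 0 or syy == 0:
        # A monomorphic variant or constant expression carries no signal.
        return 0.0, 0.0
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Toy inputs: six samples, two genes, one variant coded as 0/1/2 alt-allele dosage.
expression = {
    "GENE_A": [1.0, 3.0, 7.0, 3.0, 7.0, 1.0],
    "GENE_B": [0.0, 0.0, 0.5, 0.0, 2.0, 0.0],  # mostly unexpressed
}
dosage = [0, 1, 2, 1, 2, 0]

filtered = filter_low_expression(expression)
normalized = log2_transform(filtered)
results = {
    gene: linear_association(dosage, values)
    for gene, values in normalized.items()
}

for gene, (slope, r) in results.items():
    print(f"{gene}: slope={slope:.3f}, r={r:.3f}")
```

In this toy setup GENE_B is dropped by the expression filter and only GENE_A reaches the association step. A real analysis would also run genotype-level quality control, adjust for covariates, and correct for multiple testing, which this sketch omits.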
