Statistical Approach of Gene Set Analysis with Quantitative Trait Loci for Crop Gene Expression Studies

Genome-wide expression profiling is a powerful genomic technology for quantifying the expression dynamics of genes across a genome. In gene expression studies, gene set analysis has become the first choice for gaining insight into the underlying biology of diseases or stresses in plants. It also reduces the complexity of statistical analysis and enhances the explanatory power of the results obtained from the primary downstream differential expression analysis. Gene set analysis approaches are well developed for microarray and RNA-seq gene expression data. These approaches mainly analyze gene sets defined by gene ontology or pathway annotation data. However, in plant biology, such methods may not establish any formal relationship between genotypes and phenotypes, as most traits are quantitative and controlled by polygenes. The existing Quantitative Trait Loci (QTL)-based gene set analysis approaches focus only on over-representation analysis of the selected genes and ignore their associated gene scores. Therefore, we developed an innovative statistical approach, GSQSeq, to analyze gene sets with trait-enriched QTL data. This approach takes the differential expression scores of the genes into account when analyzing the gene sets. The performance of the developed method was tested on five crop gene expression datasets obtained from real crop gene expression studies. Our analytical results indicated that trait-specific analysis of gene sets was more robust and successful with the proposed approach than with existing techniques. Further, the developed method provides a valuable platform for integrating gene expression data with QTL data.
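The distinction between over-representation analysis and score-based analysis of a QTL gene set can be illustrated with a minimal sketch. This is not the GSQSeq procedure itself, whose details are not given here. It only contrasts a count-based hypergeometric test with a simple permutation test on gene scores. The function names, the toy scores, and the choice of the mean score as the test statistic are all assumptions made for illustration.

```python
import random
from math import comb


def overrepresentation_pvalue(n_genes, n_qtl, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    Uses only counts: how many selected genes fall inside QTL regions.
    The differential expression scores of the genes are ignored.
    """
    total = comb(n_genes, n_selected)
    upper = min(n_qtl, n_selected)
    tail = sum(
        comb(n_qtl, k) * comb(n_genes - n_qtl, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def score_based_pvalue(scores, qtl_genes, n_perm=2000, seed=0):
    """Permutation p-value for the mean score of genes inside QTL regions.

    Illustrative assumption: the set statistic is the mean gene score, and
    the null distribution comes from random gene sets of the same size.
    """
    rng = random.Random(seed)
    in_set = [g for g in scores if g in qtl_genes]
    observed = sum(scores[g] for g in in_set) / len(in_set)
    values = list(scores.values())
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(values, len(in_set))
        if sum(sample) / len(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 20 genes with scores 0..19, three of them in QTL regions.
toy_scores = {f"g{i}": float(i) for i in range(20)}
p_count = overrepresentation_pvalue(n_genes=20, n_qtl=3, n_selected=5, n_overlap=3)
p_score = score_based_pvalue(toy_scores, {"g17", "g18", "g19"})
```

The count-based test sees only whether a gene was selected, so two gene sets with the same overlap receive the same p-value regardless of how strongly their genes are differentially expressed. The score-based test uses the magnitude of each gene's score, which is the kind of information the approach described above is intended to retain.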
