Statistical Approaches for Gene Selection, Hub Gene Identification and Module Interaction in Gene Co-Expression Network Analysis: An Application to Aluminum Stress in Soybean (Glycine max L.)

Selection of informative genes is an important problem in gene expression studies. The small sample size and the large number of genes in gene expression data make the selection process complex. Further, the selected informative genes may serve as a vital input for gene co-expression network analysis. Moreover, the identification of hub genes and module interactions in gene co-expression networks is yet to be fully explored. This paper presents a statistically sound gene selection technique based on the support vector machine algorithm for selecting informative genes from high-dimensional gene expression data. A statistical approach is also developed for identifying hub genes in the gene co-expression network. In addition, a differential hub gene analysis approach is developed to group the identified hub genes according to their gene connectivity in a case vs. control study. Based on this proposed approach, an R package, dhga (https://cran.r-project.org/web/packages/dhga), has been developed. The comparative performance of the proposed gene selection technique and of the hub gene identification approach was evaluated on three different crop microarray datasets. The proposed gene selection technique outperformed most of the existing techniques in selecting a robust set of informative genes. The proposed hub gene identification approach identified fewer hub genes than the existing approach, which is in accordance with the scale-free property of real networks. In this study, some key genes along with their Arabidopsis orthologs have been reported, which can be used for engineering the aluminum toxicity stress response in soybean. The functional analysis of the selected key genes revealed the underlying molecular mechanisms of the aluminum toxicity stress response in soybean.
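
To make the idea of grouping hub genes by connectivity in a case vs. control study more concrete, the following is a minimal Python sketch, not the method implemented in dhga. It assumes a weighted co-expression network whose edge weights are absolute Pearson correlations raised to a soft-threshold power `beta`, and it labels the top fraction of genes by connectivity as hubs. Both choices, along with the function names and the `beta` and `fraction` parameters, are illustrative assumptions; the paper's approach identifies hubs statistically rather than with a fixed cutoff.

```python
import numpy as np

def weighted_connectivity(expr, beta=6):
    """Per-gene connectivity in a weighted co-expression network.

    expr: array of shape (genes, samples).
    beta: soft-threshold power (assumed value, not taken from the paper).
    """
    corr = np.corrcoef(expr)        # gene-by-gene Pearson correlation
    adj = np.abs(corr) ** beta      # weighted adjacency
    np.fill_diagonal(adj, 0.0)      # ignore self-connections
    return adj.sum(axis=1)          # connectivity = sum of edge weights

def top_hubs(connectivity, fraction=0.1):
    """Indices of the most connected genes (a simple cutoff rule, assumed here)."""
    n_hubs = max(1, int(round(fraction * connectivity.size)))
    order = np.argsort(connectivity)[::-1]   # highest connectivity first
    return set(order[:n_hubs].tolist())

def group_hubs(case_expr, control_expr, beta=6, fraction=0.1):
    """Group hub genes by whether they are hubs in case, control, or both."""
    case_hubs = top_hubs(weighted_connectivity(case_expr, beta), fraction)
    control_hubs = top_hubs(weighted_connectivity(control_expr, beta), fraction)
    return {
        "shared": case_hubs & control_hubs,
        "case_only": case_hubs - control_hubs,
        "control_only": control_hubs - case_hubs,
    }

if __name__ == "__main__":
    # Synthetic expression matrices: 20 genes measured in 8 samples each.
    rng = np.random.default_rng(0)
    case = rng.normal(size=(20, 8))
    control = rng.normal(size=(20, 8))
    print(group_hubs(case, control))
```

The three groups returned here correspond loosely to hubs present under both conditions, hubs specific to the stress condition, and hubs specific to the control condition.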
