A data mining paradigm for identifying key factors in biological processes using gene expression data

Large volumes of biological data are being generated to study the mechanisms of various biological processes. These data enable large-scale computational analyses that can yield biological insights, but mining them efficiently for knowledge discovery remains a challenge. Their heterogeneity makes consistent integration difficult and slows biological discovery. We introduce a data processing paradigm that identifies key factors in biological processes through three steps: systematic collection of gene expression datasets, primary analysis of each dataset, and evaluation of signals that are consistent across datasets. To demonstrate its effectiveness, we applied the paradigm to epidermal development and identified many genes with a potential role in this process. Besides known epidermal development genes, a substantial proportion of the identified genes have not yet been supported by gain- or loss-of-function studies, providing many novel candidates for future study. From these, we selected a top-ranked gene for loss-of-function experimental validation and confirmed its function in epidermal differentiation, demonstrating that the paradigm can identify new factors in biological processes. The paradigm also revealed many key genes in cold-induced thermogenesis using data from cold-challenged tissues, demonstrating its generalizability. This paradigm can support the study of molecular mechanisms in an era of rapidly accumulating publicly available biological data.
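As an illustration of the final step, evaluating consistent signals across datasets, the following is a minimal sketch in Python. It assumes that the primary analysis of each dataset has already produced a direction call per significant gene (+1 for up-regulated, -1 for down-regulated). The scoring rule (supporting datasets minus conflicting datasets), the function name rank_consistent_genes, the min_datasets threshold, and the dataset and gene names are all hypothetical and are not taken from the paper; the authors' actual statistic may differ.

```python
from collections import Counter


def rank_consistent_genes(calls_by_dataset, min_datasets=2):
    """Rank genes by how consistently they change across datasets.

    calls_by_dataset maps a dataset ID to a dict of gene -> direction,
    where direction is +1 (significantly up) or -1 (significantly down).
    This vote-counting score is an illustrative assumption, not the
    paper's method.
    """
    up = Counter()
    down = Counter()
    for calls in calls_by_dataset.values():
        for gene, direction in calls.items():
            if direction > 0:
                up[gene] += 1
            elif direction < 0:
                down[gene] += 1

    ranked = []
    for gene in set(up) | set(down):
        support = max(up[gene], down[gene])   # datasets agreeing with the majority direction
        conflict = min(up[gene], down[gene])  # datasets pointing the other way
        score = support - conflict
        if score >= min_datasets:
            ranked.append((gene, score))

    # Highest score first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical placeholder data: three datasets, three genes.
example_calls = {
    "dataset_1": {"geneA": 1, "geneB": 1, "geneC": -1},
    "dataset_2": {"geneA": 1, "geneB": -1},
    "dataset_3": {"geneA": 1, "geneC": -1},
}

for gene, score in rank_consistent_genes(example_calls):
    print(gene, score)
```

In this toy input, geneB is called up in one dataset and down in another, so its votes cancel and it falls below the threshold, while genes that change in the same direction across several datasets rise to the top of the list.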
