Optimizing gene set annotations combining GO structure and gene expression data

Background
With the rapid accumulation of genomic data, annotating and interpreting these data has become a challenging problem. Gene set enrichment analysis, a representative approach, is widely used to interpret large molecular datasets generated by biological experiments. Its results rely heavily on the quality and integrity of gene set annotations. Although several methods have been developed to annotate gene sets, high-quality annotation methods are still lacking. Here, we propose a novel method that improves annotation accuracy by combining the GO structure with gene expression data.

Results
We propose a novel approach for optimizing gene set annotations to obtain more accurate annotation results. The method filters inconsistent annotations using GO structure information and probabilistic gene set clusters computed over a range of cluster sizes on multiple bootstrap-resampled datasets. We apply the method to p53 cell line, colon cancer, and breast cancer gene expression data. The experimental results show that it filters out a number of annotations unrelated to the experimental data, increases gene set enrichment power, and reduces annotation inconsistency.

Conclusions
A novel gene set annotation optimization approach is proposed to improve the quality of gene annotations. Experimental results indicate that the method effectively improves gene set annotation quality based on the GO structure and gene expression data.
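The idea of probabilistic gene set clusters computed over several cluster sizes on bootstrap-resampled data can be illustrated with a small sketch. This is not the authors' implementation. It assumes a plain k-means clustering with a deterministic farthest-point initialisation, a simple mean co-clustering score, and a 0.5 cut-off, and it leaves out the GO structure step entirely. All function names, the toy expression values, and the threshold are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=20):
    """Plain k-means (an assumed stand-in clustering), requires k <= len(points)."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from all chosen centers.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def co_cluster_probability(expr, ks=(2, 3), n_boot=30, seed=0):
    """Fraction of (bootstrap, k) runs in which each gene pair shares a cluster."""
    rng = random.Random(seed)
    genes = sorted(expr)
    n_samples = len(expr[genes[0]])
    counts = {(a, b): 0 for i, a in enumerate(genes) for b in genes[i + 1:]}
    runs = 0
    for _ in range(n_boot):
        # Bootstrap: resample the samples (columns) with replacement.
        cols = [rng.randrange(n_samples) for _ in range(n_samples)]
        points = [[expr[g][c] for c in cols] for g in genes]
        for k in ks:
            labels = kmeans_labels(points, k)
            runs += 1
            for i, a in enumerate(genes):
                for j in range(i + 1, len(genes)):
                    if labels[i] == labels[j]:
                        counts[(a, genes[j])] += 1
    return {pair: n / runs for pair, n in counts.items()}


def filter_gene_set(gene_set, prob, threshold=0.5):
    """Keep genes whose mean co-clustering probability with the other
    set members reaches the (hypothetical) threshold."""
    kept = []
    for g in gene_set:
        scores = [prob[tuple(sorted((g, h)))] for h in gene_set if h != g]
        if scores and sum(scores) / len(scores) >= threshold:
            kept.append(g)
    return kept


# Toy expression data (made up): genes A-C share one profile, D-F the opposite.
expr = {
    "A": [9.0, 10.0, 11.0, 1.0, 2.0, 1.0],
    "B": [9.2, 10.1, 10.8, 1.1, 1.9, 1.2],
    "C": [8.9, 9.8, 11.1, 0.9, 2.1, 0.8],
    "D": [1.0, 2.0, 1.0, 9.0, 10.0, 11.0],
    "E": [1.1, 1.8, 1.2, 9.1, 10.2, 10.9],
    "F": [0.9, 2.1, 0.9, 8.8, 9.9, 11.2],
}
prob = co_cluster_probability(expr, ks=(2, 3), n_boot=20, seed=1)
print(filter_gene_set(["A", "B", "C", "D"], prob))
```

In this toy setting, gene D never clusters with A, B, or C, so an annotation set listing all four would lose D. Averaging over several cluster sizes and resamples is what makes the score a probability rather than a single hard partition.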
