Inferring Unknown Biological Function by Integration of GO Annotations and Gene Expression Data

Characterizing genes with semantic information is an important step in describing gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of many of their genes are still unknown. Since studying the functions of those genes experimentally, one by one, would be infeasible, new computational methods for gene function inference are needed. We present here a novel computational approach for inferring the biological function of a set of genes with previously unknown function, given a set of genes with well-known information. The approach is based on the premise that genes with similar behaviour should be grouped together, which is known as the guilt-by-association principle. Clustering techniques can therefore be used to obtain groups in which unknown genes are co-clustered with genes that have well-known semantic information (GO annotations), and these well-known genes provide meaningful knowledge from which the unknown semantic information can be inferred. We provide a method to explore the potential function of new genes according to those currently annotated. The results obtained indicate that the proposed approach could be a useful and effective tool for biologists to guide the inference of biological functions for recently discovered genes. Our work contributes to the identification of unknown gene functions through clustering that uses an external source of biological input. A simple web interface to this proposal can be found at http://fich.unl.edu.ar/sinc/webdemo/gamma-am/.
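The guilt-by-association step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that cluster assignments have already been produced by some clustering method, and the function name `infer_go_terms`, the `min_support` threshold, the gene names, and the GO identifiers used as sample data are all hypothetical choices made for illustration. Each unannotated gene receives the GO terms carried by at least a given fraction of the annotated genes in its cluster.

```python
from collections import Counter, defaultdict


def infer_go_terms(clusters, annotations, min_support=0.5):
    """Transfer GO terms from annotated to unannotated genes within clusters.

    clusters:    dict mapping gene name -> cluster id (assumed precomputed)
    annotations: dict mapping gene name -> set of GO term ids
    min_support: hypothetical threshold; minimum fraction of annotated
                 cluster members that must carry a term for it to be transferred
    """
    # Group genes by cluster id.
    members = defaultdict(list)
    for gene, cluster_id in clusters.items():
        members[cluster_id].append(gene)

    predictions = {}
    for genes in members.values():
        annotated = [g for g in genes if annotations.get(g)]
        unknown = [g for g in genes if not annotations.get(g)]
        if not annotated or not unknown:
            continue  # nothing to transfer from, or nothing to transfer to

        # Count how many annotated members carry each GO term.
        counts = Counter(term for g in annotated for term in annotations[g])
        supported = sorted(
            (
                (term, n / len(annotated))
                for term, n in counts.items()
                if n / len(annotated) >= min_support
            ),
            key=lambda pair: (-pair[1], pair[0]),  # best supported first
        )
        for g in unknown:
            predictions[g] = list(supported)
    return predictions


# Illustrative data only: gene names and term assignments are made up.
clusters = {
    "geneA": 0, "geneB": 0, "geneC": 0, "geneX": 0,
    "geneD": 1, "geneY": 1,
}
annotations = {
    "geneA": {"GO:0006950", "GO:0009409"},
    "geneB": {"GO:0009409"},
    "geneC": {"GO:0009409", "GO:0008152"},
    "geneD": {"GO:0008152"},
}

predictions = infer_go_terms(clusters, annotations)
for gene, terms in sorted(predictions.items()):
    print(gene, terms)
```

In this sketch the support fraction only ranks candidate terms; a real analysis would likely also weigh cluster quality and the semantic similarity between GO terms before suggesting an annotation.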
