A factor analysis model for functional genomics

Background: Expression array data are used to predict the biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this approach is both statistically and computationally challenging. Typical methods are computationally expensive and ignore correlations among expression profiles and among functional categories.

Results: We propose a factor analysis model (FAM) for functional genomics and a two-step algorithm, which we demonstrate on genome-wide expression data for yeast and a subset of Gene Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to that of the current best approach, while our total computation time is lower by a factor of 4000. We discuss the unique challenges in evaluating the performance of algorithms used for genome-wide functional genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance.

Conclusion: Our factor analysis model is a computationally efficient technique for functional genomics. It provides a clear and unified statistical framework, with the potential to incorporate important gene ontology information to improve predictions.
