Systematic Planning of Genome-Scale Experiments in Poorly Studied Species

Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the extensive knowledge available for model organisms, the planning of genome-scale experiments in poorly studied species is still based on expert intuition or heuristic trials. We propose that systematic computational approaches can drive the experiment-planning process in poorly studied species, based on the data and knowledge available for closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes and thereby enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process, as well as the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organism to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that fewer than 10% of the experiments could achieve functional coverage similar to that of the whole microarray compendium. This estimate was confirmed by performing the recommended experiments in S. bayanus, thereby significantly reducing the labor required to characterize the poorly studied genome.
This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as other groups of organisms.
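
The idea of ranking experiments by per-process accuracy while discounting overlap can be illustrated with a small sketch. This is not the optimization used in the paper, which the abstract does not specify; it is a simple greedy heuristic chosen for illustration. The function name `recommend_experiments`, the toy dataset names, and the accuracy values are all hypothetical. Coverage of a process is assumed here to be the best accuracy achieved by any dataset selected so far, so a dataset that duplicates an already selected one contributes little marginal gain.

```python
from typing import Dict, List


def recommend_experiments(accuracy: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Greedily order up to k datasets by marginal gain in process coverage.

    accuracy[d][p] is an (assumed) accuracy score of dataset d for biological
    process p, estimated in the model organism. Coverage of a process is the
    best score among the datasets selected so far.
    """
    processes = sorted({p for scores in accuracy.values() for p in scores})
    best = {p: 0.0 for p in processes}  # current coverage per process
    remaining = set(accuracy)
    ranked: List[str] = []

    while remaining and len(ranked) < k:
        def gain(dataset: str) -> float:
            # Improvement over current coverage, summed across processes.
            return sum(
                max(accuracy[dataset].get(p, 0.0) - best[p], 0.0)
                for p in processes
            )

        # Sorting makes tie-breaking deterministic (alphabetical order).
        choice = max(sorted(remaining), key=gain)
        ranked.append(choice)
        remaining.remove(choice)
        for p in processes:
            best[p] = max(best[p], accuracy[choice].get(p, 0.0))

    return ranked


# Hypothetical toy scores: two stress datasets overlap heavily in coverage.
toy = {
    "heat_shock": {"stress response": 0.9, "cell cycle": 0.2},
    "oxidative_stress": {"stress response": 0.85, "cell cycle": 0.1},
    "cell_cycle_timecourse": {"stress response": 0.1, "cell cycle": 0.95},
}
print(recommend_experiments(toy, k=2))
```

In this toy example the second pick is the cell-cycle time course rather than the second stress dataset, because the latter adds nothing beyond what the first stress dataset already covers. A per-process recommendation, as also mentioned in the abstract, could be obtained under the same assumptions by sorting datasets on a single process's score.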
