UvA-DARE (Digital Academic Repository)
SigWin-detector: A Grid-enabled workflow for discovering enriched windows of genomic features related to DNA sequences

Background: Chromosome location is often used as a scaffold to organize genomic information, both in the living cell and in molecular biological research. Thus, ever-increasing amounts of data about genomic features are stored in public databases and can be readily visualized by genome browsers. To perform in silico experimentation conveniently with these genomics data, biologists need tools to process and compare datasets routinely and to explore the obtained results interactively. The complexity of such experimentation requires these tools to be based on an e-Science approach, and hence to be generic, modular, and reusable. A virtual laboratory environment with workflows, workflow management systems, and Grid computation is therefore essential.

Findings: Here we apply an e-Science approach to develop SigWin-detector, a workflow-based tool that can detect significantly enriched windows of (genomic) features in a (DNA) sequence in a fast and reproducible way. As a proof-of-principle, we use a biological use case to detect regions of increased and decreased gene expression (RIDGEs and anti-RIDGEs) in human transcriptome maps. We improved the original method for RIDGE detection in two ways: by replacing the costly estimation of the null-hypothesis distribution by random sampling with a faster analytical formula, and by developing a new algorithm for computing moving medians. SigWin-detector was developed using the WS-VLAM workflow management system and consists of several reusable modules that are linked together in a basic workflow. The configuration of this basic workflow can be adapted to satisfy the requirements of the specific in silico experiment.

Conclusion: As we show with the results from analyses in the biological use case on RIDGEs, SigWin-detector is an efficient and reusable Grid-based tool for discovering windows enriched for features of a particular type in any sequence of values. Thus, SigWin-detector provides the proof-of-principle for the modular e-Science based concept of integrative bioinformatics experimentation.
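
To make the two ingredients named in the Findings more concrete, the following minimal Python sketch computes moving medians over a sequence and evaluates a null-hypothesis tail probability analytically instead of by random sampling. It is an illustration under stated assumptions, not the SigWin-detector implementation: the function names `moving_median` and `median_tail_probability` are hypothetical, the window size is assumed to be odd, the moving median uses a plain sorted-window update rather than the new algorithm the abstract refers to, and the null model is assumed to be a window drawn without replacement from the whole sequence (a hypergeometric order-statistic formula).

```python
from bisect import bisect_left, insort
from math import comb


def moving_median(values, w):
    """Median of every length-w window (w odd), keeping the window sorted.

    Illustrative sorted-window update; not the algorithm from the paper.
    """
    if w % 2 == 0 or not 0 < w <= len(values):
        raise ValueError("w must be odd and between 1 and len(values)")
    window = sorted(values[:w])
    medians = [window[w // 2]]
    # Slide by one position: drop the value leaving, insert the value entering.
    for old, new in zip(values, values[w:]):
        del window[bisect_left(window, old)]
        insort(window, new)
        medians.append(window[w // 2])
    return medians


def median_tail_probability(values, w, x):
    """P(median >= x) for w values drawn without replacement from `values`.

    Assumed null model: the median of an odd-sized sample is >= x exactly
    when at most w // 2 of the sampled values are smaller than x.
    """
    n = len(values)
    m = sum(v < x for v in values)  # values in the sequence below x
    k = w // 2
    favorable = sum(comb(m, j) * comb(n - m, w - j) for j in range(k + 1))
    return favorable / comb(n, w)


if __name__ == "__main__":
    expression = [1, 5, 2, 8, 3]
    for med in moving_median(expression, 3):
        print(med, median_tail_probability(expression, 3, med))
```

A window would then be flagged as enriched when this tail probability falls below a chosen significance threshold, ideally after a multiple-testing correction across all windows; the choice of threshold and correction is left open here.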
