Interactive visualization and model-based analysis of genomics data

Biotechnologies such as DNA sequencing and microarrays have shifted research in molecular biology from "gene-centric" to "genome-centric", opening new avenues in drug discovery, clinical diagnostics, and agriculture. New computational and visualization techniques are urgently needed to extract biological knowledge from massive amounts of heterogeneous, complex genomics data. Computational methods transform biological questions into mathematical problems and produce large quantities of numerical results that still require biological interpretation. Presenting such data visually, combined with interactive exploration, can help biologists comprehend the underlying biology. Motivated by the challenge of bridging the gap between data processing by computational scientists and interpretation by life scientists, this dissertation presents visualization tools for genomics data with easy-to-use interfaces. Understanding gene regulation is one of the most important and challenging questions that the scientific community is increasingly trying to answer with genomics data. Different types of genomics data shed light on different aspects of gene regulation, so integrating them is essential to understanding the process as a whole. To facilitate such integration, this dissertation presents a model of gene regulation that leads to a graph-theoretic structure, which provides an invariant view of regulation from both sequence and gene expression data. Methods for approximating this structure from gene expression data and DNA-protein interaction data are also presented.
