Prediction and integration of regulatory and protein-protein interactions.

Knowledge of transcriptional regulatory interactions (TRIs) is essential for exploring functional genomics and systems biology in any organism. Although several genome-wide analyses of transcriptional regulatory networks are available, they are limited to model organisms such as yeast (1) and worm (2). Beyond these networks, experiments on TRIs cover only individual genes and proteins of specific interest. In this chapter, we present a method that integrates various data sets to predict TRIs for 54 organisms in the Bioverse (3). We describe how to compile data sets from different sources, how to handle their differing formats and identifiers, and how to predict TRIs from the compiled data using a homology-based approach. The integrated data sets include experimentally verified TRIs, transcription factor binding sites, promoter sequences, protein subcellular localization, and protein families. The predicted TRIs expand the gene regulatory networks of a large number of organisms. Integrating experimentally verified and predicted TRIs with known protein-protein interactions (PPIs) gives insight into specific pathways, network motifs, and the topological dynamics of an integrated network with gene expression under different conditions.
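The homology-based idea summarized above can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name `transfer_tris`, the identifiers, the similarity values, and the 0.3 cutoff are all hypothetical, and the sketch ignores the binding-site, localization, and protein-family evidence that the chapter also integrates. It only shows the core projection step, in which a verified transcription factor and target pair in a source organism is mapped onto homologs of both partners in a target organism.

```python
from itertools import product


def transfer_tris(known_tris, homologs, min_similarity=0.3):
    """Project known (tf, target) pairs onto homologs in another organism.

    known_tris: iterable of (tf_id, target_id) pairs from the source organism.
    homologs: dict mapping a source ID to a list of (homolog_id, similarity)
              tuples in the target organism (similarity in [0, 1]).
    min_similarity: hypothetical cutoff; both partners must pass it.
    Returns a dict {(tf_homolog, target_homolog): score}, where the score is
    the weaker of the two similarities (an assumed scoring rule).
    """
    predicted = {}
    for tf, target in known_tris:
        tf_hits = [h for h in homologs.get(tf, []) if h[1] >= min_similarity]
        tg_hits = [h for h in homologs.get(target, []) if h[1] >= min_similarity]
        for (tf_h, tf_s), (tg_h, tg_s) in product(tf_hits, tg_hits):
            score = min(tf_s, tg_s)
            # Keep the best-supported prediction for each projected pair.
            if score > predicted.get((tf_h, tg_h), 0.0):
                predicted[(tf_h, tg_h)] = score
    return predicted


# Toy data, purely illustrative (not taken from any database).
known = [("tfA", "geneX"), ("tfA", "geneY")]
homologs = {
    "tfA": [("orgB_tf1", 0.45), ("orgB_tf2", 0.20)],
    "geneX": [("orgB_g1", 0.62)],
    "geneY": [("orgB_g2", 0.25)],
}
predictions = transfer_tris(known, homologs)
for pair, score in sorted(predictions.items()):
    print(pair, score)
```

Taking the minimum of the two similarities is one simple way to make a projected interaction only as trustworthy as its weaker homology link; other combination rules would be equally plausible.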

[1]  Cheng-Yan Kao,et al.  POINT: a database for the prediction of protein-protein interactions based on the orthologous interactome , 2004, Bioinform..

[2]  Michael Q. Zhang,et al.  Identifying combinatorial regulation of transcription factors and binding motifs , 2004, Genome Biology.

[3]  Igor Jurisica,et al.  Online Predicted Human Interaction Database , 2005, Bioinform..

[4]  W Ansorge,et al.  Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. , 2000, Nature.

[5]  M. Gerstein,et al.  A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. , 2000, Journal of molecular biology.

[6]  J. Berg Genome sequence of the nematode C. elegans: a platform for investigating biology. , 1998, Science.

[7]  Kimberly Van Auken,et al.  WormBase: a comprehensive data resource for Caenorhabditis biology and genomics , 2004, Nucleic Acids Res..

[8]  E. O’Shea,et al.  Global analysis of protein localization in budding yeast , 2003, Nature.

[9]  Robert D. Finn,et al.  Pfam: clans, web tools and services , 2005, Nucleic Acids Res..

[10]  Robert D. Finn,et al.  New developments in the InterPro database , 2007, Nucleic Acids Res..

[11]  Jonathan Lim,et al.  Ulysses - an application for the projection of molecular interactions across species , 2005, Genome Biology.

[12]  M. Vidal,et al.  RBF, a novel RB-related gene that regulates E2F activity and interacts with cyclin E in Drosophila. , 1996, Genes & development.

[13]  Hiroaki Kitano,et al.  The PANTHER database of protein families, subfamilies, functions and pathways , 2004, Nucleic Acids Res..

[14]  Frances M. G. Pearl,et al.  The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis , 2004, Nucleic Acids Res..

[15]  Amos Bairoch,et al.  The PROSITE database , 2005, Nucleic Acids Res..

[16]  Jungwon Yoon,et al.  The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community , 2003, Nucleic Acids Res..

[17]  Lisa M. D'Souza,et al.  Genome sequence of the Brown Norway rat yields insights into mammalian evolution , 2004, Nature.

[18]  European Union Chromosome 3 Arabidopsis Genome Sequencing Consortium,et al.  Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana , 2000, Nature.

[19]  Tim J. P. Hubbard,et al.  SCOP database in 2004: refinements integrate structure and sequence family data , 2004, Nucleic Acids Res..

[20]  D. Pe’er,et al.  Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data , 2003, Nature Genetics.

[21]  W. Pearson Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. , 1991, Genomics.

[22]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[23]  M. Gerstein,et al.  Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. , 2001, Genome research.

[24]  Stephen M. Mount,et al.  The genome sequence of Drosophila melanogaster. , 2000, Science.

[25]  Jun Kawai,et al.  LOCATE: a mouse protein subcellular localization database , 2005, Nucleic Acids Res..

[26]  John B. Anderson,et al.  CDD: a Conserved Domain Database for protein classification , 2004, Nucleic Acids Res..

[27]  Cathy H. Wu,et al.  InterPro, progress and status in 2005 , 2004, Nucleic Acids Res..

[28]  Shoshana J. Wodak,et al.  CYGD: the Comprehensive Yeast Genome Database , 2004, Nucleic Acids Res..

[29]  Andrew Smith Genome sequence of the nematode C-elegans: A platform for investigating biology , 1998 .

[30]  F. Baas,et al.  The Human Transcriptome Map: Clustering of Highly Expressed Genes in Chromosomal Domains , 2001, Science.

[31]  Thomas Meitinger,et al.  MitoP2, an integrated database on mitochondrial proteins in yeast and man , 2004, Nucleic Acids Res..

[32]  X. Lin, William C. Nierman,et al.  Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana , 1999, Nature 402, 761-768.

[33]  International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome , 2001, Nature.

[34]  Jérôme Gouzy,et al.  The ProDom database of protein domain families , 1998, Nucleic Acids Res..

[35]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[36]  Robert S. Ledley,et al.  PIRSF: family classification system at the Protein Information Resource , 2004, Nucleic Acids Res..

[37]  Christian von Mering,et al.  STRING: known and predicted protein–protein associations, integrated and transferred across organisms , 2004, Nucleic Acids Res..

[38]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[39]  Kei-Hoi Cheung,et al.  TRIPLES: a database of gene function in Saccharomyces cerevisiae , 2000, Nucleic Acids Res..

[40]  David A. Lee,et al.  Gene3D: modelling protein structure, function and evolution , 2005, Nucleic Acids Res..

[41]  H. Margalit,et al.  Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. , 1998, Nucleic acids research.

[42]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[43]  Burkhard Rost,et al.  LOC3D: annotate sub-cellular localization for protein structures , 2003, Nucleic Acids Res..

[44]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[45]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[46]  T. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2006, Nucleic Acids Res..

[47]  C. Chothia,et al.  The geometry of domain combination in proteins. , 2002, Journal of molecular biology.

[48]  Tianwei Yu,et al.  Inference of transcriptional regulatory network by two-stage constrained space factor analysis , 2005, Bioinform..

[49]  D. Baker,et al.  Protein–DNA binding specificity predictions with structural models , 2005, Nucleic acids research.

[50]  M. Gerstein,et al.  Genomic analysis of regulatory network dynamics reveals large topological changes , 2004, Nature.

[51]  Kei-Hoi Cheung,et al.  Large-scale analysis of the yeast genome by transposon tagging and gene disruption , 1999, Nature.

[52]  Shmuel Pietrokovski,et al.  Increased coverage of protein families with the Blocks Database servers , 2000, Nucleic Acids Res..

[53]  Kenta Nakai,et al.  BTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics , 2004, Nucleic Acids Res..

[54]  Owen White,et al.  The TIGRFAMs database of protein families , 2003, Nucleic Acids Res..

[55]  M Gerstein,et al.  Genome-wide analysis relating expression level with protein subcellular localization. , 2000, Trends in genetics : TIG.

[56]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[57]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[58]  Julio Collado-Vides,et al.  RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions , 2005, Nucleic Acids Res..

[59]  A. Elofsson,et al.  Domain rearrangements in protein evolution. , 2005, Journal of molecular biology.

[60]  Kimberly Van Auken,et al.  WormBase: better software, richer content , 2005, Nucleic Acids Res..

[61]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[62]  Alexander E. Kel,et al.  TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes , 2005, Nucleic Acids Res..

[63]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[64]  S. Shen-Orr,et al.  Network motifs in the transcriptional regulation network of Escherichia coli , 2002, Nature Genetics.

[65]  G. Church,et al.  Identifying regulatory networks by combinatorial analysis of promoter elements , 2001, Nature Genetics.

[66]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[67]  R. Milo,et al.  Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[68]  M. Vidal,et al.  Protein interaction mapping in C. elegans using proteins involved in vulval development. , 2000, Science.

[69]  The Chinese Human Genome Sequencing Consortium,et al.  Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana , 2000, Nature.

[70]  Francis D. Gibbons,et al.  Genomewide Identification of Sko1 Target Promoters Reveals a Regulatory Network That Operates in Response to Osmotic Stress in Saccharomyces cerevisiae , 2005, Eukaryotic Cell.

[71]  K Mayer,et al.  Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. , 2000, Nature.

[72]  K. Nakai,et al.  PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. , 1999, Trends in biochemical sciences.

[73]  Zhirong Sun,et al.  Support vector machine approach for protein subcellular localization prediction , 2001, Bioinform..

[74]  Cyrus Chothia,et al.  SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments , 2002, Nucleic Acids Res..

[75]  Martin Vingron,et al.  The SYSTERS Protein Family Database in 2005 , 2004, Nucleic Acids Res..

[76]  M. Vidal,et al.  Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". , 2001, Genome research.

[77]  Ram Samudrala,et al.  BIOVERSE: enhancements to the framework for structural, functional and contextual modeling of proteins and proteomes , 2005, Nucleic Acids Res..

[78]  Jason Gertz,et al.  Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics. , 2005, Genome research.

[79]  M. Gerstein,et al.  Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. , 2004, Genome research.

[80]  The Arabidopsis Genome Initiative Analysis of the genome sequence of the flowering plant Arabidopsis thaliana , 2000, Nature.

[81]  G. Hill,et al.  Function and evolution , 2006 .

[82]  Lewis Y. Geer,et al.  CDART: protein homology by domain architecture. , 2002, Genome research.

[83]  A. Fraser,et al.  A first-draft human protein-interaction map , 2004, Genome Biology.

[84]  Wei Zhu,et al.  The Institute for Genomic Research Osa1 Rice Genome Annotation Database , 2005, Plant Physiology.

[85]  Paul Shinn,et al.  Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana , 2000, Nature.

[86]  J. Collado-Vides,et al.  Identifying global regulators in transcriptional regulatory networks in bacteria. , 2003, Current opinion in microbiology.

[87]  Nan Guo,et al.  PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways , 2006, Nucleic Acids Res..

[88]  Michael McClelland,et al.  Identification of promoters bound by c-Jun/ATF2 during rapid large-scale gene activation following genotoxic stress. , 2004, Molecular cell.

[89]  D. Guhathakurta,et al.  Computational identification of transcriptional regulatory elements in DNA sequence , 2006, Nucleic acids research.

[90]  David Haussler,et al.  The UCSC genome browser database: update 2007 , 2006, Nucleic Acids Res..

[91]  Shmuel Pietrokovski,et al.  Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations , 1999, Bioinform..

[92]  Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome , 2002, Nature.

[93]  Ian M. Donaldson,et al.  The Biomolecular Interaction Network Database and related tools 2005 update , 2004, Nucleic Acids Res..

[94]  S. Reymann,et al.  Transcriptome profiling of human hepatocytes treated with Aroclor 1254 reveals transcription factor regulatory networks and clusters of regulated genes , 2006, BMC Genomics.

[95]  Terrence S. Furey,et al.  The UCSC Genome Browser Database: update 2006 , 2005, Nucleic Acids Res..

[96]  Christian A. Grove,et al.  A Gene-Centered C. elegans Protein-DNA Interaction Network , 2006, Cell.

[97]  Jonghwan Kim,et al.  Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment , 2005, Nature Methods.

[98]  Thomas L. Madden,et al.  BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. , 1999, FEMS microbiology letters.

[99]  Peer Bork,et al.  SMART 5: domains in the context of genomes and networks , 2005, Nucleic Acids Res..

[100]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[101]  Cathy H. Wu,et al.  The Universal Protein Resource (UniProt) , 2004, Nucleic Acids Res..

[102]  A. Nanji,et al.  Transcriptional networks in a rat model for nonalcoholic fatty liver disease: a microarray analysis. , 2006, Experimental and molecular pathology.

[103]  Terri K. Attwood,et al.  PRINTS and its automatic supplement, prePRINTS , 2003, Nucleic Acids Res..

[104]  Bing Ren,et al.  Direct isolation and identification of promoters in the human genome. , 2005, Genome research.

[105]  Huanming Yang,et al.  A Draft Sequence of the Rice Genome (Oryza sativa L. ssp. indica) , 2002, Science.

[106]  Sébastien Carrère,et al.  The ProDom database of protein domain families: more emphasis on 3D , 2004, Nucleic Acids Res..

[107]  Janet M Thornton,et al.  Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. , 2002, Journal of molecular biology.

[108]  Ting Chen,et al.  Network motif identification in stochastic networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[109]  Ian M. Donaldson,et al.  BIND: the Biomolecular Interaction Network Database , 2001, Nucleic Acids Res..

[110]  S. L. Wong,et al.  Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network , 2005, Journal of biology.