LGL: creating a map of protein function with an algorithm for visualizing very large biological networks.

Networks are proving to be central to the study of gene function, protein-protein interaction, and biochemical pathway data. Visualization of networks is important for their study, but visualization tools are often inadequate for working with very large biological networks. Here, we present an algorithm, called large graph layout (LGL), which can be used to dynamically visualize large networks on the order of hundreds of thousands of vertices and millions of edges. LGL applies a force-directed iterative layout guided by a minimal spanning tree of the network in order to generate coordinates for the vertices in two or three dimensions, which are subsequently visualized and interactively navigated with companion programs. We demonstrate the use of LGL in visualizing an extensive protein map summarizing the results of approximately 21 billion sequence comparisons between 145,579 proteins from 50 genomes. Proteins are positioned in the map according to sequence homology and gene fusions, with the map ultimately serving as a theoretical framework that integrates inferences about gene function derived from sequence homology, remote homology, gene fusions, and higher-order fusions. We confirm that protein neighbors in the resulting map are functionally related, and that distinct map regions correspond to distinct cellular systems, enabling a computational strategy for discovering proteins' functions on the basis of the proteins' map positions. Using the map produced by LGL, we infer general functions for 23 uncharacterized protein families.
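
The abstract describes the layout only at a high level: a force-directed iterative layout guided by a minimal spanning tree. The following Python sketch illustrates one way such a scheme could work and is not the LGL implementation. It assumes a connected, weighted, undirected graph, builds the spanning tree with Kruskal's algorithm, and then adds vertices to a 2D layout one tree level at a time, relaxing the placed vertices after each level. The function names, the force model (unit-length springs plus inverse-square repulsion), the choice of root, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict, deque


def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm; edges is a list of (weight, u, v) tuples."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree


def tree_levels(tree, root=0):
    """Breadth-first depth of every vertex in the spanning tree."""
    adj = defaultdict(list)
    for u, v in tree:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def relax(pos, placed, adj, iterations, step=0.05):
    """Naive O(n^2) force-directed relaxation of the placed vertices."""
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in placed}
        for i, a in enumerate(placed):
            for b in placed[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = 1.0 / (d * d)          # repulsion between every pair
                if b in adj[a]:
                    f -= d - 1.0           # spring pulling edges to length 1
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
                force[b][0] -= f * dx / d
                force[b][1] -= f * dy / d
        for v in placed:
            fx, fy = force[v]
            mag = math.hypot(fx, fy)
            if mag > 1.0:                  # cap the displacement per iteration
                fx, fy = fx / mag, fy / mag
            pos[v][0] += step * fx
            pos[v][1] += step * fy


def layout(n, edges, iterations=50, seed=0):
    """Grow a 2D layout level by level along the spanning tree (assumed scheme)."""
    rng = random.Random(seed)
    level = tree_levels(minimum_spanning_tree(n, edges))
    adj = defaultdict(set)
    for _, u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pos, placed = {}, []
    for depth in range(max(level.values()) + 1):
        new = [v for v in range(n) if level.get(v) == depth]
        for v in new:
            # Start each new vertex near an already placed neighbor, if any.
            anchors = [pos[u] for u in adj[v] if u in pos]
            ax, ay = anchors[0] if anchors else (0.0, 0.0)
            pos[v] = [ax + rng.uniform(-1, 1), ay + rng.uniform(-1, 1)]
        placed.extend(new)
        relax(pos, placed, adj, iterations)
    return pos
```

For example, `layout(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (1, 2, 3)])` returns a dictionary mapping each of the four vertices to an `[x, y]` coordinate pair. The all-pairs repulsion loop is quadratic in the number of placed vertices, so a sketch like this would not scale to the network sizes mentioned above without a spatial index or similar approximation.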
