Visualizing the Protein Sequence Universe

Modern biology is experiencing a rapid increase in data volumes that challenges both our analytical skills and existing cyberinfrastructure. The exponential expansion of the protein sequence universe (PSU), that is, the space of all protein sequences, together with the cost and complexity of manual curation, creates a major bottleneck in life sciences research. Existing resources lack the scalable visualization tools that are instrumental for functional annotation. Here, we describe a new visualization tool that uses multidimensional scaling to create a 3D embedding of the protein space. The advantages of the proposed PSU method include the ability to scale to large numbers of sequences, to integrate different similarity measures with other functional and experimental data, and to facilitate protein annotation. We applied the method to visualize the prokaryotic PSU using sequence alignment scores. As an annotation example, we used the interpolation approach to map the set of annotated archaeal proteins into the prokaryotic PSU. Transdisciplinary approaches akin to the one described in this paper are urgently needed to translate the influx of new data quickly and efficiently into tangible innovations and groundbreaking discoveries. Copyright © 2013 John Wiley & Sons, Ltd.
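To make the idea of a multidimensional-scaling embedding concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the pairwise similarity scores have already been converted into a symmetric dissimilarity matrix, and it minimizes the unweighted stress with the standard SMACOF (majorization) update. The function names, the toy matrix, and the parameters (`iterations`, `seed`) are hypothetical and chosen only for illustration.

```python
import math
import random


def distances(coords):
    """Pairwise Euclidean distances between embedded points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def stress(target, coords):
    """Sum of squared differences between target and embedded distances."""
    d = distances(coords)
    n = len(coords)
    return sum((target[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof_step(target, coords):
    """One Guttman-transform update for unweighted stress."""
    n, dim = len(coords), len(coords[0])
    d = distances(coords)
    new = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or d[i][j] == 0.0:
                continue  # coincident points contribute nothing to the update
            ratio = target[i][j] / d[i][j]
            for k in range(dim):
                new[i][k] += ratio * (coords[i][k] - coords[j][k])
        for k in range(dim):
            new[i][k] /= n
    return new


def mds_embed(target, dim=3, iterations=200, seed=0):
    """Embed items in `dim` dimensions from a symmetric dissimilarity matrix."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in target]
    for _ in range(iterations):
        coords = smacof_step(target, coords)
    return coords


# Hypothetical dissimilarities for four sequences forming two loose groups.
toy_dissimilarity = [
    [0.0, 1.0, 4.0, 4.2],
    [1.0, 0.0, 3.8, 4.0],
    [4.0, 3.8, 0.0, 0.9],
    [4.2, 4.0, 0.9, 0.0],
]
embedding = mds_embed(toy_dissimilarity)
```

Each row of `embedding` is a 3D coordinate that can be passed to any point-cloud viewer. The sketch computes all pairwise distances at every step, so it illustrates only the objective being minimized and says nothing about how the method scales to large numbers of sequences.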
