Visualizing the Protein Sequence Universe

Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the Protein Sequence Universe (PSU), the protein sequence space, together with the costs and complexities of manual curation, creates a major bottleneck in life sciences research. Existing resources lack the scalable visualization tools that are instrumental for functional annotation. Here, we describe a multi-dimensional scaling (MDS) implementation that creates a 3D embedding of the PSU and makes it possible to visualize the relationships between large numbers of proteins. To demonstrate the method, we use sequence similarity scores as a measure of proximity. An example of the prokaryotic PSU shows that the low-dimensional representation preserves important grouping features, such as the relative proximity of functionally similar clusters and a clear structural separation between clusters with specific and general functions. The advantages of the method and its implementation include the ability to scale to large numbers of sequences, to integrate different similarity measures with other functional and experimental data, and to facilitate protein annotation. Transdisciplinary approaches akin to the one described in this paper are urgently needed to quickly and efficiently translate the influx of new data into tangible innovations and groundbreaking discoveries.
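
To make the idea of embedding proteins in 3D from pairwise similarity scores concrete, the following is a minimal sketch of metric MDS by stress majorization (the SMACOF iteration) in plain NumPy. It is an illustration under stated assumptions, not the implementation described in this paper: the conversion from similarity to dissimilarity (one minus the normalized score), the unit weights, the function names, and the toy six-protein similarity matrix are all hypothetical choices made for the example.

```python
import numpy as np

def similarity_to_distance(S):
    """Assumed conversion: symmetrize, scale to [0, 1], use 1 - similarity."""
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0
    D = 1.0 - S / S.max()
    np.fill_diagonal(D, 0.0)
    return D

def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, D):
    """Sum of squared differences between embedded and target distances."""
    upper = np.triu_indices_from(D, k=1)
    return float(((pairwise_distances(X) - D)[upper] ** 2).sum())

def smacof(D, dim=3, n_iter=300, tol=1e-9, seed=0):
    """Metric MDS with unit weights via repeated Guttman transforms."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))      # random starting configuration
    history = [raw_stress(X, D)]
    for _ in range(n_iter):
        dist = pairwise_distances(X)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(dist > 0, D / dist, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n                      # Guttman transform
        history.append(raw_stress(X, D))
        if history[-2] - history[-1] < tol:
            break
    return X, history

# Toy input (hypothetical): two groups of three proteins, with high
# similarity inside each group and low similarity between groups.
S = np.full((6, 6), 0.1)
S[:3, :3] = 0.9
S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)

D = similarity_to_distance(S)
X, history = smacof(D, dim=3)
print(X.shape, history[0], history[-1])
```

Each row of `X` is a 3D coordinate for one protein, so proteins with high mutual similarity should end up close together. The dense all-against-all distance matrix used here grows quadratically with the number of sequences, so a sketch like this only illustrates the objective; scaling to a large sequence collection would need a parallel or interpolation-based approach.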
