High performance computing workflow for protein functional annotation

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the Protein Sequence Universe (PSU) is expanding exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed the rate of protein annotation. The volume of protein data makes manual curation infeasible, while high computational cost limits the utility of existing automated approaches. In this study, we built an automated workflow that uses High Performance Computing (HPC) architectures to enable large-scale annotation of proteins into existing orthologous groups. We developed a low-complexity classification algorithm to assign proteins to bacterial Clusters of Orthologous Groups of proteins (COGs). The algorithm, which is based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, which includes highly scalable parallel applications for classification and sequence alignment, was developed on Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. As the PSU continues to expand rapidly, the proposed workflow will enable scientists to annotate large-scale genome data.
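The validation criterion above (at least 80% specificity and sensitivity) can be illustrated with a minimal sketch. This is not the authors' validation code. It assumes a simple binary setup for a single orthologous group, where each protein either truly belongs to the group or does not, and the classifier either assigns it or does not. The function name `sensitivity_specificity`, the `THRESHOLD` constant, and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple

# Assumed threshold, taken from the "at least 80%" figure in the abstract.
THRESHOLD = 0.80


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary group assignments.

    predicted[i] is True if the classifier assigned protein i to the group;
    actual[i] is True if protein i truly belongs to the group.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted, actual):
        if truth and pred:
            tp += 1  # member, correctly assigned
        elif truth and not pred:
            fn += 1  # member, missed
        elif not truth and pred:
            fp += 1  # non-member, wrongly assigned
        else:
            tn += 1  # non-member, correctly rejected

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one member and one non-member")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Toy data (hypothetical): 5 true members and 5 non-members.
actual = [True, True, True, True, True, False, False, False, False, False]
predicted = [True, True, True, True, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(predicted, actual)
passes = sens >= THRESHOLD and spec >= THRESHOLD
print(f"sensitivity={sens:.2f} specificity={spec:.2f} passes={passes}")
```

In this toy case, four of five members are recovered and four of five non-members are rejected, so both measures sit exactly at the assumed 80% threshold.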
