Optimizing high performance computing workflow for protein functional annotation

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation
