Paper information - High performance computing workflow for protein functional annotation

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the PSU (Protein Sequence Universe) is expanding exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow to enable large-scale protein annotation into existing orthologous groups using HPC (High Performance Computing) architectures. We developed a low-complexity classification algorithm to assign proteins to bacterial COGs (Clusters of Orthologous Groups of proteins). The algorithm, based on PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool), was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, with highly scalable parallel applications for classification and sequence alignment, was developed on XSEDE (Extreme Science and Engineering Discovery Environment) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
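The validation criterion mentioned in the abstract (at least 80% specificity and sensitivity) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the per-group counting scheme, and the toy labels are assumptions made for illustration, and only the 80% threshold comes from the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Confusion:
    """Counts for one orthologous group, treated as a binary decision."""
    tp: int  # proteins in the group that were assigned to it
    fp: int  # proteins outside the group that were assigned to it
    tn: int  # proteins outside the group that were not assigned to it
    fn: int  # proteins in the group that were not assigned to it


def confusion_for_group(true_labels: Sequence[Optional[str]],
                        predicted_labels: Sequence[Optional[str]],
                        group: str) -> Confusion:
    """Tally outcomes for `group`; a label of None means 'no group'."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == group and pred == group:
            tp += 1
        elif truth != group and pred == group:
            fp += 1
        elif truth == group and pred != group:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def sensitivity(c: Confusion) -> float:
    """Fraction of true members recovered: TP / (TP + FN)."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0  # 0.0 when undefined (assumption)


def specificity(c: Confusion) -> float:
    """Fraction of non-members correctly rejected: TN / (TN + FP)."""
    total = c.tn + c.fp
    return c.tn / total if total else 0.0  # 0.0 when undefined (assumption)


def meets_threshold(c: Confusion, threshold: float = 0.80) -> bool:
    """True when both metrics reach the threshold (80% in the abstract)."""
    return sensitivity(c) >= threshold and specificity(c) >= threshold


if __name__ == "__main__":
    # Toy labels, purely hypothetical; not data from the paper.
    truth = ["COG1", "COG1", "COG2", "COG2", "COG2", None]
    preds = ["COG1", "COG2", "COG2", "COG2", "COG1", None]
    c = confusion_for_group(truth, preds, "COG1")
    print(sensitivity(c), specificity(c), meets_threshold(c))
```

Treating each group as a one-versus-rest decision is only one way to score a multi-class assignment. The abstract does not say how the metrics were aggregated across COGs.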
