Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics

While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict can uncover the underlying function. However, threading tools are generally compute-intensive, and even small genomes such as those of prokaryotes typically encode many thousands of protein sequences, which prohibits the use of threading as a genome-wide structural systems biology tool. To make threading practical at this scale, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that uses computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data- and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present a runtime analysis that characterizes the computational complexity of eThread and the EC2 infrastructure. Based on these results, we suggest a pathway to a solution optimized with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such as those of prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
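The idea of selecting resources by task requirements and then comparing time-to-solution against cost-to-solution can be illustrated with a minimal sketch. This is not the eThread pipeline or the SAGA-Pilot API: the `InstanceType` and `Task` classes, the instance names, and all prices and runtimes are hypothetical placeholders, and the model assumes whole-hour billing and near-ideal packing of independent tasks onto cores.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str              # hypothetical label, not a real EC2 instance name
    cores: int
    memory_gb: float
    price_per_hour: float  # placeholder price, not real EC2 pricing


@dataclass(frozen=True)
class Task:
    sequence_id: str
    memory_gb: float   # peak memory needed by one threading task
    cpu_hours: float   # estimated single-core runtime


def estimate(tasks, itype, n_instances):
    """Return (time_hours, cost), or None if any task does not fit in memory."""
    if any(t.memory_gb > itype.memory_gb for t in tasks):
        return None
    total_cpu_hours = sum(t.cpu_hours for t in tasks)
    slots = itype.cores * n_instances
    # Idealized makespan: work spread evenly over all cores, but never
    # shorter than the longest single task.
    time_hours = max(total_cpu_hours / slots, max(t.cpu_hours for t in tasks))
    # Assumption: every instance is billed in whole hours for the full run.
    cost = math.ceil(time_hours) * n_instances * itype.price_per_hour
    return time_hours, cost


def select(tasks, catalog, n_instances, metric):
    """Pick the feasible instance type minimizing 'time' or 'cost'."""
    index = {"time": 0, "cost": 1}[metric]
    feasible = []
    for itype in catalog:
        result = estimate(tasks, itype, n_instances)
        if result is not None:
            feasible.append((result[index], itype.name))
    if not feasible:
        raise ValueError("no instance type satisfies the task requirements")
    return min(feasible)[1]


catalog = [
    InstanceType("small", cores=2, memory_gb=4.0, price_per_hour=0.10),
    InstanceType("large", cores=8, memory_gb=32.0, price_per_hour=0.50),
]
tasks = [Task(f"seq{i}", memory_gb=2.0, cpu_hours=2.0) for i in range(4)]

print(select(tasks, catalog, n_instances=1, metric="cost"))
print(select(tasks, catalog, n_instances=1, metric="time"))
```

With these placeholder numbers the two metrics favor different instance types, which is the kind of trade-off a runtime analysis is meant to expose. A single memory-hungry sequence can also rule out the cheaper type altogether.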
