Selection of computational environments for PSP processing on scientific gateways

Science gateways are widely accepted as an important tool in academic research because they are flexible, easy to use, and easy to extend. However, such systems can hide performance traps that delay work, waste resources, or produce poor scientific results. This paper investigates some of the failures observed in a Galaxy system and analyzes their impact. The use case is based on protein structure prediction (PSP) experiments. A novel science gateway component is proposed to define the relation between general parameters and machine capacity. The machine-learning strategies used identify the best machine setup in a heterogeneous environment, and the results give an overview of Galaxy, of a diverse platform organization, and of the workload behavior. A Support Vector Regression (SVR) model trained on a historical data set proved an effective learning module and showed that a varied platform configuration is valuable as infrastructure for a science gateway. The results also revealed the advantages of investing in local cluster infrastructures as a base for scientific experiments.
