Large-scale virtual screening on public cloud resources with Apache Spark

Abstract

Background: Structure-based virtual screening is an in silico method for screening a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it is a trivially parallelizable task. Most available parallel implementations are based on the Message Passing Interface (MPI) and rely on hardware with low failure rates and fast network connections. Google’s MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources while providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark.

Results: We developed a method to run existing docking-based screening software on distributed cloud resources using the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, by docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show good parallel efficiency (87%) when running in a public cloud environment.

Conclusion: Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve makes it possible to try out our method on relatively small libraries first and then scale to larger ones. Our implementation is named Spark-VS, and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
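To make the MapReduce idea concrete, the following is a minimal single-machine sketch in plain Python, not the Spark-VS implementation. It assumes the library is SDF text whose records end with `$$$$`, that higher scores are better, and that only the k best molecules are kept. All function names are hypothetical, and `mock_dock` is a placeholder for a call to an external docking program.

```python
import heapq
from functools import reduce


def split_records(sdf_text):
    """Split SDF text into per-molecule records (each record ends with '$$$$')."""
    records = [r.strip() for r in sdf_text.split("$$$$")]
    return [r for r in records if r]


def partition(items, n_parts):
    """Distribute items round-robin over n_parts partitions."""
    return [items[i::n_parts] for i in range(n_parts)]


def mock_dock(record):
    """Hypothetical stand-in for an external docking program.

    Returns (score, molecule_name); the score is simply the record length.
    """
    name = record.splitlines()[0]
    return (float(len(record)), name)


def map_partition(records, k):
    """Map step: score every molecule in one partition, keep the k best."""
    return heapq.nlargest(k, (mock_dock(r) for r in records))


def merge_top_k(a, b, k):
    """Reduce step: merge two partial top-k lists into one."""
    return heapq.nlargest(k, a + b)


def screen(sdf_text, n_parts=2, k=2):
    """Run the map and reduce steps sequentially over all partitions."""
    parts = partition(split_records(sdf_text), n_parts)
    mapped = [map_partition(p, k) for p in parts]
    return reduce(lambda a, b: merge_top_k(a, b, k), mapped, [])


library = "mol_a\nCCO\n$$$$\nmol_b\nCCCCCC\n$$$$\nmol_c\nC\n$$$$\n"
top = screen(library, n_parts=2, k=2)
print(top)
```

Because each partition is scored independently and the merge is associative, the map step could run on separate workers without changing the result, which is the property that makes docking-based screening trivially parallelizable.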
