Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce

We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this end, we transform each three-dimensional (3D) molecular conformation into a single point in 3D space, reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each 3D point, which represents a docked ligand, generate a tree from the whole space, and apply tree-based clustering to the reduced conformation space to identify the densest hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load balancing, fault tolerance, and scalability of MapReduce allow the screening of very large conformation datasets that are not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.
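The pipeline described above (reduce each conformation to one 3D point, give the point a tree identifier, then look for the most populated region) can be illustrated with a minimal Python sketch. It is not the implementation from the paper, and several pieces are assumptions made only for illustration: the centroid stands in for the geometry-encoding transformation, the tree is taken to be an octree, the map and reduce phases are plain in-process functions rather than Hadoop jobs, and all function names and parameters (`reduce_conformation`, `octant_key`, `densest_cell`, `min_points`, and so on) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, float, float]


def reduce_conformation(atoms: Sequence[Point]) -> Point:
    """Collapse one conformation to a single 3D point.

    Assumption: the centroid is used here as a simple stand-in; the
    abstract does not specify the actual transformation.
    """
    n = len(atoms)
    return (
        sum(a[0] for a in atoms) / n,
        sum(a[1] for a in atoms) / n,
        sum(a[2] for a in atoms) / n,
    )


def octant_key(p: Point, lo: Point, hi: Point, depth: int) -> str:
    """Identifier of the octree cell containing p: one digit (0-7) per level."""
    lo_l, hi_l = list(lo), list(hi)
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (lo_l[axis] + hi_l[axis]) / 2.0
            if p[axis] >= mid:
                digit |= 1 << axis   # upper half along this axis
                lo_l[axis] = mid
            else:
                hi_l[axis] = mid
        digits.append(str(digit))
    return "".join(digits)


def map_phase(points: Iterable[Point], lo: Point, hi: Point,
              depth: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (cell identifier, 1) for every reduced point."""
    for p in points:
        yield octant_key(p, lo, hi, depth), 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts emitted for each cell identifier."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def densest_cell(points: List[Point], lo: Point, hi: Point,
                 max_depth: int, min_points: int) -> Tuple[str, int]:
    """Most populated cell at the deepest level where some cell still
    holds at least min_points points. Each level is one linear pass."""
    best = ("", len(points))  # the root cell contains every point
    for depth in range(1, max_depth + 1):
        counts = reduce_phase(map_phase(points, lo, hi, depth))
        key, count = counts.most_common(1)[0]
        if count < min_points:
            break
        best = (key, count)
    return best
```

Because every point is mapped to its cell identifier independently and the counts are combined by a commutative sum, the two phases fit the MapReduce model directly. Each level of the search costs one pass over the points.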
