Data Analysis for Massively Distributed Simulations

Increases in computing power permit higher-fidelity simulations, and fast networking allows large clusters of high-performance computing resources, often distributed across wide geographic areas, to be brought to bear on them. Higher fidelity has correspondingly increased the volume of data that simulations can generate. Coordinating distant computing resources and making sense of this mass of data are problems that must be addressed: unless the data are analyzed and converted into information, simulations provide no useful knowledge. This paper reports on experiments using distributed analysis, particularly the Apache Hadoop framework, to address these analysis issues, and suggests directions for enhancing analysis capabilities so that they keep pace with the data-generating capabilities of modern simulation environments. Hadoop provides a scalable but conceptually simple distributed computation paradigm based on map/reduce operations implemented over a highly parallel, distributed filesystem. We developed map/reduce implementations of the K-Means and Expectation-Maximization data mining algorithms that take advantage of the Hadoop framework. The Hadoop filesystem dramatically reduces the disk scan time needed by these iterative data mining algorithms. We ran these algorithms across multiple Linux clusters over specially reserved high-speed networks. The results of these experiments point to potential enhancements for Hadoop and other analysis tools.
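
To illustrate how an iterative algorithm such as K-Means can be cast as map/reduce operations, the following is a minimal in-memory sketch in Python. It is not the implementation used in the experiments and does not call any Hadoop API; the function names (nearest, map_phase, reduce_phase, kmeans) and the max_iter parameter are hypothetical and chosen only for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid. Every iteration rescans the full data set, which is why scan time matters for algorithms of this kind.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points: Iterable[Point], centroids: List[Point]) -> Iterator[Tuple[int, Point]]:
    """Map step: emit a (centroid index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs: Iterable[Tuple[int, Point]], centroids: List[Point]) -> List[Point]:
    """Reduce step: average the points emitted under each centroid index."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for idx, point in pairs:
        if idx not in sums:
            sums[idx] = list(point)
        else:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        counts[idx] += 1
    # A centroid that received no points keeps its previous position.
    updated = list(centroids)
    for idx, total in sums.items():
        updated[idx] = tuple(s / counts[idx] for s in total)
    return updated


def kmeans(points: List[Point], centroids: List[Point], max_iter: int = 20) -> List[Point]:
    """Repeat map and reduce until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In a distributed setting the map step would run in parallel over partitions of the data and the pairs would be grouped by centroid index before reduction; this sketch performs both steps sequentially in a single process.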