Data intensive analysis on the gordon high performance data and compute system

The Gordon data intensive computing system was designed to handle problems with large memory requirements that cannot easily be solved using standard workstations or distributed memory supercomputers. We describe the unique features of Gordon that make it ideally suited for data mining and knowledge discovery applications: memory aggregation using the vSMP software solution from ScaleMP, I/O nodes containing 4 TB of low-latency flash memory, and a high performance parallel file system with 4 PB capacity. We also demonstrate how a number of standard data mining tools (e.g. Matlab, WEKA, R) can be used effectively on Dash, an early prototype of the full Gordon system.