Building a research data science platform from industrial machines

Data Science research has a long history in academia which spans from large-scale data management, to data mining and data analysis using technologies from database management systems (DBMS's). While traditional HPC offers tools on leveraging existing technologies with data processing needs, the large volume of data and the speed of data generation pose significant challenges. Using the Hadoop platform and tools built on top of it drew immense interest from academia after it gained success in industry. Georgia Institute of Technology received a donation of 200 compute nodes from Yahoo. Turning these industrial machines into a research Data Science Platform (DSP) poses unique challenges, such as: nontrivial hardware design decisions, configuration tool choices, node integration into existing HPC infrastructure, partitioning resource to meet different application needs, software stack choices, etc. We have 40 nodes up and running, 24 running as a Hadoop and Spark cluster, 12 running as a HBase and OpenTSDB cluster, the others running as service nodes. We successfully tested it against Spark Machine Learning algorithms using a 88GB image dataset, Spark DataFrame and GraphFrame with a Wikipedia dataset, and Hadoop MapReduce wordcount on a 300GB dataset. The OpenTSDB cluster is for real-time time series data ingestion and storage for sensor data. We are working on bringing up more nodes. We share our first-hand experience gained in our journey, which we believe will benefit and inspire other academic institutions.