HPC and the Big Data challenge

Abstract

High performance computing (HPC) and Big Data are technologies vital for advancement in science, business and industry. HPC combines the computing power of supercomputers and computer clusters with parallel and distributed processing techniques to solve complex computational problems. The term Big Data refers to the fact that more data are being produced, consumed and stored than ever before, resulting in datasets that are too large, complex and/or dynamic to be managed and analysed by traditional methods. Access to HPC systems and the ability to model, simulate and manipulate massive, dynamic data are now critical for research, business and innovation. This paper presents an overview of HPC and Big Data technologies. It outlines the advances in computer technology that enable petascale and exascale computing with improved energy efficiency, and the Big Data challenge of extracting meaning and new information from data. As an example of the synergy between HPC and Big Data in risk analysis, a case study of processing close-call data was conducted using HPC resources at the University of Huddersfield. A parallel program was designed and implemented on the university's Hadoop cluster to speed up the processing of unstructured, free-form text records describing close-call railway events, in order to identify potential risks and incidents. The case study demonstrates the benefits of using HPC with parallel programming techniques and the improvement achieved over serial processing on a standard workstation. However, it also highlights challenges in the risk analysis of Big Data that require novel approaches to HPC system and software design.
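To illustrate the kind of parallel processing described in the case study, the sketch below shows a minimal Hadoop MapReduce job in Java that scans free-form close-call records for a small set of risk-related terms and counts their occurrences across the dataset. The class names, keyword list and HDFS paths are illustrative assumptions only; this is not a reproduction of the program developed at the University of Huddersfield.

// Minimal sketch of a MapReduce keyword count over close-call text records.
// Class names, keyword list and input/output paths are illustrative assumptions,
// not the implementation described in the paper.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CloseCallKeywordCount {

    // Mapper: each input line is one free-form close-call record; emit (term, 1)
    // for every risk-related term found in the record.
    public static class RiskTermMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Hypothetical risk vocabulary; a real analysis would load this from a file.
        private static final List<String> RISK_TERMS =
                Arrays.asList("trespass", "overrun", "slip", "derail", "collision");

        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString().toLowerCase();
            for (String t : RISK_TERMS) {
                if (record.contains(t)) {
                    term.set(t);
                    context.write(term, ONE);
                }
            }
        }
    }

    // Reducer (also used as a combiner): sum the occurrences of each risk term.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "close-call risk term count");
        job.setJarByClass(CloseCallKeywordCount.class);
        job.setMapperClass(RiskTermMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory of records
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job of this form would typically be packaged as a JAR and submitted with hadoop jar, after which the cluster partitions the input records across mappers and aggregates the per-term counts in the reduce phase; this data-parallel decomposition is what yields the speed-up over serial processing on a single workstation.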