A view of programming scalable data analysis: from clouds to exascale

Scalability is a key feature of big data analysis and machine learning frameworks, and of applications that must analyze very large and real-time data from data repositories, social media, sensor networks, smartphones, and the Web. Today, scalable big data analysis can be achieved through parallel implementations that exploit the computing and storage facilities of high performance computing (HPC) systems and clouds; in the near future, Exascale systems will be used to implement extreme-scale data analysis. This paper discusses how clouds currently support the development of scalable data mining solutions, and outlines and examines the main challenges that must be addressed to implement innovative data analysis applications on Exascale systems.
