On Processing Extreme Data

Extreme Data is an incarnation of the Big Data concept, distinguished by massive amounts of data that must be queried, communicated, and analyzed in near real time using a very large number of memory or storage elements and exascale computing systems. Immediate examples include scientific data produced at rates of hundreds of gigabits per second that must be stored, filtered, and analyzed; millions of images per day that must be analyzed in parallel; and one billion social data posts queried in real time on an in-memory components database. Today's traditional disks and commercial storage cannot handle the extreme scale of such application data. Responding to the need to improve current concepts and technologies, this paper focuses on the needs of data-intensive applications running on systems composed of up to millions of computing elements (exascale systems) and proposes a methodology to advance the state of the art. The starting point is the definition of new programming paradigms, APIs, runtime tools, and methodologies for expressing data-intensive tasks on exascale systems. This will pave the way for exploiting massive parallelism over a simplified model of the system architecture, promoting high performance and efficiency and offering powerful operations and mechanisms for processing extreme data sources at high speed and/or in real time.
