Genetic Programming over Spark for Higgs Boson Classification

As databases with very large numbers of records become more common, existing knowledge discovery tools must be adapted and new ones created. Genetic Programming (GP) has proven to be an efficient algorithm, particularly for classification problems. However, GP is hampered by its computing cost, which becomes more acute with large datasets. This paper presents how an existing GP implementation (DEAP) can be adapted by distributing evaluations on a Spark cluster. An additional sampling step is then applied so that the approach fits tiny clusters. Experiments are carried out on Higgs Boson classification with different settings. They show the benefits of using Spark as a parallelization technology for GP.
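The two ideas named above, evaluating individuals through a pluggable map function and scoring them on a sample of the training records, can be illustrated with a minimal sketch. It is not the paper's implementation: the helper names (`sample_rows`, `accuracy`, `evaluate_population`), the toy threshold classifiers, and the synthetic data are all hypothetical, and a standard-library thread pool stands in for the Spark cluster so the block runs without extra dependencies.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_rows(rows, fraction, rng):
    """Sampling step (hypothetical helper): keep a random fraction of the rows."""
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def accuracy(classifier, rows):
    """Fitness of one individual: share of rows it labels correctly."""
    hits = sum(1 for features, label in rows if classifier(features) == label)
    return hits / len(rows)


def evaluate_population(population, rows, map_fn=map):
    """Score every individual; map_fn decides where the work actually runs."""
    return list(map_fn(lambda ind: accuracy(ind, rows), population))


# Synthetic stand-in data (NOT the Higgs dataset): label is 1 when x > 0.5.
rng = random.Random(0)
data = [((x,), int(x > 0.5)) for x in (rng.random() for _ in range(1000))]

# Toy "individuals": threshold classifiers standing in for evolved GP trees.
population = [lambda f, t=t: int(f[0] > t) for t in (0.1, 0.5, 0.9)]

# Evaluate on a 10% sample, with a thread pool standing in for the cluster.
subset = sample_rows(data, 0.1, rng)
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = evaluate_population(population, subset, map_fn=pool.map)
```

Because the evaluation goes through `map_fn`, the thread pool could in principle be swapped for a distributed map, for example one built on a PySpark RDD, without changing the fitness function. How the paper actually partitions the work between the population and the data is not stated in the abstract, so that choice is left open here.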
