Scaling Evolutionary Programming with the Use of Apache Spark

Organizations across the globe gather ever more data, encouraged by cheap and easy-to-use cloud storage services. Large datasets require new approaches to analysis and processing, including methods based on machine learning. In particular, symbolic regression can provide many useful insights. Unfortunately, because of its high resource requirements, using this method to analyze large-scale datasets may be infeasible. In this paper, we analyze a bottleneck in the open-source implementation of this method, which we call hubert. We identify the evaluation of individuals as the most costly operation. To address this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by executing them in a distributed manner on a cluster of machines. We analyze the performance of the service by comparing evaluation execution times for a number of samples under both implementations. Finally, we draw conclusions and outline plans for further research.
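
To illustrate the general idea of distributing the evaluation of individuals, the following is a minimal sketch of partitioned fitness evaluation: the dataset is split into chunks, each chunk is scored independently (a map step), and the partial errors are combined (a reduce step). It is not the hubert code or the proposed Spark service; the names `candidate_a`, `candidate_b`, `partial_errors`, and `evaluate` are hypothetical, the fitness measure (mean squared error) is an assumption, and a standard-library thread pool stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical individuals: each candidate expression is a plain callable of x.
def candidate_a(x):
    return 2.0 * x + 1.0

def candidate_b(x):
    return x * x

def partial_errors(individuals, chunk):
    """Map step: summed squared error of every individual on one data chunk."""
    return [sum((f(x) - y) ** 2 for x, y in chunk) for f in individuals]

def evaluate(individuals, dataset, n_chunks=4):
    """Split the dataset, score the chunks in parallel, then combine the results."""
    # Round-robin split; drop empty chunks when the dataset is very small.
    chunks = [dataset[i::n_chunks] for i in range(n_chunks)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: partial_errors(individuals, c), chunks))
    # Reduce step: add up the per-chunk errors for each individual.
    totals = [sum(column) for column in zip(*partials)]
    # Assumed fitness measure: mean squared error over the whole dataset.
    return [total / len(dataset) for total in totals]

# Toy data generated from y = 2x + 1, so candidate_a fits it exactly.
dataset = [(float(x), 2.0 * x + 1.0) for x in range(100)]
scores = evaluate([candidate_a, candidate_b], dataset)
```

Because each chunk is scored without reference to the others, the same split, score, and combine structure could in principle be expressed over the partitions of a distributed dataset, which is the kind of workload a framework such as Apache Spark is designed for.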
