Towards a Scalable Distributed Fitness Evaluation Service

Organizations across the globe gather more and more data. Large datasets require new approaches to analysis and processing, including methods based on machine learning. Symbolic regression, in particular, can provide many useful insights. Unfortunately, its high resource requirements can make it infeasible for large datasets. In this paper, we analyze a bottleneck in an open-source implementation of this method, which we call hubert. We identify the evaluation of individuals as the most costly operation. To address this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by distributing them across a cluster of machines. We assess the performance of the service by comparing the execution times of both implementations for a number of samples. We then discuss how the computation time improves as the amount of resources increases. Finally, we draw conclusions and outline plans for further research.