Rafiki: a middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads

High-performance computing (HPC) applications, such as metagenomics and other big data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL datastores, and optimizing these datastores is challenging: Cassandra alone exposes over 50 configuration parameters. As an application executes, its database workload can shift rapidly from read-heavy to write-heavy, and a system tuned with a read-optimized configuration becomes suboptimal once the workload turns write-heavy. In this paper, we present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the significance of configuration parameters using ANOVA. Next, we train neural networks on the most significant parameters and their workload-dependent mapping to predict database throughput, yielding a surrogate model. Then, we optimize the configuration by running genetic algorithms on the surrogate to maximize workload-dependent performance. Using this methodology in our system (Rafiki), we can predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9% to 7.8% for ScyllaDB. Searching the configuration spaces using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration for a read-heavy workload, as well as significant improvements for mixed workloads. In terms of search speed, Rafiki uses only 1/10,000th of the time of an exhaustive search and reaches within 15% and 9.5% of the theoretically best achievable performance for Cassandra and ScyllaDB, respectively, which supports optimization for highly dynamic workloads.
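The final step of the pipeline, a genetic search over a surrogate model, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the parameter names and ranges in `SEARCH_SPACE` are hypothetical, `surrogate_throughput` is a toy analytic function standing in for a trained neural network, and the operators (uniform crossover, random-reset mutation, elitism) are generic choices, not necessarily those used in Rafiki.

```python
import random

# Hypothetical tunable parameters with integer ranges (illustrative only,
# not the parameters selected by the ANOVA step in the paper).
SEARCH_SPACE = {
    "write_threads": (8, 128),
    "compaction_rate_mb": (8, 256),
    "cache_size_mb": (64, 4096),
}


def surrogate_throughput(config, read_ratio):
    """Toy stand-in for a trained surrogate; returns a score where higher is better.

    Assumption: the best normalised value of each parameter shifts with the
    fraction of reads in the workload. A real surrogate would be a model
    fitted to benchmark measurements.
    """
    score = 0.0
    for name, (lo, hi) in SEARCH_SPACE.items():
        x = (config[name] - lo) / (hi - lo)  # normalise to [0, 1]
        target = 1.0 - read_ratio if name == "write_threads" else read_ratio
        score -= (x - target) ** 2
    return score


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {n: rng.randint(lo, hi) for n, (lo, hi) in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Random-reset mutation: resample each parameter with probability `rate`."""
    out = dict(config)
    for n, (lo, hi) in SEARCH_SPACE.items():
        if rng.random() < rate:
            out[n] = rng.randint(lo, hi)
    return out


def genetic_search(read_ratio, generations=40, pop_size=30, elite=5, seed=0):
    """Search for the configuration that maximises the surrogate score."""
    rng = random.Random(seed)

    def fitness(config):
        return surrogate_throughput(config, read_ratio)

    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]  # elitism: the best survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = genetic_search(read_ratio=0.9)
    print(best, surrogate_throughput(best, 0.9))
```

Because each fitness evaluation queries the surrogate rather than benchmarking a live datastore, the search can be rerun cheaply whenever the observed read/write mix changes, which is the property the abstract relies on for dynamic workloads.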
