Automating Characterization Deployment in Distributed Data Stream Management Systems

Distributed data stream management systems (DDSMS) are usually composed of an upper-layer relational query system (RQS) and a lower-layer stream processing system (SPS). When a user submits a new query to the RQS, the query plan must be converted into a directed acyclic graph (DAG) of tasks that run on the SPS. Depending on the query requests and data stream properties, the SPS needs a different deployment strategy. However, dynamically predicting deployment configurations of the SPS that sustain processing throughput while keeping resource usage low is a great challenge. This article presents OrientStream, a framework for automating characterization deployment in DDSMS using incremental machine learning techniques. Using a four-level feature extraction mechanism (data level, query plan level, operator level, and cluster level), we first use different query workloads as training sets to predict the resource usage of the DDSMS, then select the optimal resource configuration from candidate settings based on the current query requests and stream properties, and finally migrate operator state through dynamic reconfiguration. We validate our approach on the open-source SPS Storm. For application scenarios with long monitoring cycles and infrequent data fluctuation, experiments show that OrientStream can reduce CPU usage by 8-15 percent and memory usage by 38-48 percent.
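
The predict-then-select step described above can be illustrated with a minimal Python sketch. It is not OrientStream's implementation: the incremental model here is a simple per-configuration running mean standing in for the incremental learners the article refers to, and the configuration knobs `(workers, executors)`, the names `RunningMeanPredictor` and `select_config`, and the weighted CPU/memory cost are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical deployment knobs: (workers, executors). Not taken from the article.
Config = Tuple[int, int]


@dataclass
class RunningMeanPredictor:
    """Toy incremental model: keeps a running mean of the observed
    (throughput, cpu, memory) triple for each configuration.
    A stand-in for a real online learner, used only for illustration."""
    counts: Dict[Config, int] = field(default_factory=dict)
    means: Dict[Config, List[float]] = field(default_factory=dict)

    def update(self, config: Config, observation: Sequence[float]) -> None:
        # Incremental mean update: no past observations are stored.
        n = self.counts.get(config, 0) + 1
        mean = self.means.get(config, [0.0, 0.0, 0.0])
        self.means[config] = [m + (x - m) / n for m, x in zip(mean, observation)]
        self.counts[config] = n

    def predict(self, config: Config) -> Optional[List[float]]:
        # Returns None for configurations that were never observed.
        return self.means.get(config)


def select_config(model: RunningMeanPredictor,
                  candidates: Sequence[Config],
                  min_throughput: float,
                  cpu_weight: float = 0.5) -> Optional[Config]:
    """Pick the candidate with the lowest predicted resource cost among
    those predicted to meet the throughput requirement (assumed cost:
    a weighted sum of CPU and memory usage)."""
    best, best_cost = None, float("inf")
    for config in candidates:
        pred = model.predict(config)
        if pred is None:
            continue
        throughput, cpu, mem = pred
        if throughput < min_throughput:
            continue
        cost = cpu_weight * cpu + (1.0 - cpu_weight) * mem
        if cost < best_cost:
            best, best_cost = config, cost
    return best


# Made-up observations: (tuples/s, CPU fraction, memory fraction).
model = RunningMeanPredictor()
model.update((1, 2), (400.0, 0.20, 0.30))
model.update((2, 4), (900.0, 0.40, 0.50))
model.update((2, 4), (1100.0, 0.50, 0.60))
model.update((4, 8), (2000.0, 0.80, 0.90))

chosen = select_config(model, [(1, 2), (2, 4), (4, 8)], min_throughput=800.0)
print(chosen)
```

In this sketch the smallest configuration is rejected because its predicted throughput is too low, and the cheaper of the two remaining candidates is chosen. A real system would predict from workload features rather than from configuration identity alone.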
