The Art of Scheduling for Big Data Science

Many applications generate big data: social networking and social influence programs, cloud applications, public websites, scientific experiments and simulations, data warehouses, monitoring platforms, and e-government services. Data volumes grow rapidly because these applications continuously produce increasing amounts of both structured and unstructured data. As a result, the approaches and solutions used for data processing, transfer, and storage must be reevaluated to better meet user needs. Scheduling models and algorithms play an important role in this context. Because a large variety of solutions exists for specific applications and platforms, a thorough and systematic analysis of the scheduling models, methods, and algorithms used in big data processing and storage environments is highly important. This chapter presents the best of the existing solutions and gives an overview of current and near-future trends. From a research perspective, it highlights the performance and limitations of existing solutions, and it offers scientists from academia and designers from industry an overview of the current state of scheduling and resource management for big data processing.
