Co-scheduling Data and Task for a Data-Driven Distribution of Data-Intensive Applications

In distributed computing, data scheduling is becoming an important field of research with the emergence of Big Data. The high-level features provided by software data schedulers often rely on data management policies, possibly user-defined, such as fault tolerance, multi-protocol file transfer, reliable and multi-tenant storage, security and data privacy, and locality-aware data distribution. Executing data-intensive applications now requires such advanced features, which in turn requires data and task schedulers to cooperate closely. In this paper, we propose a data-driven cooperative platform that combines two existing middleware systems: XtremWeb-HEP as the task scheduler and BitDew as the data scheduler. Taking advantage of both, our solution allows users to select a suitable data scheduling strategy and an adequate task granularity, which together provide the optimal data distribution. To evaluate our approach, we compare different strategies for scheduling tasks and data, and demonstrate the efficiency of cooperation between data and task schedulers when executing data-intensive applications.
