A Methodology for Handling Data Movements by Anticipation: Position Paper

The enhanced capabilities of large-scale parallel and distributed platforms produce a continuously increasing amount of data, which has to be stored, exchanged and used by various tasks allocated on different nodes of the system. Managing such a huge communication demand is crucial for reaching the best possible performance of the system. Meanwhile, interference is growing, since the trend is to use a single all-purpose interconnection network regardless of the type of interconnect (tree-based hierarchies or topology-based heterarchies). There are two different types of communications, namely the flows induced by data exchanges during the computations and the flows related to Input/Output operations. In this paper we propose a general model for interference-aware scheduling, in which explicit communications are replaced by external topological constraints. Specifically, the interference of both communication types is reduced by adding geometric constraints on the allocation of tasks to machines. The proposed constraints implicitly reduce data movements by restricting the set of possible allocations for each task. A recent study showed this methodology to be efficient on a restricted interconnection network (a line/ring of processors, which is intermediate between a tree and higher-dimensional grids/tori). The results obtained there illustrated the difficulty of the problem even on simple topologies, but also provided a pragmatic greedy solution, which simulations assessed to be efficient. We are currently extending this solution to more complex topologies. This work is a position paper that describes the methodology; it does not focus on the solving part.
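As an illustration of how a geometric constraint restricts the set of possible allocations, the following minimal sketch enumerates the allocations allowed for one task on a line or ring of processors when the task must occupy a contiguous interval. Contiguity is only one assumed example of such a constraint; the function name `contiguous_allocations`, its parameters, and the boolean `free` map are hypothetical and are not taken from the paper, and the sketch is not the greedy solution mentioned above.

```python
from typing import List, Tuple


def contiguous_allocations(
    free: List[bool], q: int, ring: bool = False
) -> List[Tuple[int, ...]]:
    """Enumerate allocations of q free processors that form one interval.

    free[i] is True when processor i is idle (hypothetical input format).
    With ring=True the interval may wrap around from the last processor
    to the first one.
    """
    m = len(free)
    if q <= 0 or q > m:
        return []
    # On a ring every processor can start an interval, except when the
    # interval covers the whole ring (all starts give the same set).
    starts = range(m) if ring and q < m else range(m - q + 1)
    allocations = []
    for s in starts:
        window = tuple((s + k) % m for k in range(q))
        if all(free[p] for p in window):
            allocations.append(window)
    return allocations


# Six processors, processor 2 is busy, the task needs two processors.
free = [True, True, False, True, True, True]
line_allocs = contiguous_allocations(free, 2)
ring_allocs = contiguous_allocations(free, 2, ring=True)
print(line_allocs)
print(ring_allocs)
```

In this example, five processors are free, so an unconstrained scheduler could choose among 10 pairs of processors, whereas the contiguity constraint leaves three candidate allocations on the line and four on the ring. The restriction is what implicitly limits data movements: the task's processors are neighbours, so its traffic does not cross links used by other tasks.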
