Using machine learning to partition streaming programs

Stream-based parallel languages are a popular way to express parallelism in modern applications. The efficient mapping of streaming parallelism to today's multicore systems is, however, highly dependent on the program and the underlying architecture. We address this by developing a portable, automatic, compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned offline. Using this predictor, we rapidly search the program space (without executing any code) to generate and select a good partition. We applied the technique to standard StreamIt applications and compared it against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90× speedup over the already tuned partitioning scheme of the StreamIt compiler. Compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77× performance improvement. By porting our approach to an 8-core platform, we obtain a 1.8× improvement over the StreamIt default scheme, demonstrating its portability.
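To make the predictor-guided search concrete, below is a minimal, hypothetical sketch rather than the authors' implementation. It assumes each candidate partition of a stream graph can be summarised by a small feature vector (partition count, load imbalance, cut communication) and that a model trained offline on previously measured programs maps such features to a predicted speedup; the toy graph, the feature set, and the 1-nearest-neighbour predictor are all illustrative assumptions standing in for whatever features and model the real system uses.

```python
# Hypothetical sketch of predictor-guided partitioning: enumerate candidate
# partitions, score each with a model learned offline, and pick the best
# predicted one without executing any of the candidates.
from itertools import combinations
from math import dist

# Toy stream graph: per-filter work estimates and per-edge communication rates.
WORK = {"src": 10, "fir": 40, "fft": 120, "sink": 5}
COMM = {("src", "fir"): 8, ("fir", "fft"): 8, ("fft", "sink"): 4}

def features(partition):
    """Summarise a partition (tuple of filter sets) as a feature vector:
    number of partitions, load imbalance, and cut (cross-partition) traffic."""
    loads = [sum(WORK[f] for f in part) for part in partition]
    owner = {f: i for i, part in enumerate(partition) for f in part}
    cut = sum(c for (u, v), c in COMM.items() if owner[u] != owner[v])
    imbalance = max(loads) * len(loads) / sum(loads)
    return [len(partition), imbalance, cut]

def predict_speedup(x, training_set):
    """1-nearest-neighbour predictor standing in for the offline-trained model."""
    return min(training_set, key=lambda ex: dist(ex[0], x))[1]

# Offline 'training data': feature vectors paired with measured speedups
# (values here are made up purely for illustration).
TRAINING = [([2, 1.1, 8], 1.8), ([2, 2.5, 4], 1.2), ([3, 1.3, 12], 2.1)]

# Enumerate candidate two-way partitions and choose the best predicted one.
filters = list(WORK)
candidates = [(set(a), set(filters) - set(a))
              for r in range(1, len(filters))
              for a in combinations(filters, r)]
best = max(candidates, key=lambda p: predict_speedup(features(p), TRAINING))
print("chosen partition:", best)
```

The point of the sketch is the workflow, not the model: because scoring a candidate only requires cheap static features and a table lookup, thousands of partitions can be ranked in the compiler without compiling or running any of them.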
