Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Matrix multiplication is a dominant but very time-consuming operation in many big data analytics applications, so optimizing its performance is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. With existing execution strategies, once execution concurrency scales beyond a threshold, performance deteriorates quickly because the increase in IO cost outweighs the decrease in computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy achieves higher execution concurrency for sub-block matrix multiplication at the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations: increasing the concurrency of local data exchange by calling the native library in batches, reducing the overhead of block matrix transformation, and reducing heavy disk shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin, together with a set of related matrix operations, as a library on Spark and contributed it to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems, including Spark MLlib, SystemML, and SciDB, with about $1.29\times$, $3.53\times$, and $2.21\times$ speedup on average, respectively. An evaluation on a real-world DNN workload further indicates that Marlin outperforms these systems by about $12.8\times$, $5.1\times$, and $27.2\times$, respectively.
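To make the replication idea concrete, below is a minimal Scala/Spark sketch of replication-based block matrix multiplication. It is an illustration under stated assumptions, not Marlin's actual implementation: the Block class, the %*% operator, and the multiply helper are hypothetical names, and the per-block product uses a naive loop where a real system would call a native BLAS in batch.

```scala
import org.apache.spark.rdd.RDD

object CrmmSketch {

  // Hypothetical dense block in row-major layout; Marlin's real block
  // representation and native-library batching are more involved.
  case class Block(rows: Int, cols: Int, values: Array[Double]) {

    // Naive local GEMM; a real implementation would invoke a native
    // linear algebra library for each sub-block product.
    def %*%(other: Block): Block = {
      require(cols == other.rows, "inner block dimensions must match")
      val out = new Array[Double](rows * other.cols)
      for (i <- 0 until rows; k <- 0 until cols; j <- 0 until other.cols)
        out(i * other.cols + j) +=
          values(i * cols + k) * other.values(k * other.cols + j)
      Block(rows, other.cols, out)
    }

    // Element-wise addition, used to merge partial products of C(i, j).
    def +(other: Block): Block =
      Block(rows, cols, values.zip(other.values).map { case (x, y) => x + y })
  }

  // Replication-based block multiply: A is an mBlocks x lBlocks grid
  // keyed by (i, k), B is an lBlocks x nBlocks grid keyed by (k, j).
  // Each A block is replicated once per column block of B, and each B
  // block once per row block of A, so every product A(i,k) * B(k,j)
  // runs as an independent task; partial products are then summed per
  // result coordinate (i, j).
  def multiply(a: RDD[((Int, Int), Block)],
               b: RDD[((Int, Int), Block)],
               mBlocks: Int,
               nBlocks: Int): RDD[((Int, Int), Block)] = {
    val aRep = a.flatMap { case ((i, k), blk) =>
      (0 until nBlocks).map(j => ((i, k, j), blk))
    }
    val bRep = b.flatMap { case ((k, j), blk) =>
      (0 until mBlocks).map(i => ((i, k, j), blk))
    }
    aRep.join(bRep)                                            // one shuffle co-locates block pairs
      .map { case ((i, _, j), (ab, bb)) => ((i, j), ab %*% bb) } // concurrent sub-block GEMMs
      .reduceByKey(_ + _)                                      // sum partials into C(i, j)
  }
}
```

With this keying, the number of independent multiplication tasks equals the number of sub-block products (mBlocks x lBlocks x nBlocks) rather than the number of output blocks, illustrating how replication can raise execution concurrency without increasing shuffle volume per task.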
