Using Memory in the Right Way to Accelerate Big Data Processing

Big data processing is becoming a major part of data center computation. However, recent research has indicated that big data workloads cannot make full use of modern memory systems. We find that the dramatic inefficiency of big data processing stems from the enormous number of cache misses and from stalls caused by dependent memory accesses. In this paper, we introduce two optimizations to tackle these problems. The first is the slice-and-merge strategy, which reduces the cache miss rate of the sort procedure. The second is direct-memory-access, which restructures the data structure used in key/value storage. These optimizations are evaluated with both micro-benchmarks and the real-world benchmark HiBench. The micro-benchmark results demonstrate the effectiveness of our optimizations in terms of hardware event counts, and the HiBench results show an average speedup of 1.21X at the application level. Both results illustrate that careful hardware/software co-design can improve the memory efficiency of big data processing. Our work has already been integrated into the Intel Distribution for Apache Hadoop.
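The abstract names the slice-and-merge strategy without giving details, so the following is only a minimal sketch of the general idea under stated assumptions, not the implementation evaluated in the paper. It assumes that records are (key, value) pairs, that each slice is sized so its working set fits in cache, and that sorted slices are combined with a k-way merge. The function name `slice_and_merge_sort` and the parameter `slice_size` are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

# Hypothetical record type: a (key, value) pair of byte strings.
Record = Tuple[bytes, bytes]


def slice_and_merge_sort(records: Sequence[Record], slice_size: int) -> List[Record]:
    """Sort records by key by sorting small slices, then merging them.

    Assumption: slice_size is chosen so that one slice fits in the CPU cache,
    which keeps the random accesses of each per-slice sort cache-resident.
    The final merge reads every sorted slice sequentially.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")

    # Step 1: cut the input into fixed-size slices and sort each one by key.
    slices = [
        sorted(records[start:start + slice_size], key=lambda rec: rec[0])
        for start in range(0, len(records), slice_size)
    ]

    # Step 2: k-way merge of the sorted slices, again ordered by key.
    return list(heapq.merge(*slices, key=lambda rec: rec[0]))
```

In this sketch the per-slice sort is the only phase with irregular access patterns, and it touches at most `slice_size` records at a time. A suitable slice size depends on record size and cache capacity, neither of which the abstract specifies.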

[18]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).