MapReduce Framework: Investigating Suitability for Faster Data Analytics

Faster data analytics is the ability to generate a desired report in near real time. Any application that examines an aggregated view of a data stream can be considered an analytic application. Processing vast amounts of data to reveal market trends, user behavior, fraud patterns, and the like is not just useful but critical to the success of a business. In the past few years, fast data, i.e., high-speed data streams, has also exploded in volume and availability. Prime examples include sensor data streams, real-time stock market data, and social-media feeds such as Twitter and Facebook. New models for distributed stream processing have evolved over time. This research investigates the suitability of Google’s MapReduce (MR) parallel programming framework for faster data processing. MapReduce systems were originally geared towards batch processing. This paper proposes optimizations to the original MR framework for faster distributed data processing applications: using distributed shared memory to store intermediate data, and using Remote Direct Memory Access (RDMA) technology for faster data transfer across the network.
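To make the idea of keeping intermediate data in memory more concrete, the following is a minimal single-process sketch in Python. It is not the implementation proposed in this paper: the names `map_event`, `InMemoryIntermediateStore`, and `process_stream` are hypothetical, a plain dictionary stands in for distributed shared memory, and RDMA transfer is not modeled at all. It only illustrates how folding map output into in-memory state lets an aggregated report be read after every event rather than after a batch job completes.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

KeyValue = Tuple[str, int]


def map_event(event: str) -> Iterator[KeyValue]:
    """Hypothetical map function: emit (word, 1) for each word in an event."""
    for word in event.lower().split():
        yield word, 1


class InMemoryIntermediateStore:
    """Assumed stand-in for distributed shared memory: one in-process dict.

    Each intermediate (key, value) pair is folded into the running
    aggregate with the reduce function instead of being spilled to disk.
    """

    def __init__(self, reduce_fn: Callable[[int, int], int]) -> None:
        self._reduce_fn = reduce_fn
        self._state: Dict[str, int] = {}

    def update(self, key: str, value: int) -> None:
        if key in self._state:
            self._state[key] = self._reduce_fn(self._state[key], value)
        else:
            self._state[key] = value

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of the current aggregated view."""
        return dict(self._state)


def process_stream(
    events: Iterable[str], store: InMemoryIntermediateStore
) -> Iterator[Dict[str, int]]:
    """Apply map to each event, reduce incrementally, yield the running report."""
    for event in events:
        for key, value in map_event(event):
            store.update(key, value)
        yield store.snapshot()


if __name__ == "__main__":
    store = InMemoryIntermediateStore(lambda a, b: a + b)
    for report in process_stream(["fraud alert", "fraud check"], store):
        print(report)
```

In a real deployment the store would be partitioned across nodes and the reduce function would need to be associative and commutative for incremental folding to match the batch result; both points are assumptions of this sketch rather than claims from the paper.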
