CodHoop: A system for optimizing big data processing

The rise of the cloud and of distributed, data-intensive (“Big Data”) applications puts pressure on data center networks, because these applications move massive volumes of data. This paper proposes CodHoop, a system that employs network coding techniques, specifically index coding, to achieve a dynamically controlled reduction in the volume of communication. A motivating use case is presented, with Hadoop serving as a representative of this class of applications. Results from the proof-of-concept implementation show an average advantage of 31% over a vanilla Hadoop implementation. Depending on the use case, this translates to 31% lower energy use by the equipment, 31% more jobs running simultaneously, or a 31% decrease in job completion time.
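To make the index coding idea concrete, the following minimal sketch shows the textbook two-receiver case: each node already holds the block the other node wants, so a single XOR-coded broadcast replaces two separate transmissions. The block contents, node roles, and the helper `xor_bytes` are illustrative assumptions, not part of CodHoop or Hadoop, and the 50% saving of this toy case is unrelated to the 31% average reported above.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# Illustrative data blocks; names and contents are assumptions.
block_a = b"map-output-A"
block_b = b"map-output-B"

# Node 1 wants block A and already holds block B as side information.
# Node 2 wants block B and already holds block A as side information.
# The sender broadcasts one coded block instead of two plain ones.
coded = xor_bytes(block_a, block_b)

# Each node decodes by XORing the broadcast with its side information.
recovered_by_node_1 = xor_bytes(coded, block_b)
recovered_by_node_2 = xor_bytes(coded, block_a)

# Bytes on the wire with and without coding in this toy case.
uncoded_bytes = len(block_a) + len(block_b)
coded_bytes = len(coded)
```

The scheme only helps when receivers already hold useful side information, which is why deciding what to code together is the hard part of the index coding problem.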
