Lessons learned from a year's worth of benchmarks of large data clouds

In this paper, we discuss some of the lessons we have learned from working with the Hadoop and Sector/Sphere systems. Both are cloud-based systems designed to support data-intensive computing, and both pair a distributed file system with a closely coupled system for processing data in parallel. Hadoop uses MapReduce, while Sphere can execute an arbitrary user-defined function over the data managed by Sector. We compare and contrast these systems and discuss some of the design trade-offs necessary in data-intensive computing. In our experimental studies over the past year, Sector/Sphere has consistently performed about 2 to 4 times faster than Hadoop. We discuss some of the reasons that might be responsible for this difference in performance.
