Sector and Sphere: the design and implementation of a high-performance data cloud

Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. By contrast with the existing storage and compute clouds, Sector can manage data not only within a data centre, but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports user-defined functions (UDFs) over data both within and across data centres. As a special case, MapReduce-style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort benchmark. In these studies, Sector is approximately twice as fast as Hadoop. Sector/Sphere is open source.

[1]  Robert L. Grossman,et al.  Data mining using high performance data clouds: experimental studies using sector and sphere , 2008, KDD.

[2]  Arie Shoshani,et al.  Deep scientific computing requires deep data , 2004, IBM J. Res. Dev..

[3]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[4]  Douglas Thain,et al.  Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..

[5]  Y. Chen [The change of serum alpha 1-antitrypsin level in patients with spontaneous pneumothorax]. , 1995, Zhonghua jie he he hu xi za zhi = Zhonghua jiehe he huxi zazhi = Chinese journal of tuberculosis and respiratory diseases.

[6]  Robert L. Grossman,et al.  UDT: UDP-based data transfer for high-speed wide area networks , 2007, Comput. Networks.

[7]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[8]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[9]  GhemawatSanjay,et al.  The Google file system , 2003 .

[10]  Gregor von Laszewski,et al.  Swift: Fast, Reliable, Loosely Coupled Parallel Computation , 2007, 2007 IEEE Congress on Services (Services 2007).

[11]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[12]  Anthony Skjellum,et al.  Using MPI: portable parallel programming with the message-passing interface, 2nd Edition , 1999, Scientific and engineering computation series.

[13]  Liang Chen,et al.  GATES: a grid-based middleware for processing distributed data streams , 2004, Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004..

[14]  Ian T. Foster,et al.  Globus Toolkit Version 4: Software for Service-Oriented Systems , 2005, Journal of Computer Science and Technology.

[15]  Philip S. Yu,et al.  CellSort: High Performance Sorting on the Cell Processor , 2007, VLDB.

[16]  Lauren Wood 技術解説 IEEE Internet Computing , 1999 .

[17]  Joel H. Saltz,et al.  DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems , 2000, IEEE Symposium on Mass Storage Systems.

[18]  Dinesh Manocha,et al.  GPUTeraSort: high performance graphics co-processor sorting for large database management , 2006, SIGMOD Conference.

[19]  Robert L. Grossman,et al.  Distributing the Sloan Digital Sky Survey Using UDT and Sector , 2006, 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06).

[20]  Bruce M. Maggs,et al.  Globally Distributed Content Delivery , 2002, IEEE Internet Comput..

[21]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).