Processing massive sized graphs using Sector/Sphere

Data-intensive computing is attracting increasing attention among computer science researchers. As data sizes grow even faster than Moore's Law, many traditional systems fail to cope with extremely large datasets. In this paper we use a real-world graph processing application to demonstrate the challenges posed by data-intensive computing, and we present a solution based on Sector/Sphere, a system we have developed over the last several years. Sector provides scalable, fault-tolerant storage on commodity computers, while Sphere supports in-storage parallel data processing through a simplified programming interface. This paper describes the rationale behind Sector/Sphere and shows how to use it to process massive graphs effectively.
