Storage and index support for data intensive web applications

This paper introduces DisGR (Distributed Graph Repository), a system designed and developed to support Chinese Web related research. DisGR is built on a graph data model, TGM (Tagged Graph Model), designed to represent Web data, especially forum and BBS data. It supports the query language TGM-L, which targets analytical tasks over TGM data. For high scalability and availability, DisGR is designed for clusters with a shared-nothing architecture. Its characteristics include column-based storage, descriptive language support, and flexible support for user-defined functions. DisGR differs from other database systems with a similar purpose in three respects. First, the catalog is maintained by a set of servers connected via a DHT overlay. Second, signatures of different granularities are used for data distribution and query optimization. Last but not least, updates are supported via timestamps and regular reorganization.
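The abstract does not say which DHT protocol or key scheme the catalog servers use. As an illustration only, the sketch below assumes a consistent-hashing ring of the kind common to many DHT overlays: each catalog entry is owned by the first server clockwise from the hash of its key. The names `CatalogRing`, `ring_position`, and `server_for`, the server labels, and the choice of SHA-1 are all hypothetical and are not taken from DisGR.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on a 2**32 ring (hash choice is an assumption)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class CatalogRing:
    """Hypothetical catalog placement: a key belongs to its clockwise successor."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one catalog server is required")
        # Sort servers by their ring position so lookups can use binary search.
        self._ring = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, catalog_key: str) -> str:
        """Return the server responsible for the given catalog key."""
        idx = bisect.bisect_left(self._positions, ring_position(catalog_key))
        # Wrap around to the first server when the key hashes past the last one.
        return self._ring[idx % len(self._ring)][1]


ring = CatalogRing(["catalog-0", "catalog-1", "catalog-2"])
owner = ring.server_for("forum_posts.schema")
```

Under this assumption, adding a catalog server only moves the entries that now fall to the new server; every other entry stays where it was, which is one reason a DHT-style layout suits a shared-nothing cluster.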
