Paper information - Storage and index support for data intensive web applications

This paper introduces DisGR (Distributed Graph Repository), a system designed and developed to support Chinese Web related research. The system is built on a graph data model, TGM (Tagged Graph Model), designed for representing Web data, especially forum and BBS data. DisGR supports the query language TGM-L, which targets analytical tasks over TGM data. For high scalability and availability, DisGR is designed for clusters with a shared-nothing architecture. Its characteristics include column-based storage, descriptive language support, and flexible support for user-defined functions. DisGR differs from other database systems with a similar purpose in three respects. First, the catalog is maintained by a set of servers connected via a DHT overlay. Second, signatures of different granularities are used for data distribution and query optimization. Last but not least, updates are supported via timestamps and regular reorganization.
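To make the idea of a catalog spread over servers in a DHT overlay more concrete, here is a minimal sketch of consistent-hashing placement, in the style of Chord. It is an illustration only and not DisGR's actual design: the class `CatalogRing`, the server names, and the metadata layout are all hypothetical, and a real overlay would route lookups between servers rather than hold every entry in one process.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on a 2**32 identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class CatalogRing:
    """Toy catalog (hypothetical): each entry lives on the successor server of its key."""

    def __init__(self, servers):
        # Sort servers by ring position so that lookups can use binary search.
        self._ring = sorted((ring_hash(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]
        # One in-memory dictionary per server stands in for remote storage.
        self._entries = {s: {} for s in servers}

    def server_for(self, name: str) -> str:
        # The first server at or after the key's position owns the key;
        # the modulo wraps around to the start of the ring.
        idx = bisect.bisect_left(self._positions, ring_hash(name))
        return self._ring[idx % len(self._ring)][1]

    def register(self, name: str, metadata: dict) -> None:
        self._entries[self.server_for(name)][name] = metadata

    def lookup(self, name: str):
        # Returns None when no entry with this name was registered.
        return self._entries[self.server_for(name)].get(name)


# Usage with made-up server and graph names.
servers = ["catalog-0", "catalog-1", "catalog-2"]
ring = CatalogRing(servers)
ring.register("forum_posts", {"columns": ["author", "tag", "timestamp"]})
owner = ring.server_for("forum_posts")
entry = ring.lookup("forum_posts")
```

Because placement depends only on the hash of the name, any node can compute which catalog server to ask without a central directory, which is the property a shared-nothing cluster needs from its catalog.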

Weining Qian
