Chinese Web Infrastructure Building: Challenges and Our Roadmap

With the development of the World Wide Web, the storage and utilization of Web data have become a major challenge for the data management community. Although many commercial and academic tools have emerged, the structure, content, and user behavior of the Chinese Web have not been fully studied. We are building a Chinese Web Infrastructure to support such research. In this paper, we analyze the challenges of building such a system and discuss our technical roadmap.
