A best-effort approach to an infrastructure for Chinese Web related research

This paper introduces the design of the Chinese Web Infrastructure (CWI), a prototype system aimed at forum data analysis. CWI takes a best-effort approach in three respects. 1) It extracts or annotates as much semantics from the web data as it can. 2) It provides flexible schemes that let users transform the web data into eXtensible Markup Language (XML) forms carrying additional semantic annotations, which are better suited to further analytical tasks. 3) A distributed graph repository, called DISGR, is used as the backend for managing the web data. The paper presents the design issues, reports the progress of the implementation, and discusses the research issues currently under study.
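To make the second point concrete, the following is a minimal sketch of turning one forum post into an XML form with semantic annotations. It is not CWI's actual scheme: the element names (`post`, `body`, `annotations`, `entity`), the dictionary keys, and the rule of silently skipping annotations that cannot be located in the text are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def post_to_xml(post, annotations):
    """Wrap one forum post in XML and attach semantic annotations.

    `post` is a dict with hypothetical keys "id", "author", and "text".
    `annotations` is a list of (entity_text, entity_type) pairs, e.g. the
    output of some entity detector (assumed, not part of CWI's description).
    """
    root = ET.Element("post", {"id": post["id"], "author": post["author"]})
    body = ET.SubElement(root, "body")
    body.text = post["text"]

    entities = ET.SubElement(root, "annotations")
    for entity_text, entity_type in annotations:
        # Best effort (assumed rule): keep only annotations whose text
        # actually occurs in the post, and skip the rest without failing.
        if entity_text in post["text"]:
            entity = ET.SubElement(entities, "entity", {"type": entity_type})
            entity.text = entity_text
    return root


def to_string(root):
    """Serialize the element tree to a Unicode XML string."""
    return ET.tostring(root, encoding="unicode")
```

Calling `post_to_xml` with a post and a list of candidate annotations yields an element tree that standard XML tooling, such as XPath-style queries, can then consume.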
