Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach

Structured community portals extract and integrate information from raw Web pages to present a unified view of the entities and relationships in a community. In this paper we argue that a top-down, compositional, and incremental approach is a good way to build such portals. Compared to current approaches that employ complex monolithic techniques, this approach is easier to develop, understand, debug, and optimize. We first select a small set of important community sources. Next, we compose plans that extract and integrate data from these sources, using a set of extraction/integration operators. Executing these plans yields an initial structured portal. We then expand this portal incrementally by monitoring how the current data sources evolve, in order to detect and add new data sources. We describe our initial solutions to each of these steps, and a case study that applies them to build DBLife, a portal for the database community. We found that DBLife could be built quickly and achieve high accuracy using simple extraction/integration operators, and that it can be maintained and expanded with little human effort. Together, the initial solutions and the case study demonstrate the feasibility and potential of our approach.
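
To make the compositional idea concrete, the following is a minimal sketch, not the DBLife implementation. It assumes a plan is a plain chain of small operators applied to pages from a hand-picked source list, and that new sources are found by scanning known pages for unseen links. All names (`extract_mentions`, `match_by_name`, `compose`, `discover_sources`) and the example URLs are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Page = Tuple[str, str]      # (url, page text)
Mention = Tuple[str, str]   # (url, entity name found on that page)


def extract_mentions(names: Iterable[str]) -> Callable[[List[Page]], List[Mention]]:
    """Hypothetical extraction operator: dictionary lookup of known names."""
    patterns = [(n, re.compile(re.escape(n), re.IGNORECASE)) for n in names]

    def op(pages: List[Page]) -> List[Mention]:
        return [(url, name)
                for url, text in pages
                for name, pat in patterns
                if pat.search(text)]
    return op


def match_by_name(mentions: List[Mention]) -> Dict[str, Set[str]]:
    """Hypothetical integration operator: merge mentions sharing a normalized name."""
    entities: Dict[str, Set[str]] = {}
    for url, name in mentions:
        entities.setdefault(name.lower(), set()).add(url)
    return entities


def compose(*ops: Callable) -> Callable:
    """Build a plan by chaining operators; each consumes the previous output."""
    def plan(data):
        for op in ops:
            data = op(data)
        return data
    return plan


def discover_sources(pages: List[Page], known: Set[str]) -> Set[str]:
    """Incremental step (assumed): report URLs linked from known pages but not yet tracked."""
    found: Set[str] = set()
    for _, text in pages:
        found.update(re.findall(r"https?://[^\s\"'<>]+", text))
    return found - known


# Usage on two made-up pages from an initial source list.
pages = [
    ("http://example.org/a", "Talk by Alice Smith. See http://example.org/c"),
    ("http://example.org/b", "alice smith and Bob Lee co-authored a paper."),
]
plan = compose(extract_mentions(["Alice Smith", "Bob Lee"]), match_by_name)
portal = plan(pages)
new_sources = discover_sources(pages, known={url for url, _ in pages})
```

Because each operator is a small function with a simple input and output, any one of them can be inspected, tested, or swapped for a more accurate variant without touching the rest of the plan, which is the property the abstract contrasts with monolithic techniques.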
