Weaving Entities into Relations: From Page Retrieval to Relation Mining on the Web

With its sheer amount of information, the Web is clearly an important frontier for data mining. While Web mining must start with content on the Web, there is no effective “search-based” mechanism to help sift through that information. Our goal is to provide such an online search-based facility that supports query primitives upon which Web mining applications can be built. As a first step, this paper targets entity-relation discovery, or ER discovery, as a useful function: weaving scattered entities on the Web into coherent relations. First, as our proposal, we formalize the concept of ER discovery. Second, to realize ER discovery, as our main thesis, we abstract tuple ranking, the essential challenge of ER discovery, as pattern-based co-occurrence analysis. Finally, as our key insight, we observe that such relation mining shares the same core functions as traditional page-retrieval systems, which enables us to build ER discovery upon today’s search engines almost for free. We report our system prototype and testbed, WISDM-ER, with a real Web corpus. Our case studies show high promise, achieving 83% to 91% accuracy on real benchmark queries, and thus demonstrate the real possibility of enabling ad hoc Web mining tasks with online ER discovery.
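
To make the idea of tuple ranking as pattern-based co-occurrence analysis more concrete, the following is a minimal sketch, not the WISDM-ER implementation. It assumes a toy in-memory corpus in place of a Web index, a substring match in place of a search-engine hit count, two hand-written textual patterns, and a simple normalized co-occurrence score. The names `CORPUS`, `PATTERNS`, `hits`, `score`, and `rank` are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Toy corpus standing in for Web pages (hypothetical data, not from the paper).
CORPUS = [
    "Paris is the capital of France and a major European city.",
    "The capital of France is Paris.",
    "Berlin is the capital of Germany.",
    "Paris hosted a conference attended by visitors from Germany.",
    "Lyon is a large city in France.",
]

# Hand-written patterns connecting two entities x and y (an assumption;
# how patterns are obtained is outside the scope of this sketch).
PATTERNS = ["{x} is the capital of {y}", "capital of {y} is {x}"]


def hits(phrase: str, corpus: Iterable[str] = CORPUS) -> int:
    """Count pages containing the phrase; a stand-in for a search-engine hit count."""
    needle = phrase.lower()
    return sum(1 for page in corpus if needle in page.lower())


def score(x: str, y: str) -> float:
    """Pattern-based co-occurrence of (x, y), normalized by the entities' own frequencies."""
    joint = sum(hits(p.format(x=x, y=y)) for p in PATTERNS)
    denom = hits(x) * hits(y)
    return joint / denom if denom else 0.0


def rank(candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order candidate tuples from most to least strongly supported."""
    return sorted(candidates, key=lambda t: score(*t), reverse=True)


candidates = [
    ("Paris", "Germany"),
    ("Paris", "France"),
    ("Lyon", "France"),
    ("Berlin", "Germany"),
]
ranked = rank(candidates)
```

In this sketch, the only operation that touches the corpus is `hits`, a page-level count of the kind a page-retrieval system already supports. That is the sense in which relation mining can reuse page-retrieval machinery; the scoring formula itself is an illustrative choice, not one taken from the paper.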
