Searching patterns for relation extraction over the web: rediscovering the pattern-relation duality

While tuple extraction for a given relation has been an active research area, its dual problem of pattern search (finding and ranking patterns in a principled way) has not been studied explicitly. In this paper, we propose and address the problem of pattern search in addition to tuple extraction. Our objectives are reusability for pattern search and scalability for tuple extraction, so that our approach can be applied to very large corpora such as the Web. As the key foundation, we propose a conceptual model, PRDualRank, that captures the notion of precision and recall for both tuples and patterns in a principled way, leading to the "rediscovery" of the Pattern-Relation Duality: the formal quantification of the reinforcement between patterns and tuples with the metrics of precision and recall. We also develop a concrete framework for PRDualRank, guided by the principles of a perfect sampling process over a complete corpus. Finally, we evaluate our framework over the real Web. Experiments show that on all three target relations our principled approach greatly outperforms the previous state-of-the-art system in both effectiveness and efficiency. In particular, we improve the optimal F-score by up to 64%.
