It is often desirable to extract structured information from raw web pages to support better information browsing, query answering, and pattern mining. Many Information Extraction (IE) technologies are costly, however, and applying them at web scale is impractical. In this paper, we propose a novel prioritization approach in which candidate pages from the corpus are ordered according to their expected contribution to the extraction results, and those with higher estimated potential are extracted earlier. Systems employing this approach can stop the extraction process at any time when resources become scarce (i.e., when not all pages in the corpus can be processed), without worrying about having wasted extraction effort on unimportant pages. More specifically, we define a novel notion of the value of extraction results and design various mechanisms for estimating a candidate page's contribution to this value. We further design and build the EXTRACTION PRIORITIZATION (EP) system with efficient scoring and scheduling algorithms, and experimentally demonstrate that EP significantly outperforms the naive approach and is more flexible than the classifier approach.
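
The prioritization idea described above can be illustrated with a minimal sketch. It is not the EP system's scoring or scheduling algorithm: the function names `prioritized_extraction`, `estimate_value`, and `extract`, as well as the integer `budget`, are hypothetical and introduced only for illustration. The sketch assumes each page is scored once up front and that the budget is counted in pages processed.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")
Result = TypeVar("Result")


def prioritized_extraction(
    pages: Sequence[Page],
    estimate_value: Callable[[Page], float],  # hypothetical scoring function
    extract: Callable[[Page], Result],        # hypothetical (costly) extractor
    budget: int,                              # assumed: max number of pages to process
) -> List[Result]:
    """Extract from the highest-scoring pages first, stopping when the budget runs out."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    # The index breaks ties and avoids comparing page objects directly.
    heap = [(-estimate_value(page), index, page) for index, page in enumerate(pages)]
    heapq.heapify(heap)

    results: List[Result] = []
    while heap and budget > 0:
        _, _, page = heapq.heappop(heap)
        results.append(extract(page))
        budget -= 1
    return results
```

Because pages are processed in descending order of estimated value, stopping the loop early leaves only the lower-scored pages unprocessed, which is the property the approach relies on. A system that updates its estimates as results accumulate would need to rescore the remaining pages, which this sketch does not attempt.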