Paper Information - Ontology Suitability for Uncertain Extraction of Information from Multi-Record Web Documents

Ontology Suitability for Uncertain Extraction of Information from Multi-Record Web Documents

Ontology-based data extraction from multi-record Web documents works well, but only if the ontology is suitable for the Web document. How do we know whether the ontology is suitable? To answer this question, we present an approach based on three heuristics: density, schema, and grouping. We encode the first heuristic as a density function and use probabilistic models for the second and third. We argue that these heuristics, together with our computational models for them, correctly determine the suitability of a Web document for a given ontology.
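The abstract does not give the density function itself, so the following is only a minimal sketch of one plausible reading: density as the fraction of a document's characters covered by matches of the ontology's recognizers. The `density` function, the regex-based `car_ads` recognizers, and the sample strings are all hypothetical illustrations, not the authors' definitions.

```python
import re
from typing import Mapping


def density(document: str, recognizers: Mapping[str, str]) -> float:
    """Fraction of characters in `document` covered by at least one match.

    ASSUMPTION: this coverage ratio is an illustrative stand-in for the
    paper's density heuristic, whose exact form the abstract does not state.
    """
    if not document:
        return 0.0
    covered = [False] * len(document)
    for pattern in recognizers.values():
        for match in re.finditer(pattern, document):
            # Mark every character inside the match; overlaps count once.
            for i in range(match.start(), match.end()):
                covered[i] = True
    return sum(covered) / len(document)


# Hypothetical "ontology" for car ads: one regex recognizer per field.
car_ads = {
    "year": r"\b(?:19|20)\d{2}\b",
    "price": r"\$\d[\d,]*",
    "mileage": r"\b\d[\d,]*\s*miles\b",
}

ad_page = "1998 Honda Civic $4,500 62,000 miles. 2001 Ford Focus $6,200 30,000 miles."
unrelated = "The committee will meet on Tuesday to discuss the agenda."

ad_density = density(ad_page, car_ads)
unrelated_density = density(unrelated, car_ads)
```

Under this reading, a page of car ads should score noticeably higher than an unrelated page, and a threshold on the score would serve as the suitability signal. The schema and grouping heuristics, which the abstract says are modeled probabilistically, are not sketched here.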
