Paper information - Web information extraction based on news domain ontology theory

Web information extraction based on news domain ontology theory

Because current web information extraction methods cannot adapt to varied page structures, this paper proposes a web information extraction method based on news domain ontology. Target regions are located accurately, and the information of interest is extracted precisely, using extraction rules generated from the news domain ontology. An information extraction system based on the news domain ontology is implemented using page processing, page conversion, XPath, and related techniques. Tests on news sites show that the proposed approach does not rely on page structure and can improve the recall and precision of information extraction.
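The idea of generating XPath extraction rules from a domain ontology can be illustrated with a minimal sketch. This is not the authors' system: the mini-ontology NEWS_ONTOLOGY, the helper names build_rules and extract, and both sample pages are hypothetical. It assumes the page conversion step has already produced well-formed XHTML, and it uses only the XPath subset supported by Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-ontology: each news concept lists class-attribute
# terms that may signal it on a page. Not taken from the paper.
NEWS_ONTOLOGY = {
    "title": ["headline", "title"],
    "author": ["author", "byline"],
    "date": ["date", "pubdate"],
}


def build_rules(ontology):
    """Turn each concept into a list of XPath expressions (ElementTree subset)."""
    return {
        concept: [f".//*[@class='{term}']" for term in terms]
        for concept, terms in ontology.items()
    }


def extract(xhtml, rules):
    """Apply the rules to well-formed XHTML; the first matching rule per concept wins."""
    root = ET.fromstring(xhtml)
    record = {}
    for concept, xpaths in rules.items():
        for xpath in xpaths:
            node = root.find(xpath)
            if node is not None:
                # Collect all text inside the matched element.
                record[concept] = "".join(node.itertext()).strip()
                break
    return record


# Two pages with different layouts (div/span versus table) but the same content.
PAGE_A = (
    "<html><body><div><h1 class='headline'>Flood warning issued</h1>"
    "<span class='byline'>A. Reporter</span>"
    "<span class='date'>2010-05-01</span></div></body></html>"
)
PAGE_B = (
    "<html><body><table><tr><td class='title'>Flood warning issued</td></tr>"
    "<tr><td class='author'>A. Reporter</td>"
    "<td class='pubdate'>2010-05-01</td></tr></table></body></html>"
)

rules = build_rules(NEWS_ONTOLOGY)
record_a = extract(PAGE_A, rules)
record_b = extract(PAGE_B, rules)
print(record_a)
print(record_b)
```

Because the rules are keyed on ontology terms and not on a fixed element path, the same rule set yields the same record from both layouts. A real system would need much richer term matching than exact class-attribute values.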

Li Liu | Junfang Shi

[1] David W. Embley, et al. Ontology-based extraction and structuring of information from data-rich unstructured documents, 1998, CIKM '98.

[2] Gao Yue. Rules Construction and Implementation in DOM-based Web Information Extraction, 2007.

[3] Tansel Özyer, et al. Employing Clustering Techniques for Automatic Information Extraction From HTML Documents, 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[4] Gio Wiederhold, et al. Mediators in the architecture of future information systems, 1992, Computer.

[5] Wenfei Fan, et al. Rewriting Regular XPath Queries on XML Views, 2007, 2007 IEEE 23rd International Conference on Data Engineering.