Paper information - Recognition of Data Records in Semi-structured Web-Pages Using Ontology and chi2 Statistical Distribution

Information extraction (IE) has emerged as a new discipline in computer science. IE systems employ intelligent algorithms to extract the required data and structure it so that it is suitable for querying. Most IE systems use the structure of a web page, e.g. its HTML tags, to recognize the target information. This article develops an algorithm that recognizes the main region of a web page containing the target information by means of an ontology, the web-page structure, and the χ² goodness-of-fit test. After the main region has been recognized, the records within it are identified, and each record is written to a text file.
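As a rough illustration of how a χ² goodness-of-fit test could be used to pick out a main region, the sketch below compares the counts of ontology terms observed in each candidate region against an expected distribution and selects the region with the smallest statistic. The abstract does not describe the actual procedure, so everything here is an assumption made for illustration: the function names, the region names, the three-term ontology, and the choice to rank regions by the lowest statistic are all hypothetical.

```python
def chi2_statistic(observed, expected_probs):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories.

    observed       -- counts of each ontology term found in a region
    expected_probs -- assumed expected proportion of each term (all > 0, summing to 1)
    """
    total = sum(observed)
    if total == 0:
        # A region with no ontology terms cannot fit the expected distribution.
        return float("inf")
    stat = 0.0
    for o, p in zip(observed, expected_probs):
        e = total * p  # expected count for this term
        stat += (o - e) ** 2 / e
    return stat


def best_matching_region(regions, expected_probs):
    """Return the name of the region whose term counts fit best (lowest statistic)."""
    return min(regions, key=lambda name: chi2_statistic(regions[name], expected_probs))


# Hypothetical data: three ontology terms and three candidate page regions.
expected = [0.5, 0.25, 0.25]
regions = {
    "sidebar": [1, 0, 9],
    "main": [10, 5, 5],
    "footer": [0, 0, 0],
}

print(best_matching_region(regions, expected))  # prints "main"
```

A real system would also compare the statistic against a critical value of the χ² distribution for the appropriate degrees of freedom; that step is left out of this sketch.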

Mehran Mohsenzadeh | Amir Masoud Rahmani | Reza Keshavarzi | Amin Keshavarzi
