Recognition of Data Records in Semi-structured Web Pages Using Ontology and χ² Statistical Distribution

Information extraction (IE) has emerged as a novel discipline in computer science. In IE, intelligent algorithms are employed to extract the required data and structure them so that they are suitable for querying. Most IE systems use the structure of a web page, e.g. its HTML tags, to recognize the sought information. In this article, an algorithm is developed that recognizes the main region of a web page containing the sought information by means of an ontology, the web-page structure, and the χ² goodness-of-fit test. After the main region is recognized, the records within it are identified, and each record is written to a text file.
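As an illustration of the χ² goodness-of-fit test mentioned above, the following minimal sketch computes the Pearson statistic for a candidate region. The abstract does not say how the test is applied, so everything here is an assumption made for illustration: the function name `chi_square_statistic`, the idea of counting ontology-term matches per concept in a region, the uniform expected proportions, and the example counts are all hypothetical.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson goodness-of-fit statistic: sum((O - E)^2 / E)."""
    total = sum(observed)
    stat = 0.0
    for count, prob in zip(observed, expected_probs):
        expected = total * prob  # expected count under the reference distribution
        stat += (count - expected) ** 2 / expected
    return stat


# Standard chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991

# Hypothetical counts of matches for three ontology concepts in two regions.
expected_probs = [1 / 3, 1 / 3, 1 / 3]   # assumed reference proportions
balanced_region = [18, 22, 20]           # concepts appear about equally often
skewed_region = [50, 5, 5]               # dominated by a single concept

stat_balanced = chi_square_statistic(balanced_region, expected_probs)
stat_skewed = chi_square_statistic(skewed_region, expected_probs)

# A region "fits" when the statistic stays below the critical value.
fits_balanced = stat_balanced < CRITICAL_DF2_ALPHA_05
fits_skewed = stat_skewed < CRITICAL_DF2_ALPHA_05
```

Under these assumed numbers, the balanced region is consistent with the reference distribution and the skewed region is not. A real system would derive the expected proportions from the ontology and the data rather than fixing them by hand.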
