Web Data Extraction Using Tree Structure Algorithms - A Comparison

Web pages now expose a large amount of structured data that many advanced applications require. This data is reached through Web query interfaces, and the information retrieved this way is called 'deep' or 'hidden' data. Deep data is embedded in Web pages in the form of data records; these pages are generated dynamically and presented to users as HTML documents along with other content. Mined effectively, they can be a gold mine of business information. Web Data Extraction systems, or Web wrappers, are software applications that extract information from Web sources such as Web pages. Such a system typically interacts with a Web source, extracts the data stored in it, converts the extracted data into a convenient structured format, and stores it for further use. This paper first describes the development of such a wrapper, which takes search engine result pages as input and converts them into a structured format. Second, it proposes a new algorithm, the Improved Tree Matching algorithm, which is based on the efficient Simple Tree Matching (STM) algorithm. Finally, the approach is compared with existing work. Experimental results show that this approach can extract Web data with lower complexity than other existing approaches.

Index Terms: Web Data Extraction, Document Object Model (DOM), Improved Tree Matching algorithm.

Based on natural language processing: this method of data extraction first analyzes the structure of clauses and phrases and the relationships between clauses, and then generates extraction rules based on syntax and semantics. It is therefore suited to source documents that contain a lot of text, especially grammatical text. The text in Web pages, however, usually consists of imperfectly structured sentences, which narrows the method's applicability. Classical systems based on this principle include RAPIER (4) and WHISK (5).

Based on wrapper induction (summing up extraction rules): this method of information extraction uses machine learning techniques to learn structural features from a number of Web pages, and then derives extraction rules from those features. Usually, one wrapper can handle only one specific source. Extracting information from different sources therefore requires a library of wrappers, which entails a heavy workload. The main tools using this method are WIEN (6) and SoftMealy (7).
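For orientation, the Simple Tree Matching (STM) algorithm named in the abstract can be sketched as below. This is a minimal sketch of the classic STM recurrence as it is commonly described in the literature, not the Improved Tree Matching algorithm proposed in the paper, whose details are not given here. The `Node` class, the `tree_size` and `normalized_similarity` helpers, and the sample trees are hypothetical names introduced only for illustration; a real wrapper would build such trees from the DOM of a result page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum top-down matching of a and b."""
    # Roots with different labels cannot be matched, so nothing below them counts.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a and the first j of b,
    # preserving sibling order (same shape as a longest-common-subsequence table).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    # +1 accounts for the matched pair of roots.
    return dp[m][n] + 1


def tree_size(t: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(c) for c in t.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matching size divided by the average tree size."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two illustrative "data records" that differ by one extra cell.
r1 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("span")])])
r2 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("img")]),
                 Node("td", [Node("span")])])

print(simple_tree_matching(r1, r2))  # 5
print(round(normalized_similarity(r1, r2), 4))  # 0.8333
```

A high normalized score between sibling subtrees is the usual signal that they are instances of the same record template; the threshold at which two subtrees count as similar is a tuning choice that this sketch leaves open.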

[1] Jing Li, et al. Web Data Extraction Based on Tree Structure Analysis and Template Generation, 2010, 2010 International Conference on E-Product E-Service and E-Entertainment.

[2] Raymond J. Mooney, et al. Relational Learning of Pattern-Match Rules for Information Extraction, 1999, CoNLL.

[3] Khaled Shaalan, et al. A Survey of Web Information Extraction Systems, 2006, IEEE Transactions on Knowledge and Data Engineering.

[4] Chun-Nan Hsu, et al. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web, 1998, Inf. Syst.

[5] Nicholas Kushmerick, et al. Wrapper Induction for Information Extraction, 1997, IJCAI.

[6] Wei Liu, et al. ViDE: A Vision-Based Approach for Deep Web Data Extraction, 2010, IEEE Transactions on Knowledge and Data Engineering.

[7] Yang Zhang, et al. Web Data Extraction Based on Simple Tree Matching, 2010, 2010 WASE International Conference on Information Engineering.

[8] Valter Crescenzi, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2001, VLDB.

[9] Eugene J. Shekita, et al. Querying XML Views of Relational Data, 2001, VLDB.

[10] Calton Pu, et al. XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources, 2000, Proceedings of 16th International Conference on Data Engineering.

[11] Arnaud Sahuguet, et al. Building Intelligent Web Applications Using Lightweight Wrappers, 2001, Data Knowl. Eng.