Extracting Structured Data from Ajax Site

Ajax is an important approach for enabling rich interactivity between web servers and end users in the Web 2.0 era. At the same time, the structured data in Ajax web pages cannot be extracted easily, because the content is loaded asynchronously. In this paper, we propose a technique for extracting structured data from Ajax-based web pages. First, an AjaxFetcher component is created to fetch the dynamic page content by using an embedded browser. Second, two different strategies are used to extract the structured data from the obtained page content. In particular, for pages that contain multiple records, we propose an automatic approach to determine each possible record. Experimental results show that it is feasible to fetch Ajax pages and extract structured data from them.
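
To make the multi-record step concrete, the following is a minimal sketch of one way repeated records could be located in page content that has already been rendered. It is not the method proposed in the paper: it assumes the HTML was obtained beforehand (the embedded-browser fetch is omitted), and it groups sibling subtrees by an exact tag-structure match rather than by a similarity measure. The names `Node`, `TreeBuilder`, `find_records`, and the sample `page` are hypothetical and used only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: a tag plus strings and child nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.content = []  # strings and child Nodes in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def signature(self):
        # Fingerprint of the subtree's tag structure, ignoring text.
        return f"{self.tag}({','.join(c.signature() for c in self.children)})"

    def text(self):
        parts = [c.text() if isinstance(c, Node) else c for c in self.content]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def find_records(html, min_records=2):
    """Return the text of the largest group of structurally identical siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        kids = node.children
        stack.extend(kids)
        if not kids:
            continue
        sig, count = Counter(k.signature() for k in kids).most_common(1)[0]
        if count >= min_records and count > len(best):
            best = [k for k in kids if k.signature() == sig]
    return [record.text() for record in best]


page = """
<div id="results">
  <h1>Books</h1>
  <ul>
    <li><b>Title A</b><span>10.00</span></li>
    <li><b>Title B</b><span>12.50</span></li>
    <li><b>Title C</b><span>8.75</span></li>
  </ul>
</div>
"""
print(find_records(page))  # ['Title A 10.00', 'Title B 12.50', 'Title C 8.75']
```

Exact signature matching keeps the sketch short, but it misses records whose markup varies slightly (for example, an optional field); a tolerant comparison such as a tree or string edit distance would be needed to handle those cases.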
