Structured AJAX Data Extraction Based on Agricultural Ontology

Abstract: A growing number of web pages use AJAX (Asynchronous JavaScript and XML) for its rich interactivity and incremental communication. We observe that AJAX content, which traditional crawlers cannot see, is generally well structured and belongs to one specific domain. Extracting structured data from AJAX content and annotating its semantics is therefore significant for further applications. This paper proposes a structured AJAX data extraction method for the agricultural domain, based on an agricultural ontology. First, Crawljax, an open-source AJAX crawling tool, was extended to explore and retrieve AJAX content. Second, the retrieved content was partitioned into items, which were then classified with the help of the agricultural ontology; HTML tags and punctuation were used to segment the retrieved content into entity items. Finally, the entity items were clustered, and semantic annotations were assigned to the clustering results according to the agricultural ontology. Experimental evaluation showed the proposed approach to be effective in resource exploration, entity extraction, and semantic annotation.
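The segmentation and annotation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy ontology, the function names `segment` and `annotate`, and the punctuation set are all assumptions made for illustration, and the clustering step is replaced here by a plain dictionary lookup that groups items by the ontology concept they match.

```python
import re
from html.parser import HTMLParser

# Hypothetical toy ontology (concept -> known terms); not taken from the paper.
TOY_ONTOLOGY = {
    "Crop": {"wheat", "rice", "maize"},
    "Pest": {"aphid", "locust"},
}


class _TextChunks(HTMLParser):
    """Collects text runs, so that every HTML tag acts as a separator."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def segment(html):
    """Split an HTML fragment into candidate entity items.

    Tag boundaries split the text first; punctuation then splits
    each text run further. The punctuation set is an assumption.
    """
    parser = _TextChunks()
    parser.feed(html)
    parser.close()
    items = []
    for chunk in parser.chunks:
        for piece in re.split(r"[,;:]", chunk):
            piece = piece.strip()
            if piece:
                items.append(piece)
    return items


def annotate(items, ontology=TOY_ONTOLOGY):
    """Group items under the first ontology concept whose terms contain them.

    This lookup stands in for the clustering-plus-annotation step;
    unmatched items are collected under "Unknown".
    """
    groups = {}
    for item in items:
        label = next(
            (concept for concept, terms in ontology.items()
             if item.lower() in terms),
            "Unknown",
        )
        groups.setdefault(label, []).append(item)
    return groups


if __name__ == "__main__":
    fragment = "<ul><li>Wheat, rice</li><li>Aphid</li><li>tractor</li></ul>"
    print(annotate(segment(fragment)))
```

In a real pipeline the HTML fragment would come from DOM states retrieved by the AJAX crawler, and the lookup would be replaced by clustering followed by matching against a full agricultural ontology.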
