Graph Grammar Based Web Data Extraction

Web data extraction has been a hot topic since the invention of the World Wide Web, because the large amount of information on the Web makes it challenging to retrieve useful information. Because Web sites differ widely in how they design and present information, it is hard to implement a general solution that extracts data across different sites. This paper presents a novel method based on graph grammars that extracts the same type of information from different Web sites without the need for training or adjustment. Our approach formalizes a common Web pattern as a graph grammar. Then, based on the visual layout and the HTML DOM structure, a Web page is abstracted as a spatial graph that highlights the essential spatial relations between information objects. Guided by the defined graph grammar, spatial parsing is performed on the spatial graph to extract structured records. We evaluated our approach on twenty-one different Web sites and achieved an F1-score of 97.49%, which indicates promising performance.
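To make the idea of a spatial graph concrete, the following is a minimal sketch and not the paper's implementation. It assumes each information object is already available as a labeled bounding box taken from the rendered layout. The `Box` type, the `spatial_graph` function, the relation names `left_of` and `above`, and the sample coordinates are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    name: str   # hypothetical label of an information object, e.g. "title"
    x: float    # left edge in rendered-page pixels
    y: float    # top edge in rendered-page pixels
    w: float    # width
    h: float    # height


def _overlap(a0: float, a1: float, b0: float, b1: float) -> bool:
    """Return True if the 1-D intervals [a0, a1] and [b0, b1] overlap."""
    return min(a1, b1) - max(a0, b0) > 0


def spatial_graph(boxes: List[Box]) -> List[Tuple[str, str, str]]:
    """Return (source, relation, target) edges between laid-out boxes."""
    edges = []
    for a in boxes:
        for b in boxes:
            if a is b:
                continue
            # a is left of b: they share vertical extent and a ends before b starts
            if _overlap(a.y, a.y + a.h, b.y, b.y + b.h) and a.x + a.w <= b.x:
                edges.append((a.name, "left_of", b.name))
            # a is above b: they share horizontal extent and a ends before b starts
            if _overlap(a.x, a.x + a.w, b.x, b.x + b.w) and a.y + a.h <= b.y:
                edges.append((a.name, "above", b.name))
    return edges


# Assumed layout of one product-like record: an image with a title and a price to its right.
boxes = [
    Box("image", 0, 0, 100, 100),
    Box("title", 110, 0, 200, 30),
    Box("price", 110, 40, 200, 30),
]
edges = spatial_graph(boxes)
for edge in edges:
    print(edge)
```

In this sketch the three edges describe one record's layout. A graph grammar in the sense described above would then be matched against such edges to recognize records, a step this example does not attempt.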
