Specification and discovery of web patterns: a graph grammar approach

Finding useful information on the Web becomes increasingly difficult as the volume of Web data grows rapidly. To facilitate effective browsing, Web designers usually display information of the same type with a consistent layout, referred to as a Web pattern. Discovering Web patterns benefits many applications, such as structured data extraction. This paper presents a generic framework, based on graph grammars, for discovering Web patterns and recognizing their instances (i.e., structured data). In our framework, a Web pattern is specified visually yet formally as a graph grammar, which is induced automatically by a grammar induction engine. The distinguishing feature of the engine is that it converts the problem of (2-dimensional) graph grammar induction into (1-dimensional) string induction. Based on the induced pattern, matching instances are recognized in Web pages through a graph parsing process. We have evaluated the framework on twenty-one e-commerce Web sites, and the results are promising, with a high F1-score.
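
To give a rough intuition for why reducing pattern discovery to a 1-dimensional string problem is attractive, the sketch below searches a flat token sequence for the consecutively repeated unit that covers the most tokens. It is only an illustration under stated assumptions, not the induction engine described in this paper: the function name `find_repeated_unit`, the token labels, and the idea that a page region has already been serialized into such labels are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeated_unit(
    tokens: List[str], min_repeats: int = 2
) -> Optional[Tuple[List[str], int, int]]:
    """Return (unit, start, repeats) for the consecutive repeat covering the most tokens.

    Ties in coverage are broken in favor of the shorter unit.
    Hypothetical helper for illustration only.
    """
    best = None  # ((covered, -unit_len), unit, start, repeats)
    n = len(tokens)
    for unit_len in range(1, n // min_repeats + 1):
        for start in range(0, n - unit_len * min_repeats + 1):
            unit = tokens[start:start + unit_len]
            repeats = 1
            pos = start + unit_len
            # Count how many times the unit repeats back to back.
            while tokens[pos:pos + unit_len] == unit:
                repeats += 1
                pos += unit_len
            if repeats >= min_repeats:
                key = (repeats * unit_len, -unit_len)
                if best is None or key > best[0]:
                    best = (key, unit, start, repeats)
    if best is None:
        return None
    _, unit, start, repeats = best
    return unit, start, repeats


# Hypothetical serialization of a product listing page into layout tokens.
page = ["nav", "img", "title", "price", "img", "title", "price",
        "img", "title", "price", "footer"]
result = find_repeated_unit(page)
```

In this toy input, the repeated unit corresponds to one product record, and each repetition is one instance of the pattern. A real page has spatial relations that a flat sequence does not capture, which is why the framework works with graph grammars rather than plain strings.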
