STEM: a suffix tree-based method for web data records extraction

To extract data records from Web pages automatically, an extraction algorithm must be both robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, we build a suffix tree on top of this sequence and apply four refining filters to screen out data regions that might not contain data records. To evaluate performance, we define an evaluation metric called pattern similarity and conduct experiments on five real data sets. The experimental results show that STEM outperforms state-of-the-art algorithms such as MDR, TPC and CTVS in precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in the Web pages, which indicates its potential applicability to a wide range of Web-scale data record extraction applications.
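The first two steps, turning tag paths into an identifier sequence and then looking for repeated runs in that sequence, can be illustrated with a small sketch. This is not the STEM implementation: all names here (`TagPathCollector`, `identifier_sequence`, `repeated_run`) are hypothetical, the repeated run is found by comparing neighbouring sorted suffixes rather than by building a real suffix tree (so it is not linear-time), and the four refining filters are omitted because the abstract does not specify them.

```python
from html.parser import HTMLParser

# Assumption: a small set of tags that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.paths = []  # one tag path per start tag

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass


def identifier_sequence(html):
    """Map each distinct tag path to an integer, in order of first appearance."""
    collector = TagPathCollector()
    collector.feed(html)
    ids = {}
    return [ids.setdefault(path, len(ids)) for path in collector.paths]


def repeated_run(seq):
    """Return a contiguous run that occurs at least twice without overlapping.

    Naive stand-in for a suffix tree: sort all suffixes and compare neighbours.
    """
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    best = []
    for a, b in zip(order, order[1:]):
        limit = abs(a - b)  # cap the match so the two occurrences cannot overlap
        k = 0
        while k < limit and max(a, b) + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        if k > len(best):
            best = seq[a:a + k]
    return best


page = "<ul>" + "<li><a></a><span></span></li>" * 3 + "</ul>"
seq = identifier_sequence(page)
print(seq)                # [0, 1, 2, 3, 1, 2, 3, 1, 2, 3]
print(repeated_run(seq))  # [1, 2, 3]
```

In this toy page the repeated run `[1, 2, 3]` corresponds to the tag paths `ul/li`, `ul/li/a` and `ul/li/span`, that is, one list item treated as a candidate data record.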
