Synergic Data Extraction and Crawling for Large Web Sites

Data collected from data-intensive web sites is widely used in applications and online services. We present a new methodology for the synergic specification of crawling and wrapping tasks on large data-intensive web sites, so that wrappers can run while the crawler is still collecting pages at the different levels of the derived site structure. The methodology is supported by a working system aimed at non-expert users and built on a semi-automatic inference engine. By tracking and learning from the user's browsing activity, the system derives a model that describes both the topological structure of the site's navigational paths and the inner structure of its HTML pages. This model allows the system to generate and execute crawling and wrapping definitions in an interleaved process. To collect a representative sample set to feed the inference engine, we also propose a solution to an often neglected problem, which we call the Sampling Problem. An extensive experimental evaluation shows that the system and its underlying methodology operate successfully on most of the structured sites available on the Web.
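As a rough illustration of the interleaved process described above, the following minimal sketch runs a wrapper on each page as soon as the crawler reaches it, instead of crawling first and extracting later. It is not the system's implementation: the in-memory SITE, the page classes, the link types, the MODEL layout, and the function crawl_and_wrap are all hypothetical names chosen for this example, and a real system would fetch and parse HTML rather than read a dictionary.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> page class, attribute values,
# and outgoing links grouped by link type. Stands in for fetched HTML.
SITE = {
    "/": {"cls": "home", "data": {},
          "links": {"category": ["/c/1", "/c/2"], "about": ["/about"]}},
    "/c/1": {"cls": "category", "data": {"name": "Books"},
             "links": {"item": ["/i/1", "/i/2"]}},
    "/c/2": {"cls": "category", "data": {"name": "Music"},
             "links": {"item": ["/i/3"]}},
    "/i/1": {"cls": "item", "data": {"title": "A", "price": "10"}, "links": {}},
    "/i/2": {"cls": "item", "data": {"title": "B", "price": "12"}, "links": {}},
    "/i/3": {"cls": "item", "data": {"title": "C", "price": "7"}, "links": {}},
    "/about": {"cls": "other", "data": {}, "links": {}},
}

# Hypothetical model: for each page class, which link types the crawler
# follows (navigational structure) and which attributes the wrapper
# extracts (inner page structure).
MODEL = {
    "home": {"follow": ["category"], "extract": []},
    "category": {"follow": ["item"], "extract": ["name"]},
    "item": {"follow": [], "extract": ["title", "price"]},
}


def crawl_and_wrap(site, model, start):
    """Breadth-first crawl that wraps each page at the moment it is visited."""
    records = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        page = site[url]
        rules = model.get(page["cls"])
        if rules is None:
            continue  # page class not covered by the model: skip it
        # Wrapping step: extract the modelled attributes right away.
        if rules["extract"]:
            record = {field: page["data"][field] for field in rules["extract"]}
            record["url"] = url
            records.append(record)
        # Crawling step: enqueue only the link types the model says to follow.
        for link_type in rules["follow"]:
            for target in page["links"].get(link_type, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return records


if __name__ == "__main__":
    for row in crawl_and_wrap(SITE, MODEL, "/"):
        print(row)
```

Because only the link types named in the model are followed, pages outside the modelled navigational paths (here "/about") are never visited, and extracted records become available while pages at deeper levels are still waiting in the queue.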
