xCrawl: a high-recall crawling method for Web mining

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate, within a limited period of time, as many Web pages as possible that contain relevant information, a task that is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant relative to the total number of relevant documents that exist. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method that outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. The method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form, and it was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
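The recall and precision measures mentioned in the abstract can be illustrated with a minimal sketch. The function name `crawl_metrics`, the set-of-URLs representation, and the example URLs are assumptions made for illustration only; they are not taken from xCrawl or its evaluation.

```python
def crawl_metrics(flagged: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one crawl.

    flagged:  URLs the crawler found and identified as relevant (hypothetical input)
    relevant: URLs of all relevant documents that exist (ground truth, assumed known)
    """
    if not flagged or not relevant:
        raise ValueError("both sets must be non-empty")
    hits = len(flagged & relevant)       # correctly identified documents
    precision = hits / len(flagged)      # share of flagged pages that are relevant
    recall = hits / len(relevant)        # share of relevant pages that were found
    return precision, recall


# Illustrative data only: 4 pages flagged, 5 relevant pages exist, 2 overlap.
flagged = {"a.html", "b.html", "c.html", "d.html"}
relevant = {"a.html", "b.html", "e.html", "f.html", "g.html"}
precision, recall = crawl_metrics(flagged, relevant)
```

In practice the full set of relevant documents on the Web is rarely known, so an evaluation of this kind typically has to rely on a bounded test collection or an estimate of that set.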
