Fast and Scalable Pattern Mining for Media-Type Focused Crawling

Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without downloading its content. The idea is to apply a learning approach to features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited to selecting resources of specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
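To make the idea of learning from patterns in the resource identifier concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a simple tokenization of the URI into component-tagged tokens and a naive vote over feature/label co-occurrence counts; the function names `uri_features`, `train`, and `predict`, and the example URIs and labels, are hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def uri_features(uri):
    """Split a URI into lowercase tokens tagged with their URI component."""
    parts = urlsplit(uri)
    features = []
    components = (("host", parts.netloc), ("path", parts.path), ("query", parts.query))
    for name, value in components:
        for token in re.split(r"[^A-Za-z0-9]+", value):
            if token:
                features.append(f"{name}:{token.lower()}")
    # Add the file extension of the last path segment as its own feature.
    last_segment = parts.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        features.append("ext:" + last_segment.rsplit(".", 1)[-1].lower())
    return features


def train(examples):
    """Count how often each feature co-occurs with each media-type label."""
    counts = defaultdict(Counter)
    for uri, media_type in examples:
        for feature in uri_features(uri):
            counts[feature][media_type] += 1
    return counts


def predict(counts, uri):
    """Return the label with the most votes, or None if no feature is known."""
    votes = Counter()
    for feature in uri_features(uri):
        votes.update(counts.get(feature, {}))
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical training data: (URI, media-type label) pairs.
model = train([
    ("http://a.example/x/song.mp3", "audio"),
    ("http://b.example/img/pic.png", "image"),
])
print(predict(model, "http://c.example/y/track.mp3"))
```

Because the prediction uses only the URI string, no HTTP request is issued for the candidate resource, which is what makes this kind of classification cheap enough to run on every discovered link. A real system would replace the voting step with a trained classifier or mined patterns.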
