CALA: An unsupervised URL-based web page classification system

Unsupervised web page classification refers to the problem of clustering the pages of a web site so that each cluster contains pages that can all be assigned to a single class. Existing proposals for web page classification do not fulfill a number of requirements that would make them suitable for enterprise web information integration, namely: to rely on lightweight crawling, so as to avoid interfering with the normal operation of the web site; to be unsupervised, which avoids the need for a training set of pre-classified pages; and to use features from outside the page to be classified, which avoids having to download it. In this article, we propose CALA, a new automated proposal to generate URL-based web page classifiers. Our proposal builds a number of URL patterns that represent the different classes of pages in a web site, so that further pages can be classified by matching their URLs against the patterns. Its salient features are that it fulfills all of the previous requirements and that it has been validated by a number of experiments on real-world, top-visited web sites. Our validation shows that CALA is very effective and efficient in practice.
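
To make the idea of classifying pages by matching their URLs against patterns more concrete, the following is a minimal sketch and not CALA's actual algorithm. It assumes that the patterns have already been built and that each one is a list of path segments in which "*" stands for any single segment. The pattern format, the class names, and the functions tokenize, matches, and classify are all hypothetical and introduced only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-written patterns (an assumption for illustration):
# one token per URL path segment, where "*" matches any single segment.
PATTERNS = {
    "product": ["products", "*"],
    "author": ["authors", "*", "profile"],
}


def tokenize(url):
    """Split the path of a URL into its non-empty segments."""
    return [segment for segment in urlsplit(url).path.split("/") if segment]


def matches(pattern, tokens):
    """Return True if the tokens fit the pattern segment by segment."""
    if len(pattern) != len(tokens):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern, tokens))


def classify(url, patterns=PATTERNS):
    """Return the class of the first matching pattern, or None if none matches."""
    tokens = tokenize(url)
    for label, pattern in patterns.items():
        if matches(pattern, tokens):
            return label
    return None


if __name__ == "__main__":
    for url in (
        "http://example.com/products/1234",
        "http://example.com/authors/jane/profile",
        "http://example.com/about",
    ):
        print(url, "->", classify(url))
```

Note that the sketch only inspects the URL string, so a page can be assigned a class without being downloaded, which is the property the third requirement above asks for.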
