URL-Based Web Page Classification with n-Gram Language Models

In some applications it is important to classify a web page efficiently and reliably using only the information in its Uniform Resource Locator (URL), without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites so that it can block them. A URL is very concise and may consist of concatenated words, so classification from this information alone is a challenging task. Existing methods, such as the all-grams approach, which extracts all possible substrings as features, provide reasonable accuracy but do not scale well to large datasets.
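To make the scaling concern concrete, the following is a minimal sketch of all-grams style feature extraction. It is an illustration only, not the implementation from any cited work: the function name `all_grams`, the tokenisation rule (lowercase, then split on non-alphanumeric characters), and the default length range `min_n=4` to `max_n=8` are assumptions chosen for the example.

```python
import re
from collections import Counter


def all_grams(url, min_n=4, max_n=8):
    """Count every character n-gram of length min_n..max_n inside each URL token.

    Hypothetical helper for illustration; the tokenisation and the length
    range are assumptions, not taken from the source text.
    """
    # Lowercase the URL and split it on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

    grams = Counter()
    for token in tokens:
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the token.
            for i in range(len(token) - n + 1):
                grams[token[i:i + n]] += 1
    return grams


# Example usage: each distinct substring becomes one feature.
features = all_grams("http://www.cheapflights.com/deals")
print(len(features), "distinct n-gram features from a single URL")
```

Because every substring in the chosen length range becomes a separate feature, the feature vocabulary grows quickly as more URLs are added, which is one way to see why such an approach may not scale well to large datasets.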
