URL-Based Web Page Classification: With n-Gram Language Models

It is sometimes important to classify a web page efficiently and reliably from the information in its Uniform Resource Locator (URL) alone, without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites so that it can block them. A URL is very concise and may be composed of concatenated words, so classification from this information alone is a very challenging task. Methods proposed for this task, such as the all-grams approach, which extracts all possible substrings as features, provide reasonable accuracy but do not scale well to large datasets.
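To make the scaling problem concrete, here is a minimal Python sketch of all-grams feature extraction as the abstract describes it: every contiguous substring of the URL becomes a feature. The function name `all_grams`, the `min_n` and `max_n` parameters, the lowercasing step, and the sample URL are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Optional


def all_grams(url: str, min_n: int = 1, max_n: Optional[int] = None) -> Counter:
    """Count every contiguous substring of `url` whose length is in [min_n, max_n].

    Hypothetical helper for illustration; lowercasing is an assumption.
    """
    text = url.lower()
    longest = len(text) if max_n is None else min(max_n, len(text))
    grams: Counter = Counter()
    for n in range(min_n, longest + 1):
        # Slide a window of width n across the whole string.
        for start in range(len(text) - n + 1):
            grams[text[start:start + n]] += 1
    return grams


# A made-up URL with concatenated words, as mentioned in the abstract.
features = all_grams("example.com/sportsnews")
print(len(features), "distinct substring features")
```

A string of length L yields L(L+1)/2 substring occurrences, so the feature space grows quadratically with URL length and keeps growing with the number of distinct URLs in the dataset. Bounding the substring length with `max_n` reduces this growth but does not remove it.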

Beatriz de la Iglesia | Tarek Amr Abdallah
