Fast webpage classification using URL features

We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is faster than typical web page classification, as the pages do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness on two standardized domains. Our results show that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
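
As a rough illustration of the feature pipeline described above, the following Python sketch splits a URL into tokens and derives component, sequential and orthographic features from them. It is a minimal sketch under stated assumptions, not the authors' implementation: the regex tokenizer stands in for the segmentation step, and the function names (`segment_url`, `url_features`), the feature string formats and the example URL are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def segment_url(url):
    """Split a URL into lowercase letter-run and digit-run tokens per component.

    Assumption: a simple regex tokenizer stands in for the segmentation
    step; the source does not specify how chunks are found.
    """
    parts = urlsplit(url)
    components = {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }
    return {
        name: re.findall(r"[a-z]+|\d+", text.lower())
        for name, text in components.items()
    }


def url_features(url):
    """Return a set of string features (hypothetical naming scheme)."""
    features = set()
    for component, tokens in segment_url(url).items():
        for tok in tokens:
            # Plain token feature, independent of position in the URL.
            features.add(f"tok={tok}")
            # Component feature: the token tied to the URL part it occurs in.
            features.add(f"{component}:tok={tok}")
            # Orthographic feature: numeric tokens abstracted to their length.
            if tok.isdigit():
                features.add(f"{component}:numeric_len={len(tok)}")
        # Sequential feature: adjacent token pairs within one component.
        for left, right in zip(tokens, tokens[1:]):
            features.add(f"{component}:bigram={left}_{right}")
    return features


if __name__ == "__main__":
    example = "http://www.cs.cornell.edu/Info/Courses/Fall-96/cs415/"
    for feature in sorted(url_features(example)):
        print(feature)
```

Feature sets of this kind could then be passed as binary indicators to any supervised maximum entropy learner (equivalently, multinomial logistic regression); which learner and which exact features to use are left open here.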
