Paper Information - UnURL: Unsupervised Learning from URLs

Web pages are identified by their URLs. For authoritative web pages, that is, pages focused on a specific topic, webmasters tend to use URLs that summarize the page. URLs are well suited to clustering because they are short and ubiquitous, which makes techniques based on URL information alone orders of magnitude faster than those that also use the text content. We present a system that uses only URL information to perform clustering of web search result sets, clustering of general web document corpora, and topic identification of topical URL corpora. This research prototype, which we call UnURL, is, to the best of our knowledge, the first attempt at applying unsupervised machine learning techniques to URLs.
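To make the idea of clustering from URL text alone more concrete, here is a minimal Python sketch. It is not the UnURL algorithm, which the abstract does not specify. The tokenizer, the stop-token list, the Jaccard similarity measure, the greedy single-pass grouping, and the 0.2 threshold are all assumptions chosen purely for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical stop list: tokens that appear in most URLs and carry no topic signal.
STOP_TOKENS = {"http", "https", "www", "com", "org", "net",
               "html", "htm", "php", "index"}


def url_tokens(url):
    """Split a URL's host and path into a set of lowercase alphabetic tokens."""
    parsed = urlparse(url)
    text = (parsed.netloc + " " + parsed.path).lower()
    return {t for t in re.split(r"[^a-z]+", text)
            if len(t) > 1 and t not in STOP_TOKENS}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_urls(urls, threshold=0.2):
    """Greedy single-pass clustering (an illustrative stand-in, not UnURL).

    Each URL joins the existing cluster whose pooled tokens are most similar
    to its own, provided the similarity reaches `threshold`; otherwise it
    starts a new cluster.
    """
    clusters = []  # each entry: {"tokens": set of str, "urls": list of str}
    for url in urls:
        tokens = url_tokens(url)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(tokens, cluster["tokens"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"tokens": set(tokens), "urls": [url]})
        else:
            best["tokens"] |= tokens
            best["urls"].append(url)
    return [c["urls"] for c in clusters]


if __name__ == "__main__":
    sample = [
        "http://www.python.org/doc/tutorial.html",
        "http://docs.python.org/tutorial/index.html",
        "http://www.cricket.com/scores/india.html",
        "http://www.cricketnews.com/india/scores",
    ]
    for group in cluster_urls(sample):
        print(group)
```

Because only the URL strings are read, no page content has to be fetched or parsed, which is the source of the speed advantage the abstract describes. The sample URLs are made up for the example.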

Deepak Khemani | P Deepak
