UnURL: Unsupervised Learning from URLs

Web pages are identified by their URLs. For authoritative web pages, that is, pages focused on a specific topic, webmasters tend to use URLs that summarize the page's content. URLs are well suited to clustering because they are short and ubiquitous, which makes techniques based on URL information alone orders of magnitude faster than those that also use the text content. We present a system that uses only URL information to perform three tasks: clustering of web search result sets, clustering of general web document corpora, and topic identification for topical URL corpora. This research prototype, which we call UnURL, is, to the best of our knowledge, the first attempt at applying unsupervised machine learning techniques to URLs.