SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers

Abstract Crawlers, also called robots and spiders, are programs that browse the World Wide Web autonomously. This paper describes SPHINX, a Java toolkit and interactive development environment for Web crawlers. Unlike other crawler development systems, SPHINX is geared towards developing crawlers that are Web-site-specific, personally customized, and relocatable. SPHINX allows site-specific crawling rules to be encapsulated and reused in content analyzers, known as classifiers: Personal crawling tasks can be performed (often without programming) in the Crawler Workbench, an interactive environment for crawler development and testing. For efficiency, relocatable crawlers developed using SPHINX can be uploaded and executed on a remote Web server.