A Vertical Search Engine for School Information Based on Heritrix and Lucene
Web content is growing exponentially as Internet applications and services continue to develop rapidly. While people enjoy the convenience of the Internet, they face the problem of obtaining useful information from this vast content quickly and accurately. The immediate response to this problem is the Web search engine. We developed a vertical search engine for a specific domain, in this case a university. The search engine consists of a Crawler, an Indexer, and a Searcher. The crawler component is implemented with the Heritrix crawler, which works by recursive crawling and archiving. For the indexer component, a reusable, extensible subsystem for building and managing the index is designed and implemented with the open-source Lucene package. In an experiment on the Chungbuk National University web sites, the system retrieved, on average over a typical keyword set, more than 400 times as many documents as Google or the university's own search engine.
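The three-component design described above can be illustrated with a minimal sketch. It does not use Heritrix or Lucene and is not the authors' implementation: the in-memory `PAGES` table, the `example.edu` domain, and the functions `in_scope`, `crawl`, `build_index`, and `search` are all hypothetical names chosen for illustration. The sketch assumes a crawler that follows links only within the target domain and archives what it fetches, an indexer that builds a simple inverted index, and a searcher that returns pages containing every query term.

```python
from collections import defaultdict
from urllib.parse import urlparse
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://www.example.edu/": (
        "Welcome to the university admissions office",
        ["http://www.example.edu/cs", "http://other.example.com/"],
    ),
    "http://www.example.edu/cs": ("Computer science admissions and courses", []),
    "http://other.example.com/": ("Unrelated shopping page", []),
}

def in_scope(url, domain="example.edu"):
    """Keep the crawl vertical: accept only hosts inside the target domain."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

def crawl(seed):
    """Crawler: follow links from the seed, staying in scope, and archive pages."""
    seen, frontier, archive = set(), [seed], {}
    while frontier:
        url = frontier.pop()
        if url in seen or not in_scope(url) or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        archive[url] = text  # store the fetched content for the indexer
        frontier.extend(links)
    return archive

def build_index(archive):
    """Indexer: map each lowercase term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in archive.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    """Searcher: return the URLs that contain every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

archive = crawl("http://www.example.edu/")
index = build_index(archive)
hits = search(index, "admissions")
```

Restricting the crawl scope to one domain is what makes the engine vertical in this sketch: the off-domain page is never archived, so it can never appear in a result set.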