A Scalable Topic-Based Open Source Search Engine

Site-based and topic-specific search engines have had mixed success, both because information retrieval is difficult in general and because such collections lack the link information needed to identify authoritative pages. We advocate an open source approach to the problem because of its scope and the number of software components it requires. We have adopted a topic-based search engine because it represents the next generation of capability. This paper outlines our scalable system for site-based or topic-specific search and demonstrates the developing system on a small collection of 250,000 documents drawn from EU and UN web pages.
