Searching and Archiving the Web with Tumba

Internet archives and Web search engines have traditionally been conceived and implemented as independent information systems. This paper shows how tumba! supports both Web search and archival functions in a single information system. Such a system can serve a nation's interests in preserving its Web data as cultural heritage, in learning about the preferences and interests of its society in the information age, and in intelligence gathering. The tumba! search engine has a new repository architecture and uses innovative ranking and presentation algorithms optimised for this national Web.
