I/O-Conscious Data Preparation for Large-Scale Web Search Engines

Commercial search engines cover billions of web pages, so efficiently managing the disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. The technique greatly reduces the number of disk accesses and seeks by maximizing the temporal locality of data access and by organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. It is based on our experience building Yuntis, a functionally complete and fully operational web search engine. It is therefore particularly well suited to most data manipulation tasks in a modern web search engine, and it is employed throughout Yuntis. The key idea is coordinated partitioning of related data tables, together with corresponding partitioning and delayed, batched execution of the transformation and querying operations that work with the data. This partitioning of data and processing is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation.
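
The key idea can be illustrated with a minimal Python sketch. It is not the Yuntis implementation: the class `PartitionedTable`, its methods, the checksum-based partitioning function, and the in-memory list standing in for disk are all assumptions made for illustration. Operations are queued per partition rather than executed immediately, and a later flush reads each affected partition once, applies every queued operation to it while it is in memory, and writes it back once.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy key-value table split into partitions (hypothetical, for illustration).

    `disk` is an in-memory stand-in for external storage.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.disk = [dict() for _ in range(num_partitions)]  # simulated partition files
        self.pending = defaultdict(list)  # partition id -> queued (key, update) pairs
        self.partition_loads = 0          # counts simulated partition reads

    def partition_of(self, key):
        # Deterministic placement: all operations on a key land in one partition.
        return sum(key.encode("utf-8")) % self.num_partitions

    def defer(self, key, update):
        """Queue `update` (a function old_value -> new_value) instead of running it now."""
        self.pending[self.partition_of(key)].append((key, update))

    def flush(self):
        """Execute queued operations in batches, one partition at a time."""
        for pid in sorted(self.pending):
            partition = dict(self.disk[pid])  # one simulated sequential read
            self.partition_loads += 1
            for key, update in self.pending[pid]:
                partition[key] = update(partition.get(key))
            self.disk[pid] = partition        # one simulated sequential write
        self.pending.clear()

    def get(self, key):
        return self.disk[self.partition_of(key)].get(key)


def count_occurrences(table, keys):
    """Increment a per-key counter using deferred, batched updates."""
    for key in keys:
        table.defer(key, lambda old: (old or 0) + 1)
    table.flush()
```

In this sketch, five updates spread over three keys touch at most three partitions, each read and written once, whereas executing every update immediately could touch a partition once per update. A real system would additionally have to bound the size of the pending queues and spill them to disk.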
