Efficient Updates for Web-Scale Indexes over the Cloud

In this paper, we present a distributed system that enables fast and frequent updates on web-scale inverted indexes. The proposed update technique processes new or modified data incrementally and minimizes the changes required to the index, so that update time is substantially reduced and becomes independent of the existing index size. By using Hadoop MapReduce to parallelize the update operations and HBase to distribute the inverted index, we create a high-performance, fully distributed index creation and update system. To the best of our knowledge, this is the first open-source system that creates, updates and serves large-scale indexes in a distributed fashion. Experiments with over 23 million Wikipedia documents demonstrate the speed and robustness of our implementation: update time scales linearly with the size of the updates and the degree of change in the documents, and remains constant regardless of the size of the underlying index. Moreover, performance improves as more computational resources are added: the system incorporates a 15.4 GB update batch into a 64.2 GB indexed dataset in about 21 minutes using just 12 commodity nodes, 3.3 times faster than with two nodes.
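
As an illustration only, the idea of minimizing the changes made to the index can be sketched on a single machine. The sketch below is not the system's implementation, which relies on Hadoop MapReduce and HBase; it assumes an in-memory dictionary in place of the distributed index, a whitespace tokenizer, and a per-document record of previously indexed terms. The names `IncrementalIndex`, `tokenize`, `forward` and `postings` are hypothetical. When a document is added or modified, only the terms that were added or removed are written, so the work depends on the degree of change in the document and not on the size of the existing index.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real text analyzer."""
    return set(text.lower().split())


class IncrementalIndex:
    """Toy in-memory inverted index with diff-based updates (illustrative only)."""

    def __init__(self):
        # term -> set of document ids containing the term
        self.postings = defaultdict(set)
        # document id -> set of terms currently indexed for that document
        # (assumed bookkeeping that makes diffing possible)
        self.forward = {}

    def update(self, doc_id, text):
        """Index a new or modified document, touching only the changed terms.

        Returns the number of posting-list writes performed.
        """
        new_terms = tokenize(text)
        old_terms = self.forward.get(doc_id, set())

        added = new_terms - old_terms
        removed = old_terms - new_terms

        for term in added:
            self.postings[term].add(doc_id)
        for term in removed:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                # Drop empty posting lists so stale terms do not linger.
                del self.postings[term]

        self.forward[doc_id] = new_terms
        return len(added) + len(removed)

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

For example, indexing a document with the text "cloud index update" and then re-indexing it as "cloud index refresh" performs two posting-list writes on the second call (one removal and one addition), however many other documents the index already holds.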