COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System

We introduce COSTA (content-based search using term aggregation). In addition to the advantages it shares with other P2P-based information retrieval systems, COSTA has several characteristics that distinguish it from them. First, it uses an adaptive indexing scheme that dynamically identifies important terms. Important terms are indexed in a Chord-like ring, while the remaining terms are aggregated in a balanced tree. We argue that this architecture is more flexible and better suited to term indexing than DHT-based methods. Furthermore, this structure eliminates the need for global knowledge and thus avoids the difficulty of maintaining it. Second, term aggregation is useful not only for enhancing performance but also for improving search quality, because it yields term statistics. Traditional IR techniques such as query expansion can then be applied on the basis of these statistics. COSTA therefore tightly integrates distributed indexing with information retrieval. Third, advanced techniques such as node clustering, caching, and workload balancing are employed. We also show that further existing optimization techniques can be adopted to improve the system's performance.
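To make the two-tier idea concrete, the following is a minimal single-process sketch, not COSTA's actual implementation. It rests on several assumptions that the abstract does not state: a term is treated as "important" once it has been queried `promote_after` times, ring placement uses a SHA-1 hash with a successor lookup, and the balanced tree is replaced by a plain dictionary stand-in. All names (`AdaptiveTermIndex`, `ring_id`, `promote_after`, `RING_BITS`) are hypothetical.

```python
import hashlib
from bisect import bisect_left
from collections import Counter, defaultdict

RING_BITS = 16  # hypothetical size of the ring's identifier space


def ring_id(key: str) -> int:
    """Map a node name or term onto the ring's identifier space."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class AdaptiveTermIndex:
    def __init__(self, node_names, promote_after=3):
        # Assumption: importance is decided by a query-count threshold.
        self.promote_after = promote_after
        self.ring = sorted((ring_id(n), n) for n in node_names)
        # node name -> term -> document ids (postings held on the ring)
        self.ring_postings = defaultdict(lambda: defaultdict(set))
        # term -> document ids; a dict stand-in for the aggregation tree
        self.aggregated = defaultdict(set)
        self.query_counts = Counter()

    def successor(self, term):
        """Return the first node at or after the term's ring position."""
        ids = [node_id for node_id, _ in self.ring]
        pos = bisect_left(ids, ring_id(term)) % len(self.ring)
        return self.ring[pos][1]

    def is_important(self, term):
        return self.query_counts[term] >= self.promote_after

    def add(self, term, doc_id):
        """Index a posting on the ring if the term is important, else aggregate it."""
        if self.is_important(term):
            self.ring_postings[self.successor(term)][term].add(doc_id)
        else:
            self.aggregated[term].add(doc_id)

    def search(self, term):
        """Record the query, promote the term if it crossed the threshold, and answer."""
        self.query_counts[term] += 1
        if self.is_important(term):
            node = self.successor(term)
            if term in self.aggregated:
                # Move the postings from the aggregated tier to the ring.
                self.ring_postings[node][term] |= self.aggregated.pop(term)
            return set(self.ring_postings[node][term])
        return set(self.aggregated.get(term, set()))
```

In this sketch a term starts in the aggregated tier and migrates to the ring once it has been queried often enough, so the set of ring-indexed terms adapts to the workload without any node needing a global view of term popularity.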