We introduce COSTA, a system for content-based search using term aggregation. In addition to the advantages it shares with other P2P-based information retrieval systems, COSTA has several distinguishing characteristics. First, it uses an adaptive indexing scheme that dynamically identifies important terms. Important terms are indexed in a Chord-like ring, while the remaining terms are aggregated in a balanced tree. We argue that this architecture is more flexible and better suited to term indexing than DHT-based methods. Furthermore, the structure eliminates the need to maintain global knowledge, and with it the difficulty of maintaining such knowledge. Second, term aggregation is useful not only for improving performance but also for improving search quality, because the aggregation yields term statistics. Traditional IR techniques such as query expansion can be applied on the basis of these statistics; in this way, COSTA tightly integrates distributed indexing with information retrieval. Finally, advanced techniques such as node clustering, caching, and load balancing are employed. We also show that further existing optimization techniques can be adopted to improve the system's performance.