A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval

The exponential growth of data demands scalable and adaptable infrastructures for indexing and searching a huge number of data sources with high accuracy and efficiency. Existing centralized search engines are not scalable and suffer from single points of failure. Recent work on peer-to-peer (P2P) index construction partitions the document vectors either randomly or statically, making it difficult to trade off search efficiency against accuracy. In this position paper, we propose a P2P information retrieval (IR) framework, termed P2PIR, that leverages a novel two-phase distributed semantic indexing scheme on top of distributed hash tables (DHTs). The distributed semantic clustering of P2PIR leads to good semantic locality in index placement, so that the indices of similar documents are placed together or near each other. This semantic locality enables a smoother tradeoff between search accuracy and efficiency, as well as incremental adaptation to changes in documents and semantics. In addition, P2PIR allows for sophisticated retrieval techniques, e.g., query refinement, feedback, and personalized search, for better usability. A prototype of P2PIR is currently under development; it can be applied to general web retrieval and to domain-specific applications such as a distributed electronic medical records system.
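The contrast between random placement and semantically local placement can be illustrated with a small sketch. It is not the P2PIR algorithm, whose two-phase indexing is not detailed here; it only assumes that documents are represented as term vectors and that each node is responsible for one cluster centroid. The function names, the toy vectors, and the centroids are all hypothetical.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def random_placement(doc_id, num_nodes):
    """Hash the document id to a node, ignoring content (no semantic locality)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def semantic_placement(vector, centroids):
    """Place the index entry on the node whose centroid is most similar."""
    return max(range(len(centroids)), key=lambda i: cosine(vector, centroids[i]))


# Hypothetical setup: three nodes, each owning one cluster centroid.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
docs = {
    "doc-a": [0.9, 0.1, 0.0],  # similar to doc-b
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],  # dissimilar to both
}

semantic = {d: semantic_placement(v, centroids) for d, v in docs.items()}
hashed = {d: random_placement(d, len(centroids)) for d in docs}

print("semantic placement:", semantic)
print("hashed placement:  ", hashed)
```

Under semantic placement the two similar documents land on the same node, so a query near them can be answered by contacting that node and, if higher accuracy is needed, its neighbors. Hashed placement offers no such guarantee, since the node depends only on the document id.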
