Clustering documents into classes is an important task in many Information Retrieval (IR) systems. The resulting grouping allows the contents of a document collection to be described in terms of the classes its documents fall into. The compactness of such a description is even more desirable when the document collection is spread across different computers and locations: document classes can then be used to describe each partial document collection in a conveniently short form that can easily be exchanged with other nodes on the network. Unfortunately, most clustering schemes cannot easily be distributed. Additionally, the costs of transferring all data to a central clustering service are prohibitive in large-scale systems. In this paper, we introduce an approach that is capable of classifying documents that are distributed across a Peer-to-Peer (P2P) network. We present measurements taken on a P2P network using synthetic and real-world data sets.

1 Motivation

In P2P IR systems and Distributed IR systems, an important problem is the routing of queries, or the selection of sources that might contain relevant documents. Some approaches in this respect are based on compact representations of adjacent peers' document collections or of the documents maintained on the reachable source nodes. In this context, an interesting approach is to determine a consistent clustering of all documents in the network, and to acquire knowledge about how many documents on each node fall into each cluster. Then, for a given query, the querying node can use this compact representation of other peers' collections to identify the most promising clusters; the query can then be routed precisely to nodes potentially containing relevant documents.
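
The routing idea described above can be illustrated with a minimal sketch. It is not the approach introduced in this paper: it assumes that a consistent set of cluster centroids is already known to every node, that each peer advertises one document count per cluster, and that cosine similarity is used to match a query to clusters. All names (`rank_peers`, `peer_counts`, `top_clusters`) and all numbers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_peers(query, centroids, peer_counts, top_clusters=1):
    """Rank peers by how many documents they hold in the clusters
    that are most similar to the query.

    query        -- term-weight vector of the query
    centroids    -- one vector per cluster (assumed identical on all nodes)
    peer_counts  -- {peer id: [document count per cluster]} (assumed format)
    top_clusters -- number of most promising clusters to consider
    """
    sims = [cosine(query, c) for c in centroids]
    # Indices of the clusters most similar to the query.
    best = sorted(range(len(centroids)), key=lambda i: sims[i], reverse=True)
    best = best[:top_clusters]
    # A peer's score is its document count in the selected clusters.
    scores = {peer: sum(counts[i] for i in best)
              for peer, counts in peer_counts.items()}
    # Drop peers with no documents in those clusters; break ties by peer id.
    candidates = [p for p, s in scores.items() if s > 0]
    return sorted(candidates, key=lambda p: (-scores[p], p))


# Hypothetical example: three clusters over a three-term vocabulary.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
peer_counts = {
    "peer_a": [40, 2, 0],
    "peer_b": [0, 35, 5],
    "peer_c": [10, 0, 50],
}
query = [0.9, 0.1, 0.0]

print(rank_peers(query, centroids, peer_counts))
```

The per-peer count vector is the compact representation: its size depends only on the number of clusters, not on the number of documents a peer holds, which is what makes it cheap to exchange between nodes.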