Secured distributed document clustering & keyphrase extraction algorithm in structured Peer to Peer networks

A secured Hierarchically Distributed Peer-to-Peer Clustering (HDP2PC) architecture and algorithm is used to overcome the scalability problem in structured peer-to-peer networks. The architecture is based on a multilayer overlay network of peer neighbourhoods and can incorporate any number of layers of nodes. Supernodes, which act as representatives of their neighbourhoods, are iteratively grouped to form higher-level neighbourhoods. At each level of the hierarchy, peers cooperate within their respective neighbourhoods to perform P2P clustering. A novel approach is proposed for indexing documents into the nodes of the hierarchy: a hashing mechanism indexes the documents, and a number of filters are applied as parameters, reducing the number of comparisons required to extract keyphrases. A distributed keyphrase extraction algorithm extracts patterns by interpreting the clusters stored on neighbouring workstations. Queries can also be applied to loosely structured formats. Speedup is obtained by tuning the neighbourhood size and hierarchy height parameters. Privacy is preserved for the data inside the peers because no data is shared between peer nodes, and security can be enforced in the peers while clustering is performed.
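The layered grouping of supernodes and the hash-based document indexing described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `build_hierarchy` and `home_peer`, the rule that the first member of a neighbourhood acts as its supernode, and the use of SHA-1 modulo the peer count are all assumptions made for illustration only.

```python
import hashlib


def build_hierarchy(peers, neighbourhood_size):
    """Group peers into neighbourhoods, then group their supernodes, level by level.

    Returns a list of levels; each level is a list of neighbourhoods,
    and each neighbourhood is a list of peer ids.
    """
    if not peers:
        raise ValueError("at least one peer is required")
    if neighbourhood_size < 2:
        raise ValueError("neighbourhood_size must be at least 2")

    levels = []
    current = list(peers)
    while True:
        # Partition the current layer into fixed-size neighbourhoods.
        neighbourhoods = [
            current[i:i + neighbourhood_size]
            for i in range(0, len(current), neighbourhood_size)
        ]
        levels.append(neighbourhoods)
        if len(neighbourhoods) == 1:
            break  # a single root neighbourhood has been reached
        # Assumption: the first member of each neighbourhood is its supernode.
        current = [members[0] for members in neighbourhoods]
    return levels


def home_peer(doc_id, peers):
    """Map a document id to a peer with a hash (hypothetical indexing rule)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return peers[int.from_bytes(digest, "big") % len(peers)]


peers = [f"p{i}" for i in range(8)]
levels = build_hierarchy(peers, neighbourhood_size=2)
for height, neighbourhoods in enumerate(levels):
    print(height, neighbourhoods)
print(home_peer("doc-42", peers))
```

In this sketch, a smaller `neighbourhood_size` produces a taller hierarchy and a larger one produces a flatter hierarchy, which corresponds to the neighbourhood size and height trade-off mentioned in the abstract.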
