A local asynchronous distributed privacy preserving feature selection algorithm for large peer-to-peer networks

In this paper we develop a local, distributed, privacy-preserving algorithm for feature selection in a large peer-to-peer environment. Feature selection is often used in machine learning for data compaction and efficient learning, because it mitigates the curse of dimensionality. Many solutions exist for feature selection when the data are held at a central location. However, the task becomes extremely challenging when the data are distributed across a large number of peers or machines. Centralizing the entire dataset, or even portions of it, can be very costly and impractical because of the large number of data sources, the asynchronous nature of peer-to-peer networks, the dynamic nature of the data and the network, and privacy concerns. The solution proposed in this paper performs feature selection asynchronously and with low communication overhead, and each peer can specify its own privacy constraints. The algorithm relies only on local interactions among participating nodes. We evaluate the performance of the proposed algorithm on a real-world dataset.
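
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of feature selection through local, asynchronous interactions. It assumes pairwise gossip averaging of per-feature scores followed by a local top-k choice, which is a generic technique and not necessarily the method of the paper. The function name, the score vectors, and the ring topology are hypothetical, and privacy constraints are not modeled here.

```python
import random

def gossip_feature_selection(local_scores, neighbors, k, rounds=2000, seed=0):
    """Pairwise gossip averaging of per-feature scores, then a local top-k pick.

    local_scores: dict peer -> list of per-feature scores (hypothetical input)
    neighbors:    dict peer -> list of adjacent peers (connected, undirected)
    """
    rng = random.Random(seed)
    # Each peer starts from its own local scores and never sees raw data of others.
    est = {p: list(s) for p, s in local_scores.items()}
    peers = sorted(est)
    for _ in range(rounds):
        p = rng.choice(peers)         # one peer wakes up; no global clock is assumed
        q = rng.choice(neighbors[p])  # it contacts a single neighbor
        avg = [(a + b) / 2 for a, b in zip(est[p], est[q])]
        est[p], est[q] = avg, list(avg)  # both adopt the pairwise average
    # Each peer ranks features using only its own converged estimate.
    return {
        p: sorted(sorted(range(len(v)), key=lambda j: -v[j])[:k])
        for p, v in est.items()
    }

# Hypothetical example: 4 peers on a ring, 3 features, keep the best 2.
scores = {0: [9, 0, 1], 1: [1, 0, 8], 2: [0, 1, 9], 3: [8, 1, 0]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
selected = gossip_feature_selection(scores, ring, k=2)
print(selected)
```

In this toy setup the network-wide average scores are 4.5, 0.5, and 4.5, so every peer ends up keeping features 0 and 2, even though peers 2 and 3 would each have kept feature 1 if they had ranked their local scores alone. Each exchange involves only two neighboring peers, which is what keeps the communication local.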
