Privacy-preserving feature selection: A survey and proposing a new set of protocols

Feature selection is the process of sieving features, in which informative features are separated from redundant and irrelevant ones. It plays an important role in machine learning, data mining, and bioinformatics. However, traditional feature selection methods can only process centralized datasets and cannot meet today's distributed data processing needs. These needs call for a new category of data processing algorithms, privacy-preserving feature selection, which protects users' data by revealing no part of it, either during intermediate processing or in the final results. This is vital for datasets that contain individuals' data, such as medical datasets. It is therefore reasonable to either modify existing algorithms or propose new ones that can be applied to distributed datasets while also handling users' data responsibly by protecting their privacy. In this paper, we review three privacy-preserving feature selection methods and suggest improvements wherever a gap is identified. We also propose a privacy-preserving feature selection method based on rough set feature selection. The proposed method can process both horizontally and vertically partitioned datasets in two-party and multi-party scenarios.
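
To make the terms in the abstract concrete, the following is a minimal sketch of the standard rough-set dependency degree (the fraction of records whose condition-attribute equivalence class maps to a single decision value), together with what horizontal and vertical partitioning of a table look like. The toy table, the function name `dependency_degree`, and the two-party split are assumptions made for illustration only. The sketch includes no privacy protection and is not the protocol proposed in the paper.

```python
from collections import defaultdict

def dependency_degree(rows, cond_idx, dec_idx):
    """Rough-set dependency degree: share of rows whose equivalence class
    (under the chosen condition attributes) has exactly one decision value."""
    decisions = defaultdict(set)   # equivalence class -> decision values seen
    sizes = defaultdict(int)       # equivalence class -> number of rows
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        decisions[key].add(row[dec_idx])
        sizes[key] += 1
    positive = sum(sizes[k] for k, d in decisions.items() if len(d) == 1)
    return positive / len(rows)

# Hypothetical toy table; columns are (a, b, decision).
table = [
    (0, 0, "no"),
    (0, 1, "yes"),
    (1, 0, "yes"),
    (1, 1, "yes"),
    (0, 0, "no"),
    (0, 1, "no"),
]

gamma_a = dependency_degree(table, [0], 2)      # attribute a alone
gamma_ab = dependency_degree(table, [0, 1], 2)  # attributes a and b together

# Horizontal partitioning: each party holds all columns for some of the rows.
horizontal = [table[:3], table[3:]]

# Vertical partitioning: each party holds some columns for all rows,
# linked here by a shared row id (the first tuple element).
vertical_a = [(i, row[0]) for i, row in enumerate(table)]
vertical_b = [(i, row[1], row[2]) for i, row in enumerate(table)]

print(gamma_a, gamma_ab)
```

In this centralized form, every value is visible to whoever runs the computation. A privacy-preserving variant would need the parties to obtain the same dependency values without exchanging their raw partitions.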
