Data streams and privacy: Two emerging issues in data classification

Many real-world applications generate data streams in which each instance can be examined only briefly. Effective classification of such data streams is an emerging issue in data mining. However, such classification can pose severe threats to privacy. In several applications, such as credit card fraud detection, disease outbreak or biological attack detection, and loan approval, the data is homogeneously distributed among different parties. These parties may wish to collaboratively build a classifier to obtain certain global patterns but are reluctant to disclose their private data. Privacy-preserving classification of such homogeneously distributed data is also a challenging issue. In this paper, we present a brief review of the work carried out in data stream classification and in privacy-preserving classification of homogeneously distributed data, followed by an empirical evaluation and performance comparison of some methods in both areas. We also propose and evaluate an approach that creates an ensemble of anonymous decision trees to classify homogeneously distributed data in a privacy-preserving manner. We further identify the need for efficient methods for privacy-preserving classification of homogeneously distributed data streams and propose a suitable approach for this purpose.
