ClustCrypt: Privacy-Preserving Clustering of Unstructured Big Data in the Cloud

Security and confidentiality of big data stored in the cloud are major concerns for many organizations considering cloud services. One common way to address these concerns is client-side encryption, where data is encrypted on the client machine before being stored in the cloud. Storing only encrypted data in the cloud, however, limits the ability to cluster the data, which is a crucial part of many data analytics applications, such as search systems. To overcome this limitation, we present ClustCrypt, an approach for efficient topic-based clustering of encrypted unstructured big data in the cloud. ClustCrypt dynamically estimates the optimal number of clusters based on the statistical characteristics of the encrypted data. It also provides a clustering approach for encrypted data. We deploy ClustCrypt within the context of a secure cloud-based semantic search system (S3BD). Experimental results from evaluating ClustCrypt on three datasets demonstrate an average 60% improvement in cluster coherency. ClustCrypt also decreases the search-time overhead by up to 78% and increases the accuracy of search results by up to 35%.
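To illustrate why statistics can still be gathered from client-side encrypted text, the following minimal sketch assumes that each keyword is replaced by a deterministic keyed token before upload. This is an assumption made for illustration only; it is not ClustCrypt's algorithm, and the names `encrypt_token`, `encrypt_document`, `token_statistics`, and `jaccard` are hypothetical. HMAC-SHA256 stands in for whatever cipher a real system would use.

```python
import hashlib
import hmac
from collections import Counter


def encrypt_token(key: bytes, token: str) -> str:
    """Map a keyword to a deterministic keyed token (illustrative stand-in for a cipher)."""
    return hmac.new(key, token.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def encrypt_document(key: bytes, text: str) -> list[str]:
    """Client side: tokenize on whitespace and encrypt every keyword."""
    return [encrypt_token(key, word) for word in text.split()]


def token_statistics(encrypted_docs: list[list[str]]) -> Counter:
    """Cloud side: document frequency per encrypted token, computed without the key."""
    frequency = Counter()
    for doc in encrypted_docs:
        frequency.update(set(doc))
    return frequency


def jaccard(doc_a: list[str], doc_b: list[str]) -> float:
    """Cloud side: overlap of two encrypted documents, usable as a clustering signal."""
    a, b = set(doc_a), set(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical toy corpus; the key never leaves the client.
key = b"client-side secret"
docs = ["cloud storage security", "cloud search security", "genome clustering method"]
encrypted = [encrypt_document(key, d) for d in docs]
stats = token_statistics(encrypted)
```

Because equal keywords map to equal tokens, frequency and co-occurrence counts survive encryption, which is the kind of statistical characteristic a clustering method could build on. Deterministic tokens also leak frequency information, so a real design would need to weigh that trade-off.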
