Privacy-preserving Cox regression for survival analysis

Privacy-preserving data mining (PPDM) is an emerging research area concerned with incorporating privacy requirements into data mining techniques. In this paper we propose a privacy-preserving (PP) Cox model for survival analysis and consider a real clinical setting in which the data are horizontally distributed among different institutions. The proposed model linearly projects the data to a lower-dimensional space through an optimal mapping obtained by solving a linear programming problem. Our approach differs from the commonly used random projection approach in that it finds a projection that is optimal at preserving the properties of the data that matter for the specific problem at hand. Because the proposed approach produces a sparse mapping, the resulting PP mapping not only projects the data to a lower-dimensional space but also depends on a smaller subset of the original features (that is, it provides explicit feature selection). Real data from several European healthcare institutions are used to test our model for survival prediction of non-small-cell lung cancer patients. The results are confirmed on publicly available benchmark datasets. Our experiments show that the model achieves near-optimal performance without directly sharing data across the different data sources, which makes it possible to conduct large-scale multi-centric survival analysis without violating privacy requirements.
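
The data flow described above can be illustrated with a minimal sketch. It assumes that every institution already holds the same sparse projection matrix, that only projected records and survival outcomes are pooled, and that a standard Cox partial likelihood is then evaluated on the pooled projections. The matrix `B`, the synthetic data, and the function names are all hypothetical; in particular, `B` is a hand-written placeholder and not the solution of the linear program proposed in the paper, and the sketch makes no claim about the privacy guarantees of the actual method.

```python
import numpy as np

def project(X, B):
    """Map local records X (n x d) into the shared low-dimensional space (n x k)."""
    return X @ B

def neg_log_partial_likelihood(beta, Z, time, event):
    """Cox negative log partial likelihood; assumes no tied event times."""
    order = np.argsort(-time)                # sort subjects by descending time
    eta = (Z @ beta)[order]
    ev = event[order]
    # With descending times, the risk set of subject i is subjects 0..i,
    # so log(sum of exp(eta) over the risk set) is a running log-sum-exp.
    log_risk = np.logaddexp.accumulate(eta)
    return -np.sum((eta - log_risk)[ev == 1])

rng = np.random.default_rng(0)
d, k = 5, 2

# Placeholder sparse mapping (assumption): rows 3 and 4 are all zero, so
# those two original features are never used (explicit feature selection).
B = np.zeros((d, k))
B[0, 0], B[1, 0], B[2, 1] = 1.0, 0.5, -1.0

# Two institutions holding different patients with the same features
# (horizontal partitioning); all values here are synthetic.
X_a, X_b = rng.normal(size=(30, d)), rng.normal(size=(20, d))
t_a, t_b = rng.exponential(size=30), rng.exponential(size=20)
e_a, e_b = rng.integers(0, 2, size=30), rng.integers(0, 2, size=20)

# Each site shares only its projected records and outcomes, not X_a or X_b.
Z = np.vstack([project(X_a, B), project(X_b, B)])
t = np.concatenate([t_a, t_b])
e = np.concatenate([e_a, e_b])

beta = np.array([0.3, -0.2])
print(neg_log_partial_likelihood(beta, Z, t, e))
```

In practice the pooled objective would be minimized over `beta` with a numerical optimizer; the sketch only evaluates it at one point to keep the example short.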
