COINSTAC: A Privacy Enabled Model and Prototype for Leveraging and Processing Decentralized Brain Imaging Data

The field of neuroimaging has embraced the need for sharing and collaboration. Data sharing mandates from public funding agencies and major journal publishers have spurred the development of data repositories and neuroinformatics consortia. However, efficient and effective data sharing still faces several hurdles. Open data sharing is on the rise, but it is not suitable for sensitive data that are not easily shared, such as genetic data. Current approaches can be cumbersome, for example when they require negotiating multiple data sharing agreements. There are also significant data transfer, organization, and computational challenges. Centralized repositories only partially address these issues. We propose a dynamic, decentralized platform for large-scale analyses called the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC). The COINSTAC solution can include data missing from central repositories, allows pooling of both open and “closed” repositories by developing privacy-preserving versions of widely used algorithms, and incorporates the tools within an easy-to-use platform enabling distributed computation. We present an initial prototype system, which we demonstrate on two multi-site data sets without aggregating the data. In addition, by iterating across sites, the COINSTAC model enables meta-analytic solutions to converge to “pooled-data” solutions (i.e., as if the entire data set were in hand). More advanced approaches such as feature generation, matrix factorization models, and preprocessing can be incorporated into such a model. In sum, COINSTAC enables access to the many currently unavailable data sets, provides a user-friendly, privacy-enabled interface for decentralized analysis, and offers a powerful solution that complements existing data sharing solutions.
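The claim that iterating across sites lets a decentralized analysis converge to the "pooled-data" solution can be illustrated with a small sketch. The code below is not the COINSTAC implementation; it is a minimal example assuming a linear regression fitted by gradient descent, in which each site shares only the gradient computed on its own data and a central aggregator combines them. All names (`local_gradient`, `decentralized_fit`, `make_site`, `pooled_least_squares`) and the synthetic data are hypothetical, and no privacy-preserving noise is added.

```python
import random

def local_gradient(w, X, y):
    """Squared-error gradient summed over one site's samples.
    Only this vector leaves the site; the raw rows of X and y do not."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
        for j, xj in enumerate(xi):
            grad[j] += err * xj
    return grad

def decentralized_fit(sites, dim, lr=0.1, iters=2000):
    """Aggregator loop: broadcast w, collect per-site gradients, take one step."""
    n_total = sum(len(y) for _, y in sites)
    w = [0.0] * dim
    for _ in range(iters):
        grads = [local_gradient(w, X, y) for X, y in sites]
        mean_grad = [sum(g[j] for g in grads) / n_total for j in range(dim)]
        w = [wj - lr * gj for wj, gj in zip(w, mean_grad)]
    return w

def make_site(n, rng, true_w=(2.0, -1.0)):
    """Synthetic site data (an assumption for illustration): y = 2 - x + noise."""
    X = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(n)]
    y = [true_w[0] * x[0] + true_w[1] * x[1] + rng.gauss(0.0, 0.1) for x in X]
    return X, y

def pooled_least_squares(X, y):
    """Closed-form fit of intercept and slope, as if all data were in one place."""
    n = len(y)
    sx = sum(x[1] for x in X)
    sxx = sum(x[1] ** 2 for x in X)
    sy = sum(y)
    sxy = sum(x[1] * yi for x, yi in zip(X, y))
    det = n * sxx - sx ** 2
    return [(sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det]

rng = random.Random(0)
sites = [make_site(n, rng) for n in (30, 50, 20)]  # three sites of unequal size

w_decentralized = decentralized_fit(sites, dim=2)

# Reference only: pool the data to check what the decentralized fit converges to.
pooled_X = [x for X, _ in sites for x in X]
pooled_y = [v for _, y in sites for v in y]
w_pooled = pooled_least_squares(pooled_X, pooled_y)

print("decentralized:", w_decentralized)
print("pooled:       ", w_pooled)
```

Because the sum of per-site gradients equals the gradient on the pooled data, the two estimates agree up to numerical precision in this setting. A one-shot meta-analysis that averages independently fitted site models would generally not match the pooled fit exactly, which is the gap that iteration closes.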
