Eliminating accidental deviations to minimize generalization error: applications in connectomics and genomics

Abstract The cost of data collection and processing is becoming prohibitive for many research groups across disciplines, a problem exacerbated by the dependence on ever larger sample sizes to obtain reliable inferences for increasingly subtle questions. And yet, as more data become available and open access, more researchers wish to analyze them for different questions, often including previously unforeseen ones. To further increase sample sizes, existing datasets are often amalgamated. These reference datasets (datasets that serve to answer many disparate questions for different individuals) are increasingly common and important. Reference pipelines efficiently and flexibly analyze all of these datasets. How can one optimally design these reference datasets and pipelines to yield derivative data that are simultaneously useful for many different tasks? We propose an approach to experimental design that leverages multiple measurements for each distinct item (for example, an individual). The key insight is that each measurement of an item should be more similar to other measurements of that same item than to measurements of any other item. In other words, we seek to optimally discriminate one item from another. We formalize the notion of discriminability, and introduce both a nonparametric and a parametric statistic to quantify the discriminability of potentially multivariate or non-Euclidean datasets. With this notion, one can make optimal decisions, with regard to either the acquisition or the analysis of data, by maximizing discriminability. Crucially, this optimization can be performed in the absence of any task-specific (or supervised) information. We show that optimizing decisions with respect to discriminability yields improved performance on subsequent inference tasks.
We apply this strategy to a brain imaging dataset built by the “Consortium for Reliability and Reproducibility”, which consists of 24 disparate magnetic resonance imaging datasets, each with up to hundreds of individuals who were imaged multiple times. We show that by optimizing pipelines with respect to discriminability, we improve performance on multiple subsequent inference tasks, even though discriminability uses no information about those tasks.
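The key insight above can be illustrated with a minimal Python sketch. It is one plausible nonparametric estimator, not necessarily the exact statistic defined in the paper: for every pair of repeated measurements of the same item, it records the fraction of measurements of other items that lie strictly farther away, then averages those fractions. The function names `discriminability` and `pick_pipeline`, the choice of Euclidean distance, and the dictionary-of-pipelines input are all assumptions made for illustration.

```python
import numpy as np


def discriminability(X, ids):
    """Illustrative (hypothetical) sample discriminability.

    X   : array-like, shape (n_measurements, n_features)
    ids : array-like, shape (n_measurements,), item label per measurement

    Returns the average, over all ordered same-item pairs (i, j), of the
    fraction of other-item measurements k with dist(i, k) > dist(i, j).
    """
    X = np.asarray(X, dtype=float)
    ids = np.asarray(ids)

    # Pairwise Euclidean distances (an assumption; any distance could be used).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    total, count = 0.0, 0
    for i in range(len(ids)):
        same = ids == ids[i]
        same[i] = False              # exclude the measurement itself
        other = ids != ids[i]
        if not same.any() or not other.any():
            continue                 # needs a repeat and at least one other item
        for j in np.flatnonzero(same):
            total += np.mean(D[i, other] > D[i, j])
            count += 1

    if count == 0:
        raise ValueError("need repeated measurements and at least two items")
    return total / count


def pick_pipeline(outputs, ids):
    """Return the name of the (hypothetical) pipeline whose output
    maximizes discriminability; no task labels are used."""
    return max(outputs, key=lambda name: discriminability(outputs[name], ids))
```

A value of 1 means every repeated measurement is closer to its own item than to any other item, while values near 0.5 suggest the measurements carry little item-specific signal. Calling `pick_pipeline` on the outputs of several candidate pipelines mirrors the unsupervised selection described in the abstract, since only the item labels are required.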
