High-sensitivity pattern discovery in large, paired multiomic datasets

Modern biological screens yield enormous numbers of measurements, and identifying and interpreting statistically significant associations among features is essential. Here, we present a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for structured association discovery between paired high-dimensional datasets. HAllA efficiently integrates hierarchical hypothesis testing with false discovery rate correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data. We optimized and evaluated HAllA using heterogeneous synthetic datasets of known association structure, where HAllA outperformed all-against-all and other block testing approaches across a range of common similarity measures. We then applied HAllA to a series of real-world multi-omics datasets, revealing new associations between gene expression and host immune activity, the microbiome and host transcriptome, metabolomic profiling, and human health phenotypes. An open-source implementation of HAllA is freely available at http://huttenhower.sph.harvard.edu/halla along with documentation, demo datasets, and a user group.

Author Summary

Modern scientific datasets increasingly include multiple measurements of many complementary data types. Here, we present HAllA, a method and implementation that overcomes the statistical challenges presented by data of this type by using feature similarity within each dataset to find statistically significant groups of features between them. We applied HAllA to simulated and real datasets, showing that HAllA outperformed existing procedures and identified compelling biological relationships. HAllA is widely applicable to diverse data structures and presents the user with grouped results that are easier to interpret than traditional methods.
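To make the comparison point concrete, the following is a minimal sketch of the naive all-against-all baseline that the abstract contrasts HAllA with: every feature in one dataset is tested against every feature in the other, and the resulting p-values are corrected with the Benjamini-Hochberg false discovery rate procedure. It is not HAllA's hierarchical block-testing algorithm and does not use the HAllA package. The function names, the choice of Pearson correlation with a permutation test, and the dict-of-lists data layout are all illustrative assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def permutation_pvalue(x, y, n_perm=199, rng=None):
    """Two-sided permutation p-value for |Pearson r| (assumed similarity measure)."""
    rng = rng or random.Random(0)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the sample pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed statistic, so the p-value is never 0.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank  # step-up: keep the largest passing rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def all_against_all(X, Y, alpha=0.05, n_perm=199, seed=0):
    """Test every feature of X against every feature of Y, then apply BH.

    X and Y are hypothetical dicts mapping feature name -> list of values,
    with samples in the same order in both datasets (paired data).
    """
    rng = random.Random(seed)
    pairs = [(xn, yn) for xn in X for yn in Y]
    pvals = [permutation_pvalue(X[xn], Y[yn], n_perm, rng) for xn, yn in pairs]
    keep = benjamini_hochberg(pvals, alpha)
    return [(xn, yn, p) for (xn, yn), p, k in zip(pairs, pvals, keep) if k]
```

Because this baseline runs one test per feature pair, the number of hypotheses grows as the product of the two feature counts, and the correction becomes correspondingly strict. That loss of power is the motivation the abstract gives for testing blocks of similar features hierarchically instead.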
