Statistical Enrichment Analysis of Samples: A General-Purpose Tool to Annotate Metadata Neighborhoods of Biological Samples

Unsupervised learning techniques, such as clustering and embedding, are increasingly used to group biomedical samples based on high-dimensional biomedical data. However, identifying the clinical data or sample metadata shared among the biomedical samples of a given biological condition remains a major challenge. Here, we describe an analytical method called Statistical Enrichment Analysis of Samples (SEAS) for interpreting clustered or embedded sample data from omics studies. The method derives its power from focusing on sample sets, i.e., groups of biological samples constructed for various purposes, e.g., by manual curation of samples sharing specific characteristics or as clusters generated automatically by embedding sample omic profiles from a multi-dimensional omics space. The samples in a sample set share common clinical measurements, which we refer to as “clinotypes,” such as age group, gender, treatment status, or survival days. We demonstrate how SEAS yields insights into biological data sets using glioblastoma (GBM) samples. Notably, when analyzing combined data from The Cancer Genome Atlas (TCGA) and patient-derived xenograft (PDX) models, SEAS makes it possible to approximate the different clinical outcomes of radiotherapy-treated PDX samples, a problem that other tools have not solved. These results suggest that SEAS may support clinical decision-making. The SEAS tool is freely and publicly available at https://aimed-lab.shinyapps.io/SEAS/.
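
The abstract does not state which statistical test SEAS applies. As an illustrative sketch only, assuming a one-sided hypergeometric test for a categorical clinotype, the enrichment of a clinotype value in a sample set could be scored as follows. The function name `hypergeom_enrichment_p` and the example counts are hypothetical and are not taken from the SEAS software.

```python
from math import comb


def hypergeom_enrichment_p(total_samples, total_with_value, set_size, set_with_value):
    """One-sided hypergeometric p-value, P(X >= set_with_value).

    total_samples    -- number of samples in the background population
    total_with_value -- background samples carrying the clinotype value
    set_size         -- number of samples in the sample set (e.g., a cluster)
    set_with_value   -- samples in the set carrying the clinotype value
    """
    upper = min(total_with_value, set_size)
    denominator = comb(total_samples, set_size)
    # Sum the probability mass of observing at least `set_with_value` hits.
    numerator = sum(
        comb(total_with_value, i)
        * comb(total_samples - total_with_value, set_size - i)
        for i in range(set_with_value, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: 100 samples overall, 20 of them treated;
# a cluster of 10 samples contains 6 treated samples.
p_value = hypergeom_enrichment_p(100, 20, 10, 6)
print(f"enrichment p-value: {p_value:.4g}")
```

A small p-value here would indicate that the clinotype value is over-represented in the sample set relative to the background. When many clinotypes are tested, a multiple-testing correction would normally be applied to the resulting p-values.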
