Paper information - TDAstats: R pipeline for computing persistent homology in topological data analysis

Summary

High-dimensional datasets are becoming more common in a variety of scientific fields. Well-known examples include next-generation sequencing in biology, patient health status in medicine, and computer vision in deep learning. Dimension reduction, using methods like principal component analysis (PCA), is a common preprocessing step for such datasets. However, while dimension reduction can save computing and human resources, it comes at the cost of significant information loss. Topological data analysis (TDA) aims to analyze the "shape" of high-dimensional datasets, without dimension reduction, by extracting features that are robust to small perturbations in the data. The persistent features of a dataset can be used to describe it and to compare it to other datasets, and they can be visualized using topological barcodes or persistence diagrams (Figure 1).

Application of TDA methods has granted greater insight into high-dimensional data (Lakshmikanth et al., 2017). One prominent example is the use of TDA to characterize a clinically relevant subgroup of breast cancer patients (Nicolau, Levine, & Carlsson, 2011). This study is particularly salient because Nicolau et al. (2011) used a topological method, termed Progression Analysis of Disease, to identify a patient subgroup with 100% survival that remains invisible to other clustering methods.
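To make the idea of a "persistent feature" concrete, the following is a minimal sketch in Python of the simplest case: dimension-0 features (connected components) of a point cloud as a distance threshold grows. It is an illustration under stated assumptions, not the TDAstats implementation; the function name `h0_persistence` and the toy point cloud are hypothetical. Each (birth, death) pair corresponds to one bar in a topological barcode.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for connected components (dimension 0).

    Hypothetical helper for illustration: every point is born as its own
    component at threshold 0, and a component dies when the growing distance
    threshold first connects it to another component.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # two components merge ...
            bars.append((0.0, length))   # ... so one of them dies here
    bars.append((0.0, math.inf))         # the last component never dies
    return bars


# Toy data (an assumption for this sketch): two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
bars = h0_persistence(points)
for birth, death in bars:
    print(f"component: birth={birth:.2f}, death={death:.2f}")
```

In this toy cloud, most bars end at a short distance, while one finite bar ends only when the two clusters merge. That long bar is the kind of feature that persists across scales and survives small perturbations of the points; the short bars reflect spacing within each cluster.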

[1] E. Fredlund, et al. Mass Cytometry and Topological Data Analysis Reveal Immune Parameters Associated with Complications after Allogeneic Stem Cell Transplantation, 2017, Cell Reports.

[2] Mason A. Porter, et al. A roadmap for the computation of persistent homology, 2015, EPJ Data Science.

[3] G. Carlsson, et al. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, 2011, Proceedings of the National Academy of Sciences.

[4] Katharine Turner, et al. Hypothesis testing for topological data analysis, 2013, J. Appl. Comput. Topol.

[5] Anton Nekrutenko, et al. Ten Simple Rules for Reproducible Computational Research, 2013, PLoS Comput. Biol.

[6] Cedric E. Ginestet. ggplot2: Elegant Graphics for Data Analysis, 2011.

[7] Dirk Eddelbuettel, et al. Rcpp: Seamless R and C++ Integration, 2011.