Topology and data

An important feature of modern science and engineering is that data of various kinds is being produced at an unprecedented rate. This is so in part because of new experimental methods, and in part because of the increased availability of high-powered computing technology. It is also clear that the nature of the data we are obtaining is significantly different. For example, it is now often the case that we are given data in the form of very long vectors, where all but a few of the coordinates turn out to be irrelevant to the questions of interest, and further that we don’t necessarily know which coordinates are the interesting ones. A related fact is that the data is often very high-dimensional, which severely restricts our ability to visualize it. The data obtained is also often much noisier than in the past and has more missing information. This is especially so in the case of biological data, in particular high-throughput data from microarrays or other sources. Our ability to analyze this data, in terms of both its quantity and its nature, is clearly not keeping pace with the rate at which it is being produced.

In this paper, we will discuss how geometry and topology can be applied to make useful contributions to the analysis of various kinds of data. Geometry and topology are very natural tools to apply in this direction, since geometry can be regarded as the study of distance functions, and what one often works with are distance functions on large finite sets of data. The mathematical formalism which has been developed for incorporating geometric and topological techniques deals with point clouds, i.e. finite sets of points equipped with a distance function. It then adapts tools from the various branches of geometry to the study of point clouds. The point clouds are intended to be thought of as finite samples taken from a geometric object, perhaps with noise.
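To make the notion of a point cloud concrete, here is a minimal sketch, not taken from the paper. It assumes the underlying geometric object is the unit circle in the plane, that the noise is Gaussian, and that the distance function is Euclidean. The names `sample_noisy_circle` and `distance_matrix` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def sample_noisy_circle(n, noise=0.05, seed=0):
    """Draw n points near the unit circle.

    Assumption for illustration: each coordinate is perturbed by
    independent Gaussian noise with standard deviation `noise`.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x = math.cos(theta) + rng.gauss(0.0, noise)
        y = math.sin(theta) + rng.gauss(0.0, noise)
        points.append((x, y))
    return points


def distance_matrix(points):
    """Pairwise Euclidean distances: the distance function on the finite set."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


# A point cloud is the finite set together with its distance function.
cloud = sample_noisy_circle(50)
D = distance_matrix(cloud)
```

The pair `(cloud, D)` is the kind of object the formalism takes as input. In practice the distance function need not be Euclidean, and the underlying geometric object is of course unknown rather than chosen in advance.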
Here are some of the key points which come up when applying these geometric methods to data analysis.

• Qualitative information is needed: One important goal of data analysis is to allow the user to obtain knowledge about the data, i.e. to understand how it is organized on a large scale. For example, if we imagine that we are looking at a data set constructed somehow from diabetes patients, it would be important to develop the understanding that there are two types of the disease, namely the juvenile and adult onset forms. Once that is established, one of course wants to develop quantitative methods for distinguishing them, but the first insight about the distinct forms of the disease is key.
