Clustering: how much bias do we need?

Scientific investigations in medicine and beyond increasingly describe observations by more features than can be visualized simultaneously. Simply reducing the dimensionality by projection destroys essential relationships in the data. Similarly, traditional clustering algorithms introduce a bias that prevents the detection of the natural structures expected from generic nonlinear processes. We examine how these problems can best be addressed, focusing on two recent clustering approaches, Phenograph and Hebbian learning clustering, applied to synthetic and natural data examples. Our results show that minimizing clustering bias is essential even for very basic questions, but that results can benefit further from biased post-processing. This article is part of the themed issue ‘Mathematical methods in medicine: neuroscience, cardiology and pathology’.
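The claim that projection can destroy relationships in the data can be illustrated with a toy example. The sketch below is not taken from the article: the data set, the helper names `make_clusters` and `centroid_gap`, and the choice of an axis-aligned projection are all assumptions made for illustration. Two Gaussian clusters are separated only along the third coordinate, so dropping that coordinate makes them overlap.

```python
import math
import random


def make_clusters(n_per_cluster=200, separation=10.0, seed=0):
    """Two 3D Gaussian clusters that differ only in their z-coordinate (toy data)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, z_center in enumerate((0.0, separation)):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(z_center, 1)))
            labels.append(label)
    return points, labels


def centroid_gap(points, labels, dims):
    """Distance between the two cluster centroids, using only the given coordinates."""
    def centroid(label):
        members = [p for p, l in zip(points, labels) if l == label]
        return [sum(p[d] for p in members) / len(members) for d in dims]
    return math.dist(centroid(0), centroid(1))


points, labels = make_clusters()
gap_full = centroid_gap(points, labels, dims=(0, 1, 2))   # all three features
gap_projected = centroid_gap(points, labels, dims=(0, 1))  # z is projected away

print(f"centroid gap in 3D:          {gap_full:.2f}")
print(f"centroid gap after projection: {gap_projected:.2f}")
```

In this construction the centroid gap is close to the chosen separation in the full space and close to zero after projection, so any clustering run on the projected data has nothing left to find. Real data rarely hide their structure in a single axis, which is why the choice of projection matters.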

[1]  K. Popper Logik der Forschung : zur erkenntnistheorie der modernen naturwissenschaft , 1936 .

[2]  A Goldbeter,et al.  Birhythmicity, chaos, and other patterns of temporal self-organization in a multiply regulated biochemical system. , 1982, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Sean C. Bendall,et al.  Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis , 2015, Cell.

[4]  Ruedi Stoop,et al.  Encounter with Chaos , 1992 .

[5]  Florian Gomez,et al.  Universal dynamical properties preclude standard clustering in a large class of biochemical data , 2014, Bioinform..

[6]  R Stoop,et al.  Mesocopic comparison of complex networks based on periodic orbits. , 2011, Chaos.

[7]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[8]  Anders Eriksson,et al.  Highlighting nonlinear patterns in population genetics datasets , 2015, Scientific Reports.

[9]  Joshua B. Tenenbaum,et al.  Global Versus Local Methods in Nonlinear Dimensionality Reduction , 2002, NIPS.

[10]  Juan Carlos Fernández,et al.  Multiobjective evolutionary algorithms to identify highly autocorrelated areas: the case of spatial distribution in financially compromised farms , 2014, Ann. Oper. Res..

[11]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[12]  Mark D. Robinson,et al.  Comparison of Clustering Methods for High-Dimensional Single-Cell Flow and Mass Cytometry Data , 2016, bioRxiv.

[13]  A. Jacquin A fractal theory of iterated Markov operators with applications to digital image coding , 1989 .

[14]  Michael F. Barnsley,et al.  Fractals everywhere , 1988 .

[15]  Cvitanovic,et al.  Invariant measurement of strange sets in terms of cycles. , 1988, Physical review letters.

[16]  Ian T. Jolliffe,et al.  Principal Component Analysis , 2002, International Encyclopedia of Statistical Science.

[17]  Leonid A. Bunimovich,et al.  Complexity of Dynamics as Variability of Predictability , 2004 .

[18]  Greg Finak,et al.  Critical assessment of automated flow cytometry data analysis techniques , 2013, Nature Methods.

[19]  R. Stoop,et al.  Big data naturally rescaled , 2016 .

[20]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[21]  Ruedi Stoop,et al.  Real-world existence and origins of the spiral organization of shrimp-shaped domains. , 2010, Physical review letters.

[22]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[23]  Ruedi Stoop,et al.  Sequential Superparamagnetic Clustering for Unbiased Classification of High‐Dimensional Chemical Data. , 2004 .

[24]  On the convergence of the thermodynamic averages in dissipative dynamical systems , 1991 .

[25]  Edgar Jacoby,et al.  An Ontology for Pharmaceutical Ligands and Its Application for in Silico Screening and Library Design. , 2010 .

[26]  K. Popper,et al.  Logik der Forschung , 1935 .

[27]  Ruedi Stoop,et al.  Periodic orbit analysis demonstrates genetic constraints, variability, and switching in Drosophila courtship behavior. , 2008, Chaos.

[28]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[29]  K. Gödel Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I , 1931 .

[30]  Ruedi Stoop,et al.  Evaluation of probabilistic and dynamical invariants from finite symbolic substrings—comparison between two approaches , 1992 .

[31]  K. Gödel,et al.  Diskussion zur Grundlegung der Mathematik , 1931 .

[32]  R Stoop,et al.  Natural computation measured as a reduction of complexity. , 2004, Chaos.

[33]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[34]  Nikolai F Rulkov,et al.  Modeling of spiking-bursting neural behavior using two-dimensional map. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[35]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[36]  Ruedi Stoop,et al.  Hebbian Self-Organizing Integrate-and-Fire Networks for Data Clustering , 2010, Neural Computation.