Trace your sources in large-scale data: one ring to find them all

An important preprocessing step in most data analysis pipelines is to extract a small set of sources that explain most of the data. However, the algorithms currently used for blind source separation (BSS) often fail to extract the desired sources and require extensive cross-validation. Their rarely used probabilistic counterparts, in contrast, need little cross-validation and are more accurate and reliable, but no simple and scalable implementations are available. Here we present DECOMPOSE, a novel probabilistic BSS framework that can be flexibly adjusted to the data, is extensible and easy to use, adapts to individual sources, and handles large-scale data through algorithmic efficiency. DECOMPOSE encompasses and generalises many traditional BSS algorithms, such as PCA, ICA and NMF, and we demonstrate substantial improvements in accuracy and robustness on artificial and real data.
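To make the setting concrete, the following is a minimal sketch of source extraction as a low-rank matrix factorisation, using plain PCA (truncated SVD) as the traditional baseline. It is not the DECOMPOSE API: the function name `pca_sources`, the synthetic data, and the choice of three sources are assumptions made purely for illustration.

```python
import numpy as np


def pca_sources(X, n_sources):
    """Rank-k factorisation X ~ U @ V + mean via truncated SVD (PCA).

    Hypothetical helper for illustration only; not part of DECOMPOSE.
    """
    mean = X.mean(axis=0, keepdims=True)      # per-column mean
    Xc = X - mean                             # centre the data
    left, sing, right = np.linalg.svd(Xc, full_matrices=False)
    U = left[:, :n_sources] * sing[:n_sources]  # source activations
    V = right[:n_sources]                       # source loadings
    return U, V, mean


# Synthetic data (an assumption): 3 latent sources plus small Gaussian noise.
rng = np.random.default_rng(0)
true_U = rng.normal(size=(200, 3))
true_V = rng.normal(size=(3, 50))
X = true_U @ true_V + 0.01 * rng.normal(size=(200, 50))

U, V, mean = pca_sources(X, n_sources=3)

# Relative reconstruction error: how much of the data the sources explain.
rel_err = np.linalg.norm(X - (U @ V + mean)) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this baseline the number of sources is fixed by hand, which is the kind of choice that typically needs cross-validation in non-probabilistic BSS.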
