SPARTan: Scalable PARAFAC2 for Large & Sparse Data

In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely, so there is no meaningful way to align their clinical records across time points for analysis. The state-of-the-art tensor model for such data is PARAFAC2, which yields interpretable and robust output and can naturally handle sparse data. However, its main limitation so far has been the lack of efficient algorithms that can handle large-scale datasets. In this work, we fill this gap by developing SPARTan, a scalable method for computing the PARAFAC2 decomposition of large and sparse datasets. Our method exploits special structure within PARAFAC2, leading to a novel algorithmic reformulation that is both faster (in absolute time) and more memory-efficient than prior work. We evaluate SPARTan on both synthetic and real datasets, showing 22X performance gains over the best previous implementation and handling larger problem instances for which the baseline fails. Furthermore, we apply SPARTan to the mining of temporally-evolving phenotypes on data taken from real and medically complex pediatric patients. The clinical meaningfulness of the identified phenotypes, as well as their temporal evolution for several patients, has been endorsed by clinical experts.
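To make the "observations do not align" setting concrete, the following is a minimal sketch of the standard PARAFAC2 model structure, in which each subject k has its own matrix X_k (rows = that subject's time points, columns = shared features) approximated as X_k ≈ U_k diag(s_k) Vᵀ with U_k = Q_k H and Q_k having orthonormal columns. It is not the SPARTan algorithm and performs no fitting; it only generates small dense synthetic slices of differing lengths and checks the invariance U_kᵀ U_k = Hᵀ H that the model imposes. The function name `make_parafac2_slices` and all sizes are hypothetical choices for illustration.

```python
import numpy as np

def make_parafac2_slices(row_counts, n_features, rank, seed=0):
    """Generate synthetic slices that follow the PARAFAC2 structure exactly.

    Hypothetical helper for illustration only (dense, no fitting).
    Each slice k has its own number of rows, but all share V and H.
    """
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((rank, rank))        # shared across subjects
    V = rng.standard_normal((n_features, rank))  # shared feature factor
    slices, factors = [], []
    for n_rows in row_counts:
        # Q_k with orthonormal columns; requires n_rows >= rank
        Q, _ = np.linalg.qr(rng.standard_normal((n_rows, rank)))
        U = Q @ H                                # subject-specific factor
        s = rng.uniform(0.5, 2.0, size=rank)     # subject-specific weights
        X = U @ np.diag(s) @ V.T                 # slice of shape (n_rows, n_features)
        slices.append(X)
        factors.append((U, s))
    return slices, factors, H, V

# Three "subjects" with 5, 8, and 12 observations over 6 shared features.
row_counts = [5, 8, 12]
slices, factors, H, V = make_parafac2_slices(row_counts, n_features=6, rank=3)

shapes = [X.shape for X in slices]
# The PARAFAC2 constraint: U_k^T U_k is the same matrix for every subject.
cross_products = [U.T @ U for U, _ in factors]
invariant_holds = all(np.allclose(cp, H.T @ H) for cp in cross_products)
```

The slices have different row counts yet share the feature factor V, which is what lets the model compare subjects without aligning their time axes. A real workload of the kind described above would store each X_k as a sparse matrix rather than a dense array.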
