COPA: Constrained PARAFAC2 for Sparse & Large Datasets

PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with a varying number of medical encounters over time. Despite recent improvements on unconstrained PARAFAC2, its model factors are usually dense and sensitive to noise, which limits their interpretability. As a result, the following open challenges remain: a) various modeling constraints, such as temporal smoothness, sparsity, and non-negativity, need to be imposed for interpretable temporal modeling, and b) a scalable approach is required to support those constraints efficiently for large datasets. To tackle these challenges, we propose a COnstrained PARAFAC2 (COPA) method, which carefully incorporates optimization constraints such as temporal smoothness, sparsity, and non-negativity in the resulting factors. To efficiently support all those constraints, COPA adopts a hybrid optimization framework using alternating optimization and the alternating direction method of multipliers (AO-ADMM). As evaluated on large electronic health record (EHR) datasets with hundreds of thousands of patients, COPA achieves significant speedups (up to 36 times faster) over prior PARAFAC2 approaches that only attempt to handle a subset of the constraints that COPA enables, while achieving the same level of accuracy. Through a case study on temporal phenotyping of medically complex children, we demonstrate how the constraints imposed by COPA reveal concise phenotypes and meaningful temporal profiles of patients. The clinical interpretation of both the phenotypes and the temporal profiles was confirmed by a medical expert.
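To make the ADMM building block concrete, the following is a minimal sketch of how a single factor update with a non-negativity constraint could be solved by ADMM inside an alternating-optimization loop. It is not the COPA implementation: the function name `admm_nonneg_factor_update`, the variable names `W`, `Y`, `H`, and the default choice of the penalty parameter `rho` are assumptions made for illustration, and the smoothness and sparsity constraints mentioned in the abstract are not covered.

```python
import numpy as np

def admm_nonneg_factor_update(W, Y, rho=None, n_iter=500):
    """Approximately solve  min_H 0.5 * ||Y - W @ H||_F^2  subject to H >= 0.

    Illustrative sketch only (hypothetical helper, not from the paper).
    W: (m, k) fixed factor, Y: (m, n) data, returns H: (k, n).
    """
    k = W.shape[1]
    gram = W.T @ W
    if rho is None:
        # Assumed heuristic: scale the penalty to the average Gram eigenvalue.
        rho = np.trace(gram) / k
    lhs = gram + rho * np.eye(k)   # system matrix, fixed across iterations
    WtY = W.T @ Y

    H = np.zeros((k, Y.shape[1]))  # constrained copy of the factor
    U = np.zeros_like(H)           # scaled dual variable
    for _ in range(n_iter):
        # 1) Unconstrained least-squares step on the auxiliary variable.
        H_aux = np.linalg.solve(lhs, WtY + rho * (H + U))
        # 2) Proximal step: projection onto the non-negative orthant.
        H = np.maximum(H_aux - U, 0.0)
        # 3) Dual update on the consensus residual H - H_aux.
        U = U + H - H_aux
    return H
```

Under this splitting, a different constraint would only change step 2, where the projection is replaced by the proximal operator of the corresponding penalty. This is one reason an ADMM inner solver is convenient when several constraints must be supported.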
