Stochastic EM for Shuffled Linear Regression

We consider the problem of inference in a linear regression model in which the relative ordering of the input features and output labels is not known. Such datasets naturally arise from experiments in which the samples are shuffled or permuted during the protocol. In this work, we propose a framework that treats the unknown permutation as a latent variable. We maximize the likelihood of observations using a stochastic expectation-maximization (EM) approach. We compare this to the dominant approach in the literature, which corresponds to hard EM in our framework. We show on synthetic data that the stochastic EM algorithm we develop has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled. We conclude by performing two experiments on real datasets that have been partially shuffled, in which we show that the stochastic EM algorithm can recover the weights with modest error.
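The latent-permutation idea can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it assumes a single scalar weight, Gaussian noise with a known standard deviation `sigma`, and a Metropolis-Hastings sampler that proposes pairwise swaps. The function name `stochastic_em_1d` and all of its parameters are hypothetical. The E-step averages the aligned labels over sampled permutations, and the M-step solves least squares against those averaged labels.

```python
import math
import random


def stochastic_em_1d(x, y, w0=0.0, sigma=1.0, n_iters=30,
                     n_sweeps=20, burn_in=5, seed=0):
    """Illustrative stochastic EM for y[perm[i]] ~ w * x[i] + Gaussian noise.

    Assumptions (not from the source): scalar weight, known sigma,
    swap-based Metropolis-Hastings sampling, n_sweeps > burn_in.
    """
    rng = random.Random(seed)
    n = len(x)
    perm = list(range(n))  # perm[i] = index of the label currently paired with x[i]
    sxx = sum(xi * xi for xi in x)
    w = w0
    for _ in range(n_iters):
        # Stochastic E-step: sample permutations from p(perm | x, y, w).
        y_sum = [0.0] * n
        kept = 0
        for sweep in range(n_sweeps):
            for _ in range(n):
                i, j = rng.randrange(n), rng.randrange(n)
                if i == j:
                    continue
                # Squared error of the two affected pairs before and after the swap.
                old = (y[perm[i]] - w * x[i]) ** 2 + (y[perm[j]] - w * x[j]) ** 2
                new = (y[perm[j]] - w * x[i]) ** 2 + (y[perm[i]] - w * x[j]) ** 2
                log_ratio = (old - new) / (2.0 * sigma ** 2)
                # Accept with probability min(1, exp(log_ratio)).
                if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                    perm[i], perm[j] = perm[j], perm[i]
            if sweep >= burn_in:
                # Accumulate the labels aligned to each input under this sample.
                for i in range(n):
                    y_sum[i] += y[perm[i]]
                kept += 1
        # M-step: least squares against the sample-averaged aligned labels.
        w = sum(xi * ys for xi, ys in zip(x, y_sum)) / (kept * sxx)
    return w, perm


# Toy usage: noiseless labels with true weight 2, then shuffled.
x = [float(k) for k in range(1, 11)]
y = [2.0 * xi for xi in x]
random.Random(1).shuffle(y)
w_hat, perm = stochastic_em_1d(x, y)
print(w_hat)
```

Hard EM would replace the averaging over sampled permutations with the single most likely permutation at each iteration. In this scalar setting that reduces to sorting the labels to match the order of `w * x`.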
