Deep Learning Meets Projective Clustering

A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers which replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, the rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that translate into large drops in accuracy. Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this single subspace with a set of $k$ subspaces, each of dimension $j$, that minimizes the sum, over all points (rows of $A$), of the squared distance from each point to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer with a set of $k$ small layers that operate in parallel and whose outputs are then recombined by a single fully-connected layer. Extensive experiments on the GLUE benchmark yield networks that are both more accurate and smaller than those obtained via the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of its embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy, compared to a $3\%$ drop with the SVD approach. Open code for reproducing and extending our results is provided.
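
To make the geometric idea concrete, below is a minimal NumPy sketch (not the authors' released code) of the two schemes described above: the standard rank-$j$ SVD factorization of the embedding matrix $A$, and a Lloyd/EM-style alternating heuristic that fits $k$ subspaces of dimension $j$, assigning each row of $A$ to the subspace with the smallest squared projection error. The function names, the random initialization, and the alternating heuristic itself are illustrative assumptions; the paper's actual algorithm may differ.

```python
import numpy as np


def svd_factorize(A, j):
    """Baseline: rank-j SVD factorization A ~= L @ R with L: n x j and R: j x d,
    i.e. the two small fully-connected layers that replace the embedding layer."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :j] * s[:j], Vt[:j, :]


def projective_clustering(A, k, j, iters=20, seed=0):
    """Lloyd/EM-style alternating heuristic (an assumption, not necessarily the
    paper's method): fit k subspaces of dimension j so that each row of A is
    close, in squared distance, to its nearest subspace."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    labels = rng.integers(k, size=n)              # random initial assignment
    for _ in range(iters):
        # Refit step: each subspace = top-j right singular vectors of its rows.
        bases = []
        for i in range(k):
            rows = A[labels == i]
            if rows.shape[0] < j:                 # degenerate cluster: reseed
                rows = A[rng.choice(n, size=j, replace=False)]
            bases.append(np.linalg.svd(rows, full_matrices=False)[2][:j])
        # Reassign step: squared projection error ||a - a B^T B||^2 per subspace.
        costs = np.stack(
            [((A - (A @ B.T) @ B) ** 2).sum(axis=1) for B in bases], axis=1)
        new_labels = costs.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return bases, labels


# Toy usage: compress a 1000 x 64 "embedding" matrix with k = 4 subspaces of
# dimension j = 8 and compare the fit against a single rank-8 subspace (SVD).
A = np.random.randn(1000, 64) @ np.random.randn(64, 64)
L_svd, R_svd = svd_factorize(A, j=8)
svd_error = ((A - L_svd @ R_svd) ** 2).sum()
bases, labels = projective_clustering(A, k=4, j=8)
pc_error = sum(
    ((A[labels == i] - (A[labels == i] @ B.T) @ B) ** 2).sum()
    for i, B in enumerate(bases))
print(f"SVD error: {svd_error:.1f}, projective-clustering error: {pc_error:.1f}")
```

In the proposed architecture, each cluster's pair of small factor matrices would form one of the $k$ parallel layers, whose outputs are then recombined by a single fully-connected layer, as described above.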

[1] Murad Tukan, et al. Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation, 2020, ArXiv.

[2] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[3] Rahul Goel, et al. Online Embedding Compression for Text Classification using Low Rank Matrix Factorization, 2018, AAAI.

[4] Xin Jiang, et al. TinyBERT: Distilling BERT for Natural Language Understanding, 2019, Findings of EMNLP.

[5] Murad Tukan, et al. Coresets for Near-Convex Functions, 2020, NeurIPS.

[6] Zhixun Su, et al. Fixed-rank representation for unsupervised visual learning, 2012, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[8] Rémi Louf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[9] Ahmed Hassan Awadallah, et al. Distilling BERT into Simple Neural Networks with Unlabeled Transfer Data, 2019.

[10] Klemens Böhm, et al. One-Class Active Learning for Outlier Detection with Multiple Subspaces, 2019, CIKM.

[11] Jimmy J. Lin, et al. MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models, 2019, ArXiv.

[12] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[13] Quoc V. Le, et al. Semi-supervised Sequence Learning, 2015, NIPS.

[14] J. Scott McCarley, et al. Structured Pruning of a BERT-based Question Answering Model, 2019.

[15] Michael W. Mahoney, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2019, AAAI.

[16] Harry Shum, et al. Concurrent subspaces analysis, 2005, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Yanzhi Wang, et al. Reweighted Proximal Pruning for Large-Scale Language Representation, 2019, ArXiv.

[18] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[19] Kasturi R. Varadarajan, et al. No Coreset, No Cry: II, 2005, FSTTCS.

[20] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[21] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.

[22] Ziheng Wang, et al. Structured Pruning of Large Language Models, 2019, EMNLP.

[23] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[24] Pascal Frossard, et al. Dictionary Learning, 2011, IEEE Signal Processing Magazine.

[25] S. P. Lloyd, et al. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[26] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[27] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[28] Xiaokang Yang, et al. Learning dictionary via subspace segmentation for sparse representation, 2011, IEEE International Conference on Image Processing (ICIP).

[29] Dacheng Tao, et al. On Compressing Deep Models by Low Rank and Sparse Decomposition, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Guillermo Sapiro, et al. Supervised Dictionary Learning, 2008, NIPS.

[31] Jimmy J. Lin, et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, 2019, ArXiv.

[32] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.

[33] Omer Levy, et al. Are Sixteen Heads Really Better than One?, 2019, NeurIPS.

[34] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[35] Jimmy J. Lin, et al. Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models, 2019, ArXiv.

[36] Ahmed Hassan Awadallah, et al. Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data, 2019, ArXiv.

[37] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[38] Moshe Wasserblat, et al. Q8BERT: Quantized 8Bit BERT, 2019, Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2-NeurIPS).

[39] David P. Woodruff, et al. Input Sparsity and Hardness for Robust Subspace Approximation, 2015, IEEE Symposium on Foundations of Computer Science (FOCS).

[40] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[41] Mitchell A. Gordon, et al. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning, 2020, RepL4NLP.

[42] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[43] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[44] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[45] Edouard Grave, et al. Reducing Transformer Depth on Demand with Structured Dropout, 2019, ICLR.

[46] Quoc V. Le, et al. Distributed Representations of Sentences and Documents, 2014, ICML.

[47] J. Scott McCarley, et al. Pruning a BERT-based Question Answering Model, 2019, ArXiv.