Semi-Supervised Knowledge Amalgamation for Sequence Classification

Sequence classification is essential for domains ranging from medical diagnosis to online advertising. In these settings, data are typically proprietary and annotations are expensive to acquire. Oftentimes, so few annotations are available that training a robust model from scratch is impractical. Recently, knowledge amalgamation (KA) has emerged as a promising strategy for training models without such hard-to-come-by labeled training data. To achieve this, KA methods combine the knowledge of multiple pre-trained teacher models (trained on different classification tasks and proprietary datasets) into one student model that becomes an expert on the union of all teachers' classes. However, we demonstrate that the state-of-the-art solutions fail in the presence of overconfident teachers, which make confident but incorrect predictions for instances from classes on which they were not trained. Additionally, to date no work has explored KA for sequence models. Therefore, we propose and then solve the open problem of semi-supervised KA for sequence classification (SKA). Our SKA approach first learns to estimate how trustworthy each teacher is for a given instance, then rescales the predicted probabilities from all teachers to supervise a student model. Our solution overcomes overconfident teachers through careful use of a very small number of labeled instances. We demonstrate that this approach beats eight state-of-the-art alternatives on four real-world datasets by an average of 15% in accuracy with as little as 2% of training data being annotated.
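
To make the trust-weighted rescaling described above concrete, the following is a minimal sketch of how instance-wise trust estimates could be combined with the teachers' predicted probabilities to build a single soft target for the student. The function names, the PyTorch framing, and the simplifying assumption that teachers cover disjoint class sets are our own illustrative choices, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def rescale_teacher_probs(teacher_probs, trust_scores):
    """Combine per-teacher class probabilities into one soft target.

    teacher_probs: list of (batch, n_classes_k) tensors, one per teacher,
                   each already softmaxed over that teacher's own classes.
    trust_scores:  (batch, n_teachers) tensor of instance-wise trust
                   estimates in [0, 1], e.g. learned from the small
                   labeled set (hypothetical interface).
    """
    batch = trust_scores.shape[0]
    n_total = sum(p.shape[1] for p in teacher_probs)
    target = torch.zeros(batch, n_total)
    offset = 0
    for k, probs in enumerate(teacher_probs):
        # Scale each teacher's distribution by how much we trust that
        # teacher for each instance, then place it in the slots reserved
        # for that teacher's classes (disjoint-classes assumption).
        target[:, offset:offset + probs.shape[1]] = trust_scores[:, k:k + 1] * probs
        offset += probs.shape[1]
    # Renormalize so the combined target is a valid distribution over
    # the union of all teachers' classes.
    return target / target.sum(dim=1, keepdim=True).clamp_min(1e-12)

def distillation_loss(student_logits, teacher_probs, trust_scores):
    # Soft-target cross-entropy between the student's prediction and
    # the rescaled teacher ensemble.
    target = rescale_teacher_probs(teacher_probs, trust_scores)
    log_p = F.log_softmax(student_logits, dim=1)
    return -(target * log_p).sum(dim=1).mean()

# Example: two teachers with 3 and 2 classes, a batch of 4 instances.
t1 = F.softmax(torch.randn(4, 3), dim=1)
t2 = F.softmax(torch.randn(4, 2), dim=1)
trust = torch.rand(4, 2)            # placeholder trust estimates
student_logits = torch.randn(4, 5)  # student covers all 5 classes
loss = distillation_loss(student_logits, [t1, t2], trust)

Under this sketch, a teacher judged untrustworthy for an instance contributes little to the student's target for that instance, which is one plausible way an overconfident but wrong teacher could be suppressed.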
