Active Learning for Effectively Fine-Tuning Transfer Learning to Downstream Task

Language models (LMs) have become a common means of transfer learning for Natural Language Processing (NLP) tasks when working with small labelled datasets. An LM is pretrained on a large, easily available unlabelled text corpus and is then fine-tuned with the labelled data of the target (i.e., downstream) task. Because an LM is designed to capture the linguistic aspects of semantics, it can be biased towards linguistic features. We argue that exposing an LM during fine-tuning to instances that capture the diverse semantic aspects (e.g., topical, linguistic, semantic relations) present in the dataset will improve its performance on the underlying task. We propose a Mixed Aspect Sampling (MAS) framework that samples instances capturing different semantic aspects of the dataset and uses an ensemble classifier to improve classification performance. Experimental results show that MAS outperforms random sampling as well as state-of-the-art active learning models on abuse detection tasks, where it is hard to collect the labelled data needed to build an accurate classifier.
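
The abstract describes MAS only at a high level. The Python sketch below illustrates the core idea and is not the authors' implementation: two semantic aspects are approximated with classical features, topical (LDA topic proportions) and lexical (TF-IDF), each aspect view is clustered, and the labelling budget is spread across clusters of every view so the sampled instances cover diverse aspects. A classifier fine-tuned on each aspect's sample could then be combined in an ensemble. All function names, feature choices, and parameters here are illustrative assumptions.

    # Minimal, assumption-laden sketch of mixed-aspect sampling.
    # Assumed: classical features (LDA topics, TF-IDF) stand in for LM
    # representations, and KMeans clusters stand in for semantic aspects.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans

    def aspect_views(texts, n_topics=10):
        """Build one feature matrix per (assumed) semantic aspect."""
        lexical = TfidfVectorizer(max_features=5000).fit_transform(texts)
        counts = CountVectorizer(max_features=5000).fit_transform(texts)
        topical = LatentDirichletAllocation(
            n_components=n_topics, random_state=0).fit_transform(counts)
        return {"lexical": lexical, "topical": topical}

    def mixed_aspect_sample(texts, budget, n_clusters=8, seed=0):
        """Select `budget` instances spread over clusters of every aspect view."""
        rng = np.random.default_rng(seed)
        views = aspect_views(texts)
        per_view = budget // len(views)
        chosen = set()
        for name, X in views.items():
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=seed).fit_predict(X)
            # Round-robin over clusters so each aspect contributes diverse points.
            for c in range(n_clusters):
                members = np.flatnonzero(labels == c)
                rng.shuffle(members)
                for idx in members[:max(1, per_view // n_clusters)]:
                    chosen.add(int(idx))
        return sorted(chosen)[:budget]

    # Usage: indices = mixed_aspect_sample(unlabelled_texts, budget=200)
    # The selected instances would then be labelled and used to fine-tune the LM,
    # with per-aspect classifiers combined by, e.g., majority vote.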
