Characterizing and Measuring Linguistic Dataset Drift

NLP models often degrade in performance when real-world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often used in practice. In this paper, we propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift. These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies (e.g., lexical semantic change). We propose interpretable metrics for all three drift dimensions, and we modify past performance prediction methods to predict model performance at both the example and dataset levels for English sentiment classification and natural language inference. We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies (mean 16.8% root mean square error decrease), particularly when compared to popular fine-tuned embedding distances (mean 47.7% error decrease). Fine-tuned embedding distances are much more effective at ranking individual examples by expected performance, but decomposing drift into our vocabulary, structural, and semantic dimensions produces the best example rankings of all considered model-agnostic drift metrics (mean 6.7% ROC AUC increase).
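
The abstract does not spell out the metric definitions, but a minimal sketch of the vocabulary-drift idea, a content word frequency divergence between a training corpus and a target corpus, might look like the following. The stopword set, whitespace tokenization, and the choice of Jensen-Shannon divergence are all illustrative assumptions here, not the paper's actual formulation; real content-word filtering would more likely use part-of-speech tags.

```python
# Illustrative sketch only: computes a plausible vocabulary-drift score as the
# Jensen-Shannon divergence (base 2) between content-word frequency
# distributions of two corpora. All modeling choices are assumptions, not the
# paper's actual metric.
from collections import Counter
import math

# Small function-word list used to approximate "content words" (assumption).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "was", "for", "on", "with", "as", "at", "by"}

def content_word_freqs(texts):
    """Relative frequencies of lowercased alphabetic content words."""
    counts = Counter(
        tok for text in texts for tok in text.lower().split()
        if tok.isalpha() and tok not in STOPWORDS
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two frequency dicts, in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL(a || m); m[w] > 0 wherever a[w] > 0, so no division by zero.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy corpora: movie-review training data vs. tech-support target data.
train = ["the movie was wonderful and moving",
         "a touching story with great acting"]
target = ["the firmware update bricked my router",
          "latency spiked after the driver patch"]

drift = js_divergence(content_word_freqs(train), content_word_freqs(target))
print(f"vocabulary drift (JSD): {drift:.3f}")  # higher = larger divergence
```

Structural and semantic drift, as the abstract describes them, would require richer representations than word counts (e.g., syntactic parses and contextual embeddings), which this sketch deliberately omits.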
