LMdiff: A Visual Diff Tool to Compare Language Models

While language models are ubiquitous in NLP, it is hard to contrast their outputs and to identify which contexts one model handles better than another. To address this question, we introduce LMdiff, a tool that visually compares the probability distributions of two models that differ, e.g., through finetuning, distillation, or simply training with different numbers of parameters. LMdiff supports the generation of hypotheses about model behavior by letting users inspect text instances token by token, and it further assists in choosing these instances by identifying the phrases from large corpora on which the two models differ most. We showcase the applicability of LMdiff for hypothesis generation across multiple case studies. A demo is available at http://lmdiff.net.
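The core operation the abstract describes is scoring the same text under two causal language models and diffing their per-token probabilities. The sketch below illustrates that idea; it is a minimal assumption-laden example (model names, the absolute-difference ranking heuristic, and the helper function are illustrative, not the paper's implementation), using the HuggingFace transformers API. GPT-2 and DistilGPT-2 share a tokenizer, which keeps the token alignment trivial.

```python
# Sketch: per-token log-probability diff between two causal LMs.
# Models and the ranking heuristic are illustrative assumptions,
# not the LMdiff paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model, tokenizer, text):
    """Return (tokens, log-prob of each token given its left context)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits          # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]                    # each position predicts the next token
    token_lp = logprobs[torch.arange(targets.size(0)), targets]
    return tokenizer.convert_ids_to_tokens(targets), token_lp

tok = AutoTokenizer.from_pretrained("gpt2")
m1 = AutoModelForCausalLM.from_pretrained("gpt2").eval()
m2 = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
tokens, lp1 = per_token_logprobs(m1, tok, text)
_, lp2 = per_token_logprobs(m2, tok, text)

# Tokens where the two models disagree most are candidates for inspection.
for t, a, b in sorted(zip(tokens, lp1.tolist(), lp2.tolist()),
                      key=lambda x: -abs(x[1] - x[2]))[:5]:
    print(f"{t!r}: gpt2={a:.2f}, distilgpt2={b:.2f}, diff={a - b:+.2f}")
```

Ranking tokens by the divergence between the two models is one plausible way to surface "interesting" instances from a corpus, as the abstract suggests; the tool itself presents such comparisons visually rather than as a printed list.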
