What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

GPT-3 demonstrates the remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billions of tokens. Here we address several issues left under-reported by the GPT-3 paper, such as non-English LMs, the performance of differently sized models, and the effect of recently introduced prompt optimization on in-context learning. To this end, we introduce HyperCLOVA, a Korean 82B-parameter variant of GPT-3 trained on a Korean-centric corpus of 560B tokens. Aided by our Korean-specific tokenization, HyperCLOVA with our training configuration achieves state-of-the-art in-context zero-shot and few-shot learning performance on various downstream tasks in Korean. We also show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. We then discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to ML non-experts through HyperCLOVA Studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.
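
The in-context few-shot setup referred to in the abstract can be illustrated with a minimal sketch: a handful of labeled examples are concatenated with a query into a single prompt, and a frozen LM completes the label with no gradient updates. The snippet below is illustrative only; the `generate` stub is a hypothetical placeholder for a call to a large LM such as one served through HyperCLOVA Studio, whose actual API is not specified here, and the Korean sentiment examples are made up for demonstration.

```python
# Minimal sketch of in-context few-shot prompting (illustrative only).
# `generate` is a hypothetical stand-in for a real LM serving call.

FEW_SHOT_EXAMPLES = [
    ("이 영화 정말 재미있었어요", "긍정"),  # "This movie was really fun" -> positive
    ("시간 낭비였다", "부정"),              # "It was a waste of time"    -> negative
]

def build_prompt(query: str) -> str:
    """Concatenate labeled examples and the query into one prompt string."""
    blocks = [f"리뷰: {text}\n감정: {label}" for text, label in FEW_SHOT_EXAMPLES]
    blocks.append(f"리뷰: {query}\n감정:")
    return "\n\n".join(blocks)

def generate(prompt: str, max_tokens: int = 4) -> str:
    """Hypothetical LM call; replace with a request to an actual serving API."""
    return " 긍정"  # dummy completion so the sketch runs end to end

def classify(query: str) -> str:
    """Parse the first completed token as the predicted label."""
    completion = generate(build_prompt(query))
    return completion.strip().split()[0]

if __name__ == "__main__":
    print(classify("배우 연기가 훌륭하고 스토리도 탄탄하다"))
```

Because no parameters are updated, the same frozen model can serve many tasks; only the prompt changes, which is what makes the interactive prompt engineering workflow described above possible.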
