Elementwise Language Representation

We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal concatenation of lower-dimensional element (character) embeddings. While elements are always characters, materials can be semantic units of arbitrary granularity, so the method generalizes to any type of tokenization. To focus only on the important letters, the $n^{th}$ characters of each semantic unit are aligned within the $n^{th}$ attention head and then concatenated back into their original forms, creating unique embedding representations; these are jointly projected, thereby determining their own contextual importance. Technically, this framework is realized by passing a sequence of materials, each consisting of $v$ elements, to a transformer with $h=v$ attention heads. As a pure embedding technique, elementwise embedding replaces the $w$-dimensional embedding table of a transformer model with $256$ $c$-dimensional element embeddings (one per UTF-8 byte value), where $c=w/v$. Using this approach, we show that the standard transformer architecture can be reused for all levels of language representation and can process much longer sequences at the same time complexity, without any architectural modification or additional overhead. BERT trained with elementwise embedding outperforms its subword-based equivalent (the original implementation) on multilabel patent document classification, exhibiting superior robustness to domain specificity and data imbalance despite using only $0.005\%$ of the embedding parameters. Experiments demonstrate the generalizability of the proposed method by successfully transferring these enhancements to the differently architected transformers CANINE and ALBERT.
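
To make the embedding-table replacement concrete, the following PyTorch sketch shows how a material of $v$ UTF-8 bytes could be mapped to a $w$-dimensional vector by concatenating $v$ element embeddings of size $c=w/v$. This is only an illustrative reading of the description above, not the authors' implementation; the class name `ElementwiseEmbedding` and the settings $w=768$, $v=12$ are assumptions chosen for the example.

```python
# Minimal sketch of elementwise embedding (assumed PyTorch realization; names
# and hyperparameters are illustrative, not taken from the paper).
# Idea: replace a w-dimensional subword embedding table with a 256-entry table
# of c-dimensional element (UTF-8 byte) embeddings, where c = w / v. Each
# material of v bytes is embedded as the concatenation of its v element
# embeddings, then fed to a standard transformer with h = v attention heads.

import torch
import torch.nn as nn


class ElementwiseEmbedding(nn.Module):
    def __init__(self, hidden_size: int = 768, elements_per_material: int = 12):
        super().__init__()
        assert hidden_size % elements_per_material == 0
        self.v = elements_per_material
        self.c = hidden_size // elements_per_material   # c = w / v
        # 256 element embeddings, one per UTF-8 byte value.
        self.element_table = nn.Embedding(256, self.c)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len, v) -- each material is v UTF-8 bytes,
        # padded or truncated to length v.
        batch, seq_len, v = byte_ids.shape
        elems = self.element_table(byte_ids)                 # (batch, seq_len, v, c)
        return elems.reshape(batch, seq_len, v * self.c)     # (batch, seq_len, w)


# Usage: "hello" as one material of v=12 bytes (zero-padded), embedded into w=768 dims.
emb = ElementwiseEmbedding(hidden_size=768, elements_per_material=12)
material = list("hello".encode("utf-8"))
material += [0] * (12 - len(material))
ids = torch.tensor(material).view(1, 1, 12)
print(emb(ids).shape)  # torch.Size([1, 1, 768])
```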
