Elementwise Language Representation

We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal concatenation of lower-dimensional element (character) embeddings. While elements are always characters, materials can be semantic units of arbitrary granularity, so the method generalizes to any type of tokenization. To focus only on the important letters, the $n^{th}$ characters of each semantic unit are aligned within the $n^{th}$ attention head and then concatenated back into their original forms, creating unique embedding representations; these are jointly projected, thereby determining their own contextual importance. Technically, this framework is realized by passing a sequence of materials, each consisting of $v$ elements, to a transformer with $h=v$ attention heads. As a pure embedding technique, elementwise embedding replaces the $w$-dimensional embedding table of a transformer model with $256$ $c$-dimensional element embeddings (one per UTF-8 byte value), where $c=w/v$. Using this approach, we show that the standard transformer architecture can be reused for all levels of language representation and can process much longer sequences at the same time complexity, without any architectural modification or additional overhead. BERT trained with elementwise embedding outperforms its subword-based equivalent (the original implementation) on multilabel patent document classification, exhibiting superior robustness to domain specificity and data imbalance despite using only $0.005\%$ of the embedding parameters. Experiments demonstrate the generalizability of the proposed method by successfully transferring these enhancements to the differently architected transformers CANINE and ALBERT.
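
To make the embedding-table replacement concrete, the following PyTorch sketch shows how a material of $v$ UTF-8 bytes could be mapped to a $w$-dimensional vector by concatenating $v$ element embeddings of size $c=w/v$. This is only an illustrative reading of the description above, not the authors' implementation; the class name `ElementwiseEmbedding` and the settings $w=768$, $v=12$ are assumptions chosen for the example.

```python
# Minimal sketch of elementwise embedding (assumed PyTorch realization; names
# and hyperparameters are illustrative, not taken from the paper).
# Idea: replace a w-dimensional subword embedding table with a 256-entry table
# of c-dimensional element (UTF-8 byte) embeddings, where c = w / v. Each
# material of v bytes is embedded as the concatenation of its v element
# embeddings, then fed to a standard transformer with h = v attention heads.

import torch
import torch.nn as nn


class ElementwiseEmbedding(nn.Module):
    def __init__(self, hidden_size: int = 768, elements_per_material: int = 12):
        super().__init__()
        assert hidden_size % elements_per_material == 0
        self.v = elements_per_material
        self.c = hidden_size // elements_per_material   # c = w / v
        # 256 element embeddings, one per UTF-8 byte value.
        self.element_table = nn.Embedding(256, self.c)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len, v) -- each material is v UTF-8 bytes,
        # padded or truncated to length v.
        batch, seq_len, v = byte_ids.shape
        elems = self.element_table(byte_ids)                 # (batch, seq_len, v, c)
        return elems.reshape(batch, seq_len, v * self.c)     # (batch, seq_len, w)


# Usage: "hello" as one material of v=12 bytes (zero-padded), embedded into w=768 dims.
emb = ElementwiseEmbedding(hidden_size=768, elements_per_material=12)
material = list("hello".encode("utf-8"))
material += [0] * (12 - len(material))
ids = torch.tensor(material).view(1, 1, 12)
print(emb(ids).shape)  # torch.Size([1, 1, 768])
```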
