MA-BERT: Learning Representation by Incorporating Multi-Attribute Knowledge in Transformers

Incorporating attribute information such as user and product features into deep neural networks has been shown to be useful in sentiment analysis. Previous work typically accomplished this in one of two ways: concatenating the attributes to the word or text representation, or treating them as a bias that adjusts the attention distribution. To leverage the advantages of both methods, this paper proposes a multi-attribute BERT (MA-BERT) that incorporates external attribute knowledge. The proposed method has two advantages. First, it applies multi-attribute transformer (MA-Transformer) encoders to inject multiple attributes into both the input representation and the attention distribution. Second, the MA-Transformer is implemented as a universal layer and stacked on a BERT-based model, so it can be initialized from a pre-trained checkpoint and fine-tuned for downstream applications without extra pre-training cost. Experiments on three benchmark datasets show that the proposed method outperformed pre-trained BERT models and other methods that incorporate external attribute knowledge.
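
The abstract does not give the exact formulation, but the two injection points can be illustrated with a minimal PyTorch sketch of a hypothetical MA-Transformer layer: attribute embeddings (e.g., user and product) are added to the token representations and also produce a per-key bias on the attention logits. All names, shapes, and projection choices below are illustrative assumptions, not the paper's actual implementation.

```python
import math
import torch
import torch.nn as nn


class MATransformerLayer(nn.Module):
    """Sketch of a multi-attribute transformer (MA-Transformer) layer.

    Attribute embeddings (e.g., user and product) are injected at two points:
    (1) added to the token representations (input injection), and
    (2) turned into a per-key bias on the attention logits (attention injection).
    This is an assumed reconstruction from the abstract, not the paper's code.
    """

    def __init__(self, hidden=768, heads=12, num_attrs=2):
        super().__init__()
        assert hidden % heads == 0
        self.heads, self.d_head = heads, hidden // heads
        self.q_proj = nn.Linear(hidden, hidden)
        self.k_proj = nn.Linear(hidden, hidden)
        self.v_proj = nn.Linear(hidden, hidden)
        self.o_proj = nn.Linear(hidden, hidden)
        # One projection per attribute for each injection point.
        self.attr_to_input = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_attrs)])
        self.attr_to_bias = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_attrs)])
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

    def forward(self, h, attrs):
        # h: (batch, seq, hidden); attrs: list of (batch, hidden) attribute embeddings
        b, n, d = h.shape

        # (1) Inject attributes into the input representation.
        for proj, a in zip(self.attr_to_input, attrs):
            h = h + proj(a).unsqueeze(1)

        def split(x):  # (b, n, hidden) -> (b, heads, n, d_head)
            return x.view(b, n, self.heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(h)), split(self.k_proj(h)), split(self.v_proj(h))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (b, heads, n, n)

        # (2) Inject attributes as a per-key bias on the attention logits,
        #     which reshapes the attention distribution after the softmax.
        for proj, a in zip(self.attr_to_bias, attrs):
            bias = (proj(a).view(b, self.heads, 1, self.d_head) * k).sum(-1)  # (b, heads, n)
            scores = scores + bias.unsqueeze(2)  # same bias for every query position

        attn = scores.softmax(dim=-1)
        out = self.o_proj((attn @ v).transpose(1, 2).reshape(b, n, d))
        h = self.norm1(h + out)
        return self.norm2(h + self.ffn(h))
```

In the setup the abstract describes, one or more such layers would sit on top of a pre-trained BERT encoder, with the attribute embeddings drawn from learned user/product lookup tables, and the whole stack would be fine-tuned on the sentiment task so that no additional pre-training is needed.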
