The Multilingual Amazon Reviews Corpus

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances', etc.) The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.

[1]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[2]  Guillaume Lample,et al.  XNLI: Evaluating Cross-lingual Sentence Representations , 2018, EMNLP.

[3]  Holger Schwenk,et al.  Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond , 2018, Transactions of the Association for Computational Linguistics.

[4]  Núria Bel,et al.  Cross-Lingual Text Categorization , 2003, ECDL.

[5]  Christopher Potts,et al.  A large annotated corpus for learning natural language inference , 2015, EMNLP.

[6]  Gerard de Melo,et al.  Multilingual Text Classification Using Ontologies , 2007, ECIR.

[7]  Yichao Lu,et al.  On the Evaluation of Contextual Embeddings for Zero-Shot Cross-Lingual Transfer Learning , 2020, EMNLP.

[8]  Holger Schwenk,et al.  A Corpus for Multilingual Document Classification in Eight Languages , 2018, LREC.

[9]  Yichao Lu,et al.  A neural interlingua for multilingual machine translation , 2018, WMT.

[10]  Jianmo Ni,et al.  Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects , 2019, EMNLP.

[11]  Yichao Lu,et al.  Adversarial Learning with Contextual Embeddings for Zero-resource Cross-lingual Classification and NER , 2019, EMNLP/IJCNLP.

[12]  Mark Dredze,et al.  Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT , 2019, EMNLP.

[13]  Mark Steedman,et al.  Example Selection for Bootstrapping Statistical Parsers , 2003, NAACL.

[14]  Veselin Stoyanov,et al.  Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.

[15]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[16]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[17]  Christopher Potts,et al.  Learning Word Vectors for Sentiment Analysis , 2011, ACL.

[18]  Benno Stein,et al.  Cross-Language Text Classification Using Structural Correspondence Learning , 2010, ACL.

[19]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[20]  Patrick Paroubek,et al.  Twitter as a Corpus for Sentiment Analysis and Opinion Mining , 2010, LREC.