AgriBERT: Knowledge-Infused Agricultural Language Models for Matching Food and Nutrition

Pretraining domain-specific language models remains a significant challenge, one that limits their applicability in areas such as agriculture. This paper investigates the effectiveness of leveraging food-related text corpora (e.g., food and agricultural literature) in pretraining transformer-based language models. We evaluate our trained language model, called AgriBERT, on the task of semantic matching, i.e., establishing a mapping between food descriptions and nutrition data, which is a long-standing challenge in the agricultural domain. In particular, we formulate the task as an answer selection problem, fine-tune the trained language model with the help of an external source of knowledge (e.g., the FoodOn ontology), and establish a baseline for this task. The experimental results reveal that our language model substantially outperforms other language models and baselines on the task of matching food descriptions to nutrition data.
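
To make the answer-selection formulation concrete, below is a minimal sketch, not the paper's actual implementation: a BERT-style cross-encoder scores each candidate nutrition record against a food description, and the highest-scoring candidate is selected as the match. The checkpoint name "bert-base-uncased" stands in for the AgriBERT weights, the food description and candidate records are invented examples, and the classification head here is untrained, so the scores are only meaningful after fine-tuning on labeled description/nutrition pairs.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; substitute the AgriBERT weights in practice.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

# Invented example data: one food description and candidate nutrition records.
# Knowledge infusion (e.g., appending FoodOn terms to the description) would
# operate at this input level in a setup like this; that is an assumption,
# not the paper's stated mechanism.
food_description = "cheddar cheese, sharp, shredded"
candidates = [
    "Cheese, cheddar: 403 kcal, 33.1 g fat, 24.9 g protein per 100 g",
    "Milk, whole: 61 kcal, 3.3 g fat, 3.2 g protein per 100 g",
    "Bread, white: 265 kcal, 3.2 g fat, 9.0 g protein per 100 g",
]

# Encode each (description, candidate) pair and score it with the model;
# logits[:, 1] is taken as the "match" class.
with torch.no_grad():
    inputs = tokenizer(
        [food_description] * len(candidates),
        candidates,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    scores = model(**inputs).logits.softmax(dim=-1)[:, 1]

best = scores.argmax().item()
print(f"Best match: {candidates[best]} (score={scores[best]:.3f})")

After fine-tuning, the same scoring loop doubles as the evaluation procedure: for each food description, rank all candidate nutrition records and take the top-scoring one.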
