Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification

This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when the number of annotations is limited by constraints on human resources. An additional challenge is handling binary categories that have few positive instances, reflecting severe class imbalance. In our setting, where annotation occurs over a long period, texts can be selected for annotation in batches, with previous annotations guiding the choice of the next batch. To address these challenges, the paper proposes leveraging SHAP to construct a high-quality set of queries for Elasticsearch and semantic search, with the aim of identifying sets of texts for annotation that help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers. We integrate vector search, semantic search, and machine learning classifiers to yield a good solution. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.
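
As an illustration only, one batch of the selection loop described above might be sketched as follows. This is not the authors' implementation. It assumes a multinomial naive Bayes log-odds model, for which SHAP values under feature independence have the closed form phi_i(x) = w_i * (x_i - E[x_i]), and it substitutes a small in-memory BM25 ranker for Elasticsearch. Semantic (embedding-based) search is omitted. The function names, parameters, and toy texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def shap_query_terms(texts, labels, top_n=3, alpha=1.0):
    """Pick query terms whose presence pushes annotated positives toward class 1.

    Assumption: the classifier is a multinomial naive Bayes log-odds model,
    which is linear in term counts, so the SHAP value of term i in text x is
    w_i * (x_i - mean(x_i)) when features are treated as independent.
    """
    docs = [Counter(tokenize(t)) for t in texts]
    vocab = sorted({term for d in docs for term in d})
    pos_counts, neg_counts = Counter(), Counter()
    for d, y in zip(docs, labels):
        (pos_counts if y == 1 else neg_counts).update(d)
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    pos_docs = [d for d, y in zip(docs, labels) if y == 1]

    scores = {}
    for term in vocab:
        # Smoothed log-odds weight of the term (positive favors class 1).
        w = (math.log((pos_counts[term] + alpha) / pos_total)
             - math.log((neg_counts[term] + alpha) / neg_total))
        mean_count = sum(d[term] for d in docs) / len(docs)
        # Average attribution over positives, counting only texts that contain
        # the term, because a query can only match terms that are present.
        scores[term] = sum(w * (d[term] - mean_count)
                           for d in pos_docs if d[term] > 0) / len(pos_docs)
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:top_n]


def bm25_rank(query_terms, pool, k1=1.2, b=0.75):
    """Rank pool indices by BM25 score (in-memory stand-in for Elasticsearch)."""
    docs = [tokenize(t) for t in pool]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


# Hypothetical toy data: label 1 marks a rare category (here, exercise cues).
annotated_texts = [
    "next month i will run a race with my sister",
    "on friday i am going to a concert with friends",
    "in two weeks i will visit my parents",
    "this weekend i will run in the park",
    "tomorrow i am going to a movie",
]
annotated_labels = [1, 0, 0, 1, 0]
unlabeled_pool = [
    "in august i will travel to italy",
    "in june i will run a marathon",
    "tonight i am cooking dinner for friends",
    "on sunday i plan to run with my dog",
]

batch_size = 2
query = shap_query_terms(annotated_texts, annotated_labels)
selected = bm25_rank(query, unlabeled_pool)[:batch_size]
print("query terms:", query)
print("next batch to annotate:", [unlabeled_pool[i] for i in selected])
```

In a batch setting, the newly annotated texts would be appended to the annotated set and the two steps repeated, so that each round's query reflects the labels gathered so far. With so little data the lower-ranked query terms are noisy, which is one reason a real pipeline would likely filter or inspect them before querying.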
