Data Augmentation Based on Adversarial Autoencoder Handling Imbalance for Learning to Rank

Data imbalance is a key limiting factor for Learning to Rank (LTR) models in information retrieval. Resampling and ensemble methods do not handle this imbalance well, since neither incorporates additional informative data into the training of LTR models. We propose a data generation model based on the Adversarial Autoencoder (AAE) that tackles data imbalance in LTR through informative data augmentation. The model handles two types of imbalance: imbalance among relevance levels within a particular query, and imbalance in the number of relevance judgements across queries. Relevance information is disentangled from the latent representations so that data with specific relevance levels can be reconstructed. The semantic information of queries, derived from word embeddings, is incorporated in the adversarial training stage to regularize the distribution of the latent representation. Using this data generation model, we design two informative data augmentation strategies suited to LTR. Experiments on benchmark LTR datasets demonstrate that the proposed framework can significantly improve the performance of LTR models.
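
To make the two types of imbalance concrete, the following minimal Python sketch counts relevance levels within each query and relevance judgements per query. It is an illustration only, not part of the proposed model: the function name `imbalance_summary`, the `(query_id, relevance_level)` input format, and the toy data are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def imbalance_summary(judgements):
    """Summarize both imbalance types (hypothetical helper).

    judgements: iterable of (query_id, relevance_level) pairs.
    Returns (level_counts, judgement_counts).
    """
    per_query_levels = defaultdict(Counter)
    for query_id, level in judgements:
        per_query_levels[query_id][level] += 1

    # Type 1: how relevance levels are distributed within each query.
    level_counts = {q: dict(c) for q, c in per_query_levels.items()}
    # Type 2: how many relevance judgements each query has.
    judgement_counts = {q: sum(c.values()) for q, c in per_query_levels.items()}
    return level_counts, judgement_counts


# Toy data (made up): q1 is dominated by level 0, and q2 has far
# fewer judgements than q1.
toy = [("q1", 0), ("q1", 0), ("q1", 0), ("q1", 2), ("q2", 1)]
level_counts, judgement_counts = imbalance_summary(toy)
```

On such a summary, a skewed entry in `level_counts` would indicate the first type of imbalance, and a small entry in `judgement_counts` the second. These are the cases where generated data for a chosen relevance level or query would be added.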