Using Negative Detectors for Identifying Adversarial Data Manipulation in Machine Learning

With the increasing adoption of Machine Learning (ML) in real-world applications, adversarial attacks are emerging to subvert ML-based decision support systems. Existing adversarial defenses appear to be ineffective against adaptive attacks because they depend heavily on knowledge of prior attacks and of the ML model architecture. To alleviate these challenges, we propose a negative filtering strategy that requires no adversarial knowledge and operates independently of the underlying ML model. The filtering strategy relies on salient features of clean (training) data and employs a complementary approach to cover the possible attack surface of an application. Our empirical experiments on different data sets demonstrate that the negative filters can effectively detect a wide range of adversarial inputs and update themselves to protect against adaptive attacks.
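The abstract does not specify how the negative filters are built; the sketch below is only a minimal, hypothetical illustration of a negative-selection style detector over clean-data features, in the complementary spirit described above. All names and parameters (train_negative_detectors, self_radius, detector_radius) are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical negative-selection sketch: generate "negative" detectors that do
# NOT match any clean (self) training sample; an input matched by a detector is
# flagged as potentially adversarial. Illustrative only.
import numpy as np

def train_negative_detectors(clean_features, n_detectors=50, self_radius=0.6, seed=None):
    """Sample detector points that lie farther than self_radius from every clean sample."""
    rng = np.random.default_rng(seed)
    # Sample candidates from a padded bounding box around the clean data so that
    # detectors can also cover regions just outside the clean feature range.
    margin = 2.0 * self_radius
    lo = clean_features.min(axis=0) - margin
    hi = clean_features.max(axis=0) + margin
    detectors = []
    while len(detectors) < n_detectors:
        candidate = rng.uniform(lo, hi)
        # Keep only candidates outside the self region (far from all clean samples).
        if np.min(np.linalg.norm(clean_features - candidate, axis=1)) > self_radius:
            detectors.append(candidate)
    return np.array(detectors)

def is_adversarial(x, detectors, detector_radius=0.6):
    """Flag the input if it falls within any negative detector's radius."""
    return bool(np.min(np.linalg.norm(detectors - x, axis=1)) <= detector_radius)

# Usage: clean 2-D features form a tight cluster; an out-of-distribution point is
# more likely to be covered by a negative detector (results are probabilistic).
clean = np.random.default_rng(0).normal(0.0, 0.3, size=(200, 2))
detectors = train_negative_detectors(clean, seed=0)
print(is_adversarial(np.array([0.0, 0.1]), detectors))   # clean-like -> typically False
print(is_adversarial(np.array([1.8, -1.8]), detectors))  # anomalous -> typically True
```

In this toy setup, "updating" the filter against adaptive attacks would amount to regenerating or augmenting the detector set as new clean data arrive; the paper's actual update mechanism is described in the body of the work.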