Fraud Detection with Multi-Modal Attention and Correspondence Learning

Deep learning based recognition systems have achieved high performance on a variety of tasks. Most, however, rely on a single modality, typically camera input alone, and are therefore vulnerable to look-alike fraud inputs. Such fraud inputs are especially likely to be abused when the system rewards its users, as in reverse vending machines. The joint use of multiple modalities can mitigate this problem, since each modality carries different information about the target task. In this work, we propose a deep neural network that combines multi-modal inputs with an attention mechanism and a correspondence learning scheme. The attention mechanism lets the network learn better feature representations across modalities, while correspondence learning captures inter-modal relationships, allowing the network to flag fraud inputs whose modalities do not agree with one another. We evaluate the proposed approach on a reverse vending machine system whose task is to classify inputs into three classes (cans, PET bottles, and glass bottles) and to reject any suspicious input, using three modalities: image, ultrasound, and weight. We show that the proposed model effectively learns to detect fraud inputs while maintaining high accuracy on the classification task.
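To make the described architecture concrete, the following is a minimal PyTorch sketch of the two ideas in the abstract: attention-weighted fusion of per-modality features, and a correspondence head that scores whether the modalities describe the same object. The paper does not publish its architecture, so all module names, feature dimensions, and input shapes below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: encoder sizes, input shapes, and the exact
# attention/correspondence formulation are assumptions, not the paper's design.
import torch
import torch.nn as nn

class MultiModalFraudNet(nn.Module):
    def __init__(self, num_classes=3, feat_dim=128):
        super().__init__()
        # One encoder per modality; a real image encoder would be a CNN backbone.
        self.image_enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, feat_dim), nn.ReLU())
        self.ultra_enc = nn.Sequential(nn.Linear(256, feat_dim), nn.ReLU())   # ultrasound echo vector
        self.weight_enc = nn.Sequential(nn.Linear(1, feat_dim), nn.ReLU())    # scalar weight reading
        # Attention: score each modality feature, then softmax over modalities.
        self.attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)                    # can / PET / glass
        # Correspondence head: predicts whether the three modalities describe
        # the same physical object (1) or a mismatched, fraud-like combination (0).
        self.correspondence = nn.Sequential(
            nn.Linear(feat_dim * 3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1))

    def forward(self, image, ultrasound, weight):
        feats = torch.stack([
            self.image_enc(image),
            self.ultra_enc(ultrasound),
            self.weight_enc(weight),
        ], dim=1)                                         # (B, 3, feat_dim)
        scores = torch.softmax(self.attn(feats), dim=1)   # (B, 3, 1) attention over modalities
        fused = (scores * feats).sum(dim=1)               # attention-weighted fusion, (B, feat_dim)
        class_logits = self.classifier(fused)             # classification among the 3 classes
        corr_logit = self.correspondence(feats.flatten(1))  # correspondence (match/mismatch) score
        return class_logits, corr_logit
```

One plausible way to train the correspondence head, in the spirit of correspondence learning, is to treat aligned (image, ultrasound, weight) tuples as positives and tuples with modalities shuffled across examples as negatives; at test time, a low correspondence score would trigger rejection of the input as suspicious.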
