Customs fraud detection

In this customs fraud detection application, we analyse a unique data set of 9,624,124 records resulting from a collaboration with the Belgian customs administration. The administration is faced with increasing levels of international trade, which puts pressure on regulatory control. Governments therefore rely on data mining to focus their limited resources on the most likely fraud cases. The literature on data mining for customs fraud detection falls short in two main directions, both of which are addressed in this paper. (1) Behavioural and high-cardinality data types are neglected due to a lack of methodology to include them. We demonstrate that such fine-grained features (e.g. the specific entities involved in a declaration, such as the consignee, consignor and declarant, and the commodities declared) are very predictive. (2) Studies in the tax domain most often apply standard learning algorithms to their fraud detection problems. However, customs data are highly imbalanced, which poses challenges for many inducers. We present a new EasyEnsemble method that integrates a support vector machine base learner in a confidence-rated boosting algorithm. This results in a fast and scalable learner that drastically improves predictive performance over the base application of a support vector machine. The results of our proposed framework reveal high AUC and lift values that translate into an immediate impact on the customs fraud detection domain through an improved retrieval of tax losses and enhanced deterrence.
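The two evaluation measures named in the abstract, AUC and lift, can be illustrated with a minimal sketch. It is not taken from the paper: the function names `auc` and `lift_at`, the toy labels and scores, and the 20% inspection fraction are all assumptions chosen for illustration. AUC is computed here through the rank-sum formulation, and lift as the fraud rate among the top-scored declarations divided by the overall fraud rate.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels are 0/1 (1 = fraud); scores are higher for more suspicious cases.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def lift_at(labels, scores, fraction):
    """Fraud rate in the top `fraction` of scored cases over the base fraud rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = max(1, int(round(fraction * len(ranked))))
    top_rate = sum(lbl for _, lbl in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical toy data: 2 fraudulent declarations among 10.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.35, 0.05, 0.15]

toy_auc = auc(toy_labels, toy_scores)
toy_lift = lift_at(toy_labels, toy_scores, 0.2)
```

On imbalanced data such as customs declarations, lift at a small inspection fraction is a natural complement to AUC, because it describes how much richer in fraud the inspected cases are than a random selection of the same size.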
