Value-added tax fraud detection with scalable anomaly detection techniques

Abstract. The tax fraud detection domain is characterized by very little labelled data (known fraudulent or legal cases), and the labels that do exist are not representative of the population because of sample selection bias. To deal with these domain issues, we use unsupervised anomaly detection (AD) techniques, which are uncommon in tax fraud detection research. We analyse a unique dataset containing the VAT declarations and client listings of all Belgian VAT numbers pertaining to ten sectors. Our methodology consists of applying AD methods to firms belonging to the same sector, and it enables an efficient auditing strategy that can be adopted by tax authorities worldwide. The high lifts and hit rates observed in most sectors demonstrate the success of this approach. Sectoral differences exist because market conditions and legal requirements vary across sectors, and we show that the optimal AD method is sector dependent. We focus on three methodological problems that the related literature leaves inadequately addressed. (1) Can we design suitable input features? We develop new fraud indicators from specific fields of the VAT form and client listings and show the predictive value of the combination of these features. (2) Can we design fast algorithms to deal with the large data sizes that can occur in the tax domain? New methods are developed, and we demonstrate their scalability both theoretically and empirically. (3) How should fraud detection performance be assessed? A new evaluation methodology is proposed that provides reliable performance indications and guarantees that fraud cases are effectively detected by the proposed methods.
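For readers unfamiliar with the two performance measures named in the abstract, the following minimal sketch shows how hit rate and lift are commonly computed for a ranked list of anomaly scores. It uses the generic textbook definitions (hit rate as the fraction of confirmed fraud cases among the top-k ranked firms, lift as that hit rate divided by the overall fraud rate). It is not the evaluation methodology proposed in the paper, and the function name, scores, and labels are hypothetical illustration values.

```python
from typing import Sequence, Tuple


def hit_rate_and_lift(
    scores: Sequence[float], is_fraud: Sequence[int], top_k: int
) -> Tuple[float, float]:
    """Generic top-k hit rate and lift (hypothetical helper, not the paper's method).

    scores   : anomaly scores, higher means more suspicious
    is_fraud : 1 for a known fraud case, 0 otherwise
    top_k    : number of highest-scoring firms selected for audit
    """
    if len(scores) != len(is_fraud):
        raise ValueError("scores and is_fraud must have the same length")
    if not 0 < top_k <= len(scores):
        raise ValueError("top_k must be between 1 and the number of firms")

    # Rank firm indices from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Hit rate: share of the audited (top-k) firms that are fraud cases.
    hits = sum(is_fraud[i] for i in ranked[:top_k])
    hit_rate = hits / top_k

    # Lift: hit rate relative to the fraud rate of random selection.
    base_rate = sum(is_fraud) / len(is_fraud)
    lift = hit_rate / base_rate if base_rate > 0 else float("nan")
    return hit_rate, lift


# Made-up example: ten firms, two of them known fraud cases.
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.05, 0.4, 0.15, 0.25]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
hit_rate, lift = hit_rate_and_lift(scores, labels, top_k=2)
print(hit_rate, lift)
```

Because the abstract stresses that labelled cases suffer from sample selection bias, the base rate in such a calculation should be read with care: it describes the labelled sample, not necessarily the full population of firms.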
