Classification system for mortgage arrear management

Owing to the economic recession of recent years, a growing number of mortgage customers default on their payments. This causes heavy losses for banks and forces their arrear management departments to develop more efficient processes. In this paper, we propose a classification system that predicts the outcome of a mortgage arrear. Each customer who delays a monthly mortgage payment is assigned one of two labels: a delayer, who will pay before the end of the month, or a defaulter, who will fail to do so. In this way, the arrear management department only needs to treat defaulters intensively. We use arrear history records obtained from the data warehouse of a Dutch bank. We apply basic classifiers, ensemble methods, and sampling techniques to this classification problem. The results show that sampling techniques and ensemble learning considerably improve the performance of basic classifiers. We choose balanced random forests to build the final classification system. The resulting system has already been deployed in the daily work of the bank's arrear management department, where it has led to large cost savings.
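To make the balanced random forest idea concrete, the following is a minimal sketch and not the system described in the paper. It assumes that defaulters are the minority class, uses one-split decision stumps as a stand-in for full trees, and runs on made-up toy records with two hypothetical features (days late, missed payments in the past year). Each stump is trained on a bootstrap sample that contains equally many delayers and defaulters, and the ensemble predicts by majority vote.

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng):
    """Return row indices for one balanced bootstrap sample.

    Draws, with replacement, as many rows from every class as the
    smallest class contains, so each learner sees an even class mix.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample

def fit_stump(X, y, rng):
    """Fit a one-split rule on a randomly chosen feature.

    Simplification: a real random forest grows full trees and picks a
    random feature subset at every split.
    """
    f = rng.randrange(len(X[0]))
    best = None
    for t in sorted({row[f] for row in X}):
        for above in (0, 1):  # label predicted when the feature exceeds t
            preds = [above if row[f] > t else 1 - above for row in X]
            hits = sum(p == yi for p, yi in zip(preds, y))
            if best is None or hits > best[0]:
                best = (hits, f, t, above)
    return best[1:]

def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] > t else 1 - above

def fit_balanced_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own balanced bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = balanced_bootstrap(y, rng)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict(forest, row):
    """Majority vote over all stumps (0 = delayer, 1 = defaulter)."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [days_late, missed_payments_last_year].
# 20 delayers (label 0) and 4 defaulters (label 1); values are invented.
X = [[d, 0] for d in range(1, 21)] + [[25, 2], [30, 3], [28, 2], [35, 3]]
y = [0] * 20 + [1] * 4

forest = fit_balanced_forest(X, y)
new_customer = [40, 3]
label = predict(forest, new_customer)
```

The balanced sampling step is what distinguishes this from an ordinary random forest: without it, a learner trained on the raw records could reach high accuracy by labelling almost everyone a delayer. In practice one would likely rely on an existing library implementation rather than this sketch.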
