The Univariate Flagging Algorithm (UFA): a Fully-Automated Approach for Identifying Optimal Thresholds in Data

In many data classification problems, there is no linear relationship between an explanatory variable and the dependent variable. Instead, there may be ranges of the input variable for which the observed outcome is significantly more or less likely. This paper describes an algorithm for automatic detection of such thresholds, called the Univariate Flagging Algorithm (UFA). The algorithm searches for a separation that optimizes the difference between the separated regions while providing the maximum support. We evaluate its performance using three examples and demonstrate that thresholds identified by the algorithm align well with visual inspection and subject matter expertise. We also introduce two classification approaches that use UFA and show that the performance attained on unseen test data is equal to or better than that of more traditional classifiers. We demonstrate that the proposed algorithm is robust against missing data and noise, is scalable, and is easy to interpret and visualize. It is also well suited for problems where incidence of the target is low.
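The abstract does not spell out UFA's exact scoring rule, so the following is only a minimal sketch of the general idea it describes: scan candidate thresholds of a single variable and pick the cut that best separates outcome rates while keeping large support. The function name `flag_threshold` and the score (outcome-rate difference weighted by the relative size of the smaller side) are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def flag_threshold(x, y):
    """Illustrative univariate threshold search (assumed scoring, not the paper's exact UFA).

    Scores each candidate cut by the absolute difference in outcome rates between
    the two sides, weighted by the support of the smaller side, and returns the
    best-scoring cut.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~np.isnan(x)          # tolerate missing values in the explanatory variable
    x, y = x[keep], y[keep]

    best_cut, best_score = None, -np.inf
    for cut in np.unique(x)[:-1]:            # candidate thresholds at observed values
        left, right = y[x <= cut], y[x > cut]
        diff = abs(left.mean() - right.mean())           # difference between separated regions
        support = min(len(left), len(right)) / len(y)    # support of the smaller region
        score = diff * support
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Toy example: the outcome becomes much more likely above x = 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = (rng.uniform(size=1000) < np.where(x > 5, 0.6, 0.1)).astype(int)
print(flag_threshold(x, y))
```

Under these assumptions the printed cut lands near 5, illustrating how a threshold can be recovered automatically from a single variable without assuming a linear relationship.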
