HIGHLY ROBUST METHODS IN DATA MINING

This paper is devoted to highly robust methods for information extraction from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to outlying measurements in the observed data is discussed as their major drawback. The paper proposes several new highly robust methods for data mining, which are based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, a popular unsupervised learning method in data mining. Further, a robust method for estimating the parameters of logistic regression is proposed, and this idea is extended to robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to noise and outlying measurements in the data is discussed, and a method for robust training of neural networks for the task of function approximation is proposed; it has the form of a robust estimator in nonlinear regression.
