Classification and Data Mining

This paper presents a robust procedure for detecting atypical observations in random effects models and for analysing their effect on model inference. Because observations can be outlying at different levels of the analysis, we evaluate the effect of both first- and second-level outliers, focusing on their effect on the higher-level variance, whose significance is assessed with the likelihood-ratio test (LRT). A cut-off point separating the outliers from the other observations is identified through a graphical analysis of the information collected at each step of the Forward Search procedure; the Robust Forward LRT is the value of the classical LRT statistic at the cut-off point.
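
As a sketch only, the quantities named above can be written out for the simplest case, a two-level random intercept model. The abstract does not state the model form, so the model, the subset notation $S^{(m)}$, the step index $m$, the cut-off $m^{*}$, and the label $\mathrm{LRT}_{\mathrm{RF}}$ are all notation assumed here for illustration.

```latex
% Assumed two-level random intercept model (unit i within group j)
y_{ij} = x_{ij}^{\top}\beta + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma_e^2)

% Test on the higher-level variance
H_0 : \sigma_u^2 = 0 \quad \text{versus} \quad H_1 : \sigma_u^2 > 0

% Classical LRT statistic computed on the subset S^{(m)} used at
% step m of the Forward Search; \ell_0 and \ell_1 are the maximised
% log-likelihoods under H_0 and H_1
\mathrm{LRT}(m) = -2\left[\,\ell_0\bigl(S^{(m)}\bigr) - \ell_1\bigl(S^{(m)}\bigr)\right]

% Robust Forward LRT: the classical statistic at the cut-off step m^*
\mathrm{LRT}_{\mathrm{RF}} = \mathrm{LRT}(m^{*})
```

Under this reading, plotting $\mathrm{LRT}(m)$ against $m$ is one form the graphical analysis could take: a sharp change in the trajectory as the last observations enter the subset would suggest where $m^{*}$ lies.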
