Symbolic Data Analysis: another look at the interaction of Data Mining and Statistics

Symbolic Data Analysis (SDA) provides a framework for representing and analysing data that incorporates inherent variability. In Data Mining and classical Statistics, the data to be analysed usually present a single value for each variable. That is no longer the case when the entities under analysis are not single elements but groups gathered on the basis of some given criteria: for each variable, the variability inherent to each group should then be taken into account. Likewise, when analysing concepts such as botanical species, disease descriptions, or car models, the data entail intrinsic variability, which should be explicitly considered. To this end, new variable types have been introduced whose realizations are not single real values or categories but sets, intervals, or, more generally, distributions over a given domain. SDA provides methods for the (multivariate) analysis of such data that take into account, through various approaches, the variability expressed in the data representation.
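As an illustration of these variable types, the following minimal sketch aggregates individual records into group-level symbolic descriptions. It is not taken from the paper: the toy records, the function name `aggregate`, and the field names are assumptions made for the example, and the min-max interval is only one of several possible ways to summarise a numerical variable within a group.

```python
from collections import Counter, defaultdict

# Hypothetical microdata: one row per individual car (model, price, fuel).
records = [
    ("A", 18.0, "petrol"),
    ("A", 22.0, "diesel"),
    ("A", 20.0, "petrol"),
    ("B", 30.0, "diesel"),
    ("B", 34.0, "diesel"),
]


def aggregate(rows):
    """Describe each group (car model) by symbolic values instead of single values."""
    groups = defaultdict(list)
    for model, price, fuel in rows:
        groups[model].append((price, fuel))

    symbolic = {}
    for model, members in groups.items():
        prices = [price for price, _ in members]
        fuels = [fuel for _, fuel in members]
        counts = Counter(fuels)
        symbolic[model] = {
            # Interval-valued variable: range of observed prices in the group.
            "price_interval": (min(prices), max(prices)),
            # Set-valued (multi-valued) variable: categories observed in the group.
            "fuel_set": set(fuels),
            # Distribution-valued variable: relative frequency of each category.
            "fuel_distribution": {k: c / len(fuels) for k, c in counts.items()},
        }
    return symbolic


symbolic_table = aggregate(records)
for model, description in symbolic_table.items():
    print(model, description)
```

Each car model is then described by an interval, a set, and a distribution rather than by one value per variable, which is the kind of representation SDA methods take as input.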
