Semantically-grounded construction of centroids for datasets with textual attributes

Centroids are key components of many data analysis algorithms, such as clustering or microaggregation. A centroid is the central value that minimises the distance to all the objects in a dataset or cluster. Methods for centroid construction are mainly devoted to datasets with numerical and categorical attributes, and focus on the numerical and distributional properties of data. Textual attributes, in contrast, consist of term lists referring to concepts with a specific semantic content (i.e., meaning), which cannot be evaluated by means of classical numerical operators. Hence, the centroid of a dataset with textual attributes should be the term that minimises the semantic distance to the members of the set. Semantically-grounded methods for constructing centroids for datasets with textual attributes are scarce and, as will be discussed in this paper, they are hampered by their limited semantic analysis of data. In this paper, we propose a method that, exploiting the knowledge provided by background ontologies (like WordNet), is able to construct the centroid of multivariate datasets described by means of textual attributes. Special effort has been put into minimising the semantic distance between the centroid and the input data. As a result, our method is able to provide optimal centroids (i.e., those that minimise the distance to all the objects in the dataset) according to the exploited background ontology and a semantic similarity measure. Our proposal has been evaluated by means of a real dataset consisting of short textual answers provided by visitors to a natural park. Results show that our centroids retain the semantic content of the input data better than those of related works.
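
To make the notion of a semantic centroid concrete, the following minimal sketch picks, for a single textual attribute, the concept that minimises the summed semantic distance to the input values. It is an illustration under stated assumptions and not the method proposed in the paper: the toy taxonomy `PARENT` is hypothetical and merely stands in for an ontology such as WordNet, the edge-counting `path_distance` is one simple choice of semantic measure, and searching over every concept in the taxonomy is an assumed candidate set.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent), standing in for a real ontology.
PARENT = {
    "entity": None,
    "activity": "entity",
    "sport": "activity",
    "hiking": "sport",
    "cycling": "sport",
    "leisure": "activity",
    "photography": "leisure",
}


def ancestors(term):
    """Return the path from `term` up to the root, `term` included."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def path_distance(a, b):
    """Count the taxonomic edges between a and b via their lowest common ancestor."""
    steps_from_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_from_a, t in enumerate(ancestors(a)):
        if t in steps_from_b:
            return steps_from_a + steps_from_b[t]
    raise ValueError("terms share no common ancestor")


def centroid(values):
    """Return the taxonomy concept with the smallest summed distance to `values`."""
    counts = Counter(values)  # repeated values weigh proportionally more

    def cost(candidate):
        return sum(n * path_distance(candidate, v) for v, n in counts.items())

    # Assumption: every concept in the taxonomy is a centroid candidate.
    return min(PARENT, key=cost)


print(centroid(["hiking", "cycling", "photography"]))
```

In this toy example the selected concept is `sport`, which does not appear among the input values: drawing candidates from the ontology rather than only from the data can yield a lower total distance than any of the observed terms.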
