A semantic framework for noise addition with nominal data

Noise addition is a data distortion technique widely used in data-intensive applications. In machine learning, for example, it helps reduce overfitting, whereas in data privacy protection it adds uncertainty to personally identifiable information. Because of its mathematical operating principle, however, noise addition is mainly intended for continuous numerical data. Despite the large amount of nominal data currently being compiled and used in data analysis, only a few alternative techniques have been proposed to distort nominal data in a way similar to what standard noise addition does for numerical data. Furthermore, all of these alternative methods rely on the distribution of the data rather than on the semantics of the nominal values, which negatively affects the utility of the distorted outcomes. To tackle this issue, in this paper we present a semantically grounded alternative to numerical noise that is suitable for nominal data, which we name semantic noise. By means of semantic noise, and by exploiting structured knowledge sources such as ontologies, we are able to distort nominal data while better preserving their semantics and, thus, their analytical utility. To that end, we provide semantically and mathematically coherent versions of the statistical operators required in the noise addition process, which include the difference, the mean, the variance and the covariance. We then propose semantic noise addition algorithms that cope with the finite, discrete and non-ordinal nature of nominal data. The proposed algorithms cover both uncorrelated noise addition, which is suited to independent attributes, and correlated noise addition, which can cope with multivariate datasets with dependent attributes. Empirical results show that our proposals offer general and configurable mechanisms to distort nominal data while preserving data semantics better than baseline methods based only on the distribution of the data.
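
To make the idea of semantic operators more concrete, the following is a minimal Python sketch, not the definitions or algorithms of the paper. It assumes a hypothetical toy taxonomy (`PARENT`), a simple edge-counting distance in place of the semantic difference, a mean taken as the domain value that minimizes the sum of squared distances, and, for uncorrelated noise, a replacement drawn with a probability that decays exponentially with semantic distance. All names, the taxonomy and the `scale` parameter are illustrative assumptions.

```python
import math
import random

# Hypothetical toy taxonomy (child -> parent), standing in for a real ontology.
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "cold": "infection",
    "melanoma": "cancer",
    "leukemia": "cancer",
}
LEAVES = ["flu", "cold", "melanoma", "leukemia"]  # assumed attribute domain


def ancestors(concept):
    """Return the set holding the concept and all of its taxonomic ancestors."""
    found = set()
    while concept is not None:
        found.add(concept)
        concept = PARENT[concept]
    return found


def distance(a, b):
    """Edge-counting distance: number of edges on the tree path from a to b."""
    # In a tree, the concepts not shared by both ancestor sets lie exactly on the path.
    return len(ancestors(a) ^ ancestors(b))


def semantic_mean(values, domain=LEAVES):
    """Assumed mean: the domain value minimizing the sum of squared distances."""
    return min(domain, key=lambda c: sum(distance(c, v) ** 2 for v in values))


def semantic_variance(values, domain=LEAVES):
    """Assumed variance: average squared distance to the semantic mean."""
    mean = semantic_mean(values, domain)
    return sum(distance(mean, v) ** 2 for v in values) / len(values)


def add_semantic_noise(value, scale, rng, domain=LEAVES):
    """Illustrative noise: draw a replacement with weight exp(-distance / scale)."""
    weights = [math.exp(-distance(value, c) / scale) for c in domain]
    return rng.choices(domain, weights=weights, k=1)[0]


if __name__ == "__main__":
    sample = ["flu", "flu", "cold"]
    rng = random.Random(42)
    print(semantic_mean(sample), semantic_variance(sample))
    print([add_semantic_noise(v, scale=2.0, rng=rng) for v in sample])
```

In this sketch a larger `scale` makes semantically distant replacements more likely, which plays a role loosely analogous to the noise magnitude in numerical noise addition. Because the replacement is always drawn from the attribute domain, the distorted value remains a valid nominal value.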
