Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data

To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. Different measures can be computed from these data; inter-concept similarity in particular is an important variable in experimental studies. Among the possible similarity measures, the cosine of conceptual property frequency vectors appears to be the de facto standard. However, comparative studies that test the merits of different similarity measures computed from property frequency data are lacking. The current work compares four measures (cosine, correlation, Euclidean, and Chebyshev) across five types of data structures. To that end, we compare the informational content (i.e., entropy) delivered by each of the resulting 4 × 5 = 20 combinations, and use a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those computed from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality and of their consequences for statistical analyses.
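As a rough illustration of the comparison described above, the following sketch computes the four measures on a small, made-up concept-by-property frequency matrix and then estimates the Shannon entropy of each measure's pairwise values. The toy data, the function names, and the histogram-based entropy estimate with a fixed number of bins are all assumptions made for this example; the abstract does not specify how entropy was estimated, so this should not be read as the authors' procedure.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlation(u, v):
    """Pearson correlation, i.e. the cosine of the mean-centered vectors.

    Undefined (division by zero) for a constant vector.
    """
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def euclidean(u, v):
    """Euclidean (L2) distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    """Chebyshev (L-infinity) distance: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def shannon_entropy(values, bins=10):
    """Entropy in bits of a histogram over `values` (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: no information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical property frequency vectors (rows: concepts, columns: properties).
norms = {
    "dog": [20, 15, 0, 0, 3],
    "cat": [18, 12, 0, 1, 2],
    "car": [0, 0, 22, 17, 0],
    "bus": [0, 0, 19, 20, 1],
}

measures = {
    "cosine": cosine,
    "correlation": correlation,
    "euclidean": euclidean,
    "chebyshev": chebyshev,
}

for name, measure in measures.items():
    pairwise = [measure(norms[a], norms[b]) for a, b in combinations(norms, 2)]
    print(f"{name}: entropy = {shannon_entropy(pairwise, bins=4):.3f} bits")
```

In this sketch, a measure whose pairwise values pile up in a few histogram bins yields low entropy, which is one simple way to picture a measure carrying little information for discriminating between concept pairs. Note that Euclidean and Chebyshev are distances, so larger values indicate lower similarity.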
