Random Indexing Distributional Semantic Models for Croatian Language

Distributional semantic models (DSMs) model semantic relations between expressions by comparing the contexts in which these expressions occur. This paper presents an extensive evaluation of distributional semantic models for the Croatian language. We focus on random indexing models, an efficient and scalable approach to building DSMs. We build a number of models that differ in three parameters (dimension, context type, and similarity measure) and compare them against human semantic similarity judgments. Our results indicate that even low-dimensional random indexing models may outperform raw frequency models, and that, of the three parameters, the choice of similarity measure matters most.
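To make the random indexing idea concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a symmetric word-window context and cosine similarity; the function names (index_vector, build_model, cosine), the parameter values (dim=300, nonzero=6, window=2), and the toy corpus are all hypothetical choices made for illustration. Each word receives a sparse random index vector, and a word's context vector is the sum of the index vectors of the words that occur near it.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: `nonzero` entries set to +1 or -1, the rest 0."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1  # alternate signs across chosen positions
    return vec


def build_model(sentences, dim=300, nonzero=6, window=2, seed=0):
    """Accumulate context vectors from a tokenized corpus (illustrative parameters)."""
    rng = random.Random(seed)
    index = {}                              # word -> fixed random index vector
    context = defaultdict(lambda: [0] * dim)  # word -> accumulated context vector
    for tokens in sentences:
        for word in tokens:
            if word not in index:
                index[word] = index_vector(dim, nonzero, rng)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = index[tokens[j]]
                target = context[word]
                for d in range(dim):
                    target[d] += neighbour[d]
    return dict(context)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical): "pas" and "mačka" occur in identical contexts.
corpus = [
    ["pas", "jede", "meso"],
    ["mačka", "jede", "meso"],
    ["auto", "vozi", "cestom"],
]
model = build_model(corpus, dim=300, nonzero=6, window=2)
sim_close = cosine(model["pas"], model["mačka"])
sim_far = cosine(model["pas"], model["auto"])
```

In this sketch the dimension stays fixed at dim regardless of vocabulary size, which is the property that makes random indexing scale. Swapping cosine for another measure, or the window for a different context type, corresponds to varying the parameters the abstract lists.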
