Building and Evaluating a Distributional Memory for Croatian

We report on the first structured distributional semantic model for Croatian, DM.HR. It is modeled on the English Distributional Memory (Baroni and Lenci, 2010), built from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We describe the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task, with particularly good performance on nouns. The resource is freely available.

[1]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[2]  Tomaz Erjavec,et al.  hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene , 2011, TSD.

[3]  Preslav Nakov,et al.  Latent Semantic Analysis for Russian Literature Investigation , 2001 .

[4]  Zeljko Agic,et al.  K-Best Spanning Tree Dependency Parsing With Verb Valency Lexicon Reranking , 2012, COLING.

[5]  Stan Szpakowicz,et al.  Corpus-based Semantic Relatedness for the Construction of Polish WordNet , 2008, LREC.

[6]  Alessandro Lenci,et al.  Distributional Memory: A General Framework for Corpus-Based Semantics , 2010, CL.

[7]  Pavel Smrz,et al.  Finding Semantically Related Words in Large Corpora , 2001, TSD.

[8]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[9]  Sebastian Padó,et al.  A distributional memory for German , 2012, KONVENS.

[10]  Graeme Hirst,et al.  Cross-Lingual Distributional Profiles of Concepts for Measuring Semantic Distance , 2007, EMNLP.

[11]  Katrin Erk,et al.  A Flexible, Corpus-Driven Model of Regular and Inverse Selectional Preferences , 2010, CL.

[12]  Jan Šnajder,et al.  Distributional Semantics Approach to Detecting Synonyms in Croatian Language , 2012 .

[13]  Damir Boras,et al.  Comparing measures of semantic similarity , 2008, ITI 2008 - 30th International Conference on Information Technology Interfaces.

[14]  Tomaz Erjavec,et al.  MULTEXT-East: morphosyntactic resources for Central and Eastern European languages , 2011, Language Resources and Evaluation.

[15]  Zeljko Agic,et al.  Improving Part-of-Speech Tagging Accuracy for Croatian by Morphological Analysis , 2008, Informatica.

[16]  András Kornai,et al.  HunPos: an open source trigram tagger , 2007, ACL 2007.

[17]  Marko Tadić.  Croatian Lemmatization Server , 2005 .

[18]  Eiríkur Rögnvaldsson,et al.  A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI) , 2008, GoTAL.

[19]  Nikola Ljubesic,et al.  Lemmatization and Morphosyntactic Tagging of Croatian and Serbian , 2013, BSNLP@ACL.

[20]  Noah A. Smith,et al.  Dependency Parsing , 2009, Encyclopedia of Artificial Intelligence.

[21]  Maciej Piasecki,et al.  Automated Extraction of Lexical Meanings from Corpus : A Case Study of Potentialities and Limitations , 2009 .

[22]  Zeljko Agic,et al.  Three Syntactic Formalisms for Data-Driven Dependency Parsing of Croatian , 2013, TSD.

[23]  Polina Panicheva,et al.  Automatic Word Clustering in Russian Texts , 2007, TSD.

[24]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[25]  Zdravko Dovedan,et al.  Evaluating Full Lemmatization of Croatian Texts , 2009 .

[26]  Fernando Pereira,et al.  Multilingual Dependency Analysis with a Two-Stage Discriminative Parser , 2006, CoNLL.

[27]  Maciej Piasecki,et al.  SuperMatrix: a General tool for lexical semantic knowledge acquisition , 2008, 2008 International Multiconference on Computer Science and Information Technology.

[28]  Jan Šnajder,et al.  Random Indexing Distributional Semantic Models for Croatian Language , 2011, TSD.