Paper information - Utilizing Vector Models for Automatic Text Lemmatization

In this paper we tackle the problem of lemmatization of inflectional languages. We introduce a new algorithm which utilizes vector models of words. Current approaches in this area are limited to knowing either full grammar rules or the translation matrix between the word and its basic form. However, this information is encoded in natural text. Our solution uses text corpora to build vector models of words and a small amount of user input to infer lemmas. We have evaluated our approach on the Slovak language and present interesting findings on its feasibility for real-world utilization.
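The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way word vectors plus a few user-supplied examples could suggest a lemma. It uses the well-known vector-offset (word analogy) idea as a stand-in, not the authors' method. The vectors in `TOY_VECTORS` are hand-made for illustration, and the names `infer_lemma` and `reference_pairs` are hypothetical; a real setup would load vectors trained on a text corpus.

```python
import numpy as np

# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
# The first three dimensions encode word identity; the last marks the locative form.
TOY_VECTORS = {
    "mesto": np.array([1.0, 0.0, 0.0, 0.0]),  # lemma
    "meste": np.array([1.0, 0.0, 0.0, 1.0]),  # inflected form
    "auto":  np.array([0.0, 1.0, 0.0, 0.0]),  # lemma
    "aute":  np.array([0.0, 1.0, 0.0, 1.0]),  # inflected form
    "okno":  np.array([0.0, 0.0, 1.0, 0.0]),  # lemma
    "okne":  np.array([0.0, 0.0, 1.0, 1.0]),  # inflected form
}


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def infer_lemma(word, reference_pairs, vectors):
    """Guess the lemma of `word` from (inflected, lemma) example pairs.

    The average offset lemma - inflected over the examples is added to the
    vector of `word`; the vocabulary word closest to that point (by cosine
    similarity, excluding `word` itself) is returned.
    """
    offsets = [vectors[lemma] - vectors[inflected]
               for inflected, lemma in reference_pairs]
    target = vectors[word] + np.mean(offsets, axis=0)
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# The "small amount of user input": one known inflected form and its lemma.
reference_pairs = [("meste", "mesto")]
print(infer_lemma("okne", reference_pairs, TOY_VECTORS))  # prints "okno"
```

With these toy vectors, a single reference pair is enough to recover the lemma of another word in the same grammatical form. Real embeddings are far noisier, so more reference pairs and some filtering of the candidates would likely be needed.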

Marián Šimko | Ladislav Gallay
