Identifying false friends between closely related languages

In this paper we present a corpus-based approach to the automatic identification of false friends for Slovene and Croatian, a pair of closely related languages. Taking advantage of the lexical overlap between the two languages, we measure the difference in meaning between identically spelled words using frequency and distributional information. We analyze the impact of corpora of different origin and size, together with different association and similarity measures, and compare the results to a simple frequency-based baseline. With the best-performing setting we obtain an average precision of 0.973 and 0.883 on different gold standards. The presented approach works on non-parallel datasets and is knowledge-lean and language-independent, which makes it attractive for natural language processing tasks that often lack the necessary lexical resources and cannot afford to build them by hand.
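The general idea of comparing identically spelled words through their contexts can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the window size, the use of raw co-occurrence counts with cosine similarity, the toy token sequences, and all function names are assumptions made for illustration only. The sketch relies on the lexical overlap mentioned above, so context words from the two corpora are matched simply by spelling. Candidates are ranked from least to most similar, and the ranking is scored with average precision against a gold set of false friends.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count words co-occurring with `target` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(corpus_a, corpus_b, words, window=2):
    """Rank identically spelled words, least similar contexts first."""
    scored = [
        (cosine(context_vector(corpus_a, w, window),
                context_vector(corpus_b, w, window)), w)
        for w in words
    ]
    return [w for _, w in sorted(scored)]


def average_precision(ranked, gold):
    """Mean of precision values at the ranks where a gold item is found."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data (hypothetical placeholder tokens, not real corpus text):
# "x" appears in the same contexts in both corpora, "y" does not.
corpus_a = "a b x c d a b y c d".split()
corpus_b = "a b x c d e f y g h".split()

ranked = rank_candidates(corpus_a, corpus_b, ["x", "y"])
score = average_precision(ranked, gold={"y"})
```

In this toy setting the word whose contexts differ across the two corpora is ranked first. A real setup would replace the raw counts with an association measure and use large monolingual corpora for each language.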
