Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance or n-grams. The Jaccard coefficient has traditionally been used as the s-gram based string proximity measure, and other proximity measures for s-gram matching have not been tested. In the current study, the performance of seven proximity measures for classified s-grams was evaluated in a CLIR context using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with the s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, though overall performance deteriorated.
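As a rough illustration of the kind of matching discussed above, the following minimal sketch builds s-grams (character pairs separated by a fixed number of skipped characters) and compares two words with the Jaccard coefficient in a binary (set) and a non-binary (multiset) form. It is not the study's implementation: the function names, the `_` padding character, the padding length, and the count-based reading of "non-binary" are all assumptions made for illustration, and the grouping of skip lengths into classes is omitted.

```python
from collections import Counter


def s_grams(word, skip, pad=True):
    """Return character pairs whose members are `skip` characters apart.

    skip=0 yields ordinary adjacent digrams.
    """
    gap = skip + 1
    if pad:
        # Assumed padding scheme: skip + 1 filler characters on each side.
        filler = "_" * gap
        word = filler + word + filler
    return [word[i] + word[i + gap] for i in range(len(word) - gap)]


def jaccard_binary(a, b):
    """Jaccard coefficient over sets: each distinct s-gram counts once."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def jaccard_multiset(a, b):
    """Assumed non-binary variant: repeated s-grams count by frequency."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0


if __name__ == "__main__":
    # Compare a pair of spelling variants with and without padding.
    for pad in (True, False):
        x = s_grams("colour", 1, pad=pad)
        y = s_grams("color", 1, pad=pad)
        print(pad, jaccard_binary(x, y), jaccard_multiset(x, y))
```

In this sketch, padding adds s-grams that anchor the first and last characters of a word, which is one way the choice of padding can change the scores the two variants produce.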
