Paper information - Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Classified s-grams have been used successfully in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance and n-grams. The Jaccard coefficient has traditionally been used as the s-gram based string proximity measure, but other proximity measures for s-gram matching have not been tested. The current study evaluated the performance of seven proximity measures for classified s-grams in a CLIR context using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with the s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, although overall performance deteriorated.
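To make the terms in the abstract concrete, the following is a minimal Python sketch of s-gram matching with the Jaccard coefficient. It is an illustration under stated assumptions, not the authors' implementation: the function names (`s_grams`, `binary_jaccard`, `count_jaccard`), the single padding character on each side, and the count-based variant are all hypothetical choices made here, and the sketch omits the grouping of s-grams into classes by skip distance. The count-based variant is one common non-binary generalization of the Jaccard coefficient and is not necessarily among the seven measures the study tested.

```python
from collections import Counter


def s_grams(word, skip=1, pad=True, pad_char="_"):
    """Return the list of character pairs that have `skip` characters between them.

    skip=0 yields ordinary adjacent digrams. Padding with one `pad_char` on each
    side is an assumption of this sketch, not the scheme used in the paper.
    """
    if pad:
        word = pad_char + word + pad_char
    gap = skip + 1
    return [word[i] + word[i + gap] for i in range(len(word) - gap)]


def binary_jaccard(x, y, skip=1, pad=True):
    """Jaccard coefficient over the sets of s-grams (presence or absence only)."""
    a = set(s_grams(x, skip, pad))
    b = set(s_grams(y, skip, pad))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_jaccard(x, y, skip=1, pad=True):
    """Count-based (non-binary) variant: sum of minimum counts over sum of maximum counts."""
    ca = Counter(s_grams(x, skip, pad))
    cb = Counter(s_grams(y, skip, pad))
    union = sum((ca | cb).values())  # Counter union keeps the maximum count per s-gram
    inter = sum((ca & cb).values())  # Counter intersection keeps the minimum count
    return inter / union if union else 0.0
```

In this sketch the binary measure ignores how often an s-gram repeats within a word, while the count-based measure does not, so the two can disagree on words with repeated character pairs. Toggling `pad` changes which s-grams the word boundaries contribute.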

Anni Järvelin | Antti Järvelin
