A Computational Phonetic Model for Indian Language Scripts

Although South Asia is one of the richest regions in terms of linguistic diversity, South Asian languages have a lot in common. For example, most of the major Indian languages use scripts that are derived from the ancient Brahmi script, have more or less the same arrangement of the alphabet, are highly phonetic in nature, and are very well organised. We have used these shared properties to build a computational phonetic model of Brahmi-origin scripts. The phonetic model mainly consists of a model of phonology (including some orthographic features) based on a common alphabet of these scripts, numerical values assigned to these features, a stepped distance function (SDF), and an algorithm for aligning strings of feature vectors. The SDF is used to calculate the phonetic and orthographic similarity of two letters. The model can be used for applications such as spell checking, predicting spelling/dialectal variation, text normalization, finding rhyming words, and identifying cognate words across languages. Some initial experiments have been carried out, and the results seem encouraging.
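To make the two computational pieces concrete, the following is a minimal sketch of how a stepped distance over letter features and an alignment of feature-vector strings could fit together. Everything in it is an assumption for illustration: the feature names, the letters in `LETTERS`, the step order and costs in `STEPS`, and `GAP_COST` are hypothetical and are not the feature set or numerical values of the model described above. The alignment shown is an ordinary edit-distance style dynamic program, which may differ from the alignment algorithm actually used.

```python
from typing import Dict, List, Tuple

Features = Dict[str, str]

# Hypothetical feature table (illustrative only, not the model's real features).
LETTERS: Dict[str, Features] = {
    "ka":  {"type": "consonant", "place": "velar",   "voiced": "no",  "aspirated": "no"},
    "kha": {"type": "consonant", "place": "velar",   "voiced": "no",  "aspirated": "yes"},
    "ga":  {"type": "consonant", "place": "velar",   "voiced": "yes", "aspirated": "no"},
    "pa":  {"type": "consonant", "place": "labial",  "voiced": "no",  "aspirated": "no"},
    "a":   {"type": "vowel",     "place": "central", "voiced": "yes", "aspirated": "no"},
}

# Assumed steps: features are checked in this order, and the first feature
# on which two letters differ decides the distance.
STEPS: List[Tuple[str, float]] = [
    ("type", 1.0),
    ("place", 0.6),
    ("voiced", 0.4),
    ("aspirated", 0.2),
]

GAP_COST = 1.0  # assumed cost of inserting or deleting one letter


def stepped_distance(a: Features, b: Features) -> float:
    """Return the cost of the first step at which the two letters differ."""
    for feature, cost in STEPS:
        if a[feature] != b[feature]:
            return cost
    return 0.0  # identical on every feature checked


def align_cost(xs: List[Features], ys: List[Features]) -> float:
    """Edit-distance style alignment cost using stepped_distance for substitutions."""
    rows, cols = len(xs) + 1, len(ys) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * GAP_COST
    for j in range(1, cols):
        table[0][j] = j * GAP_COST
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + GAP_COST,      # delete a letter from xs
                table[i][j - 1] + GAP_COST,      # insert a letter from ys
                table[i - 1][j - 1] + stepped_distance(xs[i - 1], ys[j - 1]),
            )
    return table[-1][-1]


def word(names: List[str]) -> List[Features]:
    """Turn a list of letter names into a string of feature vectors."""
    return [LETTERS[name] for name in names]
```

Under these assumed costs, two words that differ only in aspiration (for example `word(["ka", "a"])` and `word(["kha", "a"])`) come out much closer than two words that differ in place of articulation, which is the kind of graded similarity that spell checking or cognate identification would rely on.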
