Low-level Devanāgar̄ı Support for Omega — Adapting devnag
Yannis Haralambous and John Plaice

TUGboat, Volume 23 (2002), No. 1 — Proceedings of the 2002 Annual Meeting
This paper presents tools (OTPs and macros) for typesetting languages that use the Devanāgarī script (Hindi, Sanskrit, Marathi). The tools are based on the Omega typesetting system and use fonts from devnag, a package developed by Frans Velthuis in 1991. We describe the new OTPs in detail, to give the reader insight into Omega techniques and to allow him/her to adapt these tools to his/her own environment (input method, font), and even to other Indic languages.

[A Hindi version of the abstract, set in Devanāgarī, follows in the original; it is omitted here because the text was garbled in extraction.]

Introduction

One of the first Indic language support packages for TeX was devnag, developed by Frans Velthuis in 1991.[1] At that time it was necessary to use a preprocessor to convert Hindi or Sanskrit text written in a human-legible form into data legible to TeX. This preprocessor allowed the use of an ASCII transcription, and it performed the contextual analysis inherent to the Devanāgarī script as well as pre-hyphenation (the explicit insertion of hyphenation points). The preprocessor was necessary for two main reasons:

1. A Sanskrit font contains over 300 glyphs once ligatures are taken into account.
2. The TFM and VF languages are not powerful enough to build all the necessary glyphs out of a font of 256 characters.

[1] A second system for processing Devanāgarī was created by Charles Wikner. It has important features lacking in Velthuis's devnag system, but unlike the latter it does not address the setting of Hindi text. The general design of the system (Metafont plus preprocessor) is identical to that of Velthuis's.
Using a preprocessor has many disadvantages, due mainly to the fact that it has to read not plain text but LaTeX code: it must avoid treating command and environment names as Devanāgarī text. The preprocessor therefore has to be clever enough to distinguish text from commands, i.e., content from markup. It is well known that, in the case of TeX, this is practically impossible unless the preprocessor is TeX itself (there is a notorious saying: "only TeX can read TeX").

So much for computing in the 20th century. Nowadays we have other means of processing information, and the concept of an (external) preprocessor is obsolete. In fact, the same operations are done inside Omega, a successor of TeX. Processing text internally has the crucial advantage of allowing the processor to distinguish precisely what is content and what is markup (at least as precisely as TeX itself does). This makes it much easier to handle properties inherent to writing systems: one only needs to concentrate on the linguistic and typographical properties of the script, and not on what to "do with LaTeX commands" in the data stream.

Furthermore, there is an efficiency issue. With Omega there is only one source file, namely the TeX file (and not a pre-TeX file plus a TeX file); one doesn't need to care about preprocessor directives; the system will not fail because of a new LaTeX environment unknown to devnag; and mathematics and similar constructions do not interfere with Devanāgarī processing. Contextual analysis of the Devanāgarī script has, at last, become a fundamental property of the system, independent of macros and packages.
Unicode and Devanāgarī

The Unicode information interchange encoding (www.unicode.org) has tables for all Indic writing systems, based on a common scheme, so that phonetically equivalent letters occupy the same relative positions in each table. The first of these tables (positions 0900–097F, see Table 1) covers Devanāgarī. For historical reasons (compatibility with legacy encodings), the Unicode approach to Devanāgarī is quite awkward: it is partly logical and partly graphical. For example, there are separate positions for the independent and dependent versions of vowels: when encoding text, one has to decide whether a given vowel is dependent or independent, although this clearly follows from contextual analysis, as in Velthuis's transcription, where both versions of a vowel share the same input transcription. On the other hand, this method is not applied to the consonant ra: a ra placed before a cluster of consonants is graphically represented by a mark (the repha) on the last consonant of the cluster (compare kta with rkta). This mark is not provided in the Unicode table, and hence handling this feature is left to the rendering engine.

Nevertheless, despite its weaknesses, Unicode is very important because it ensures compatibility between devices all around the world: a text written in Devanāgarī and encoded in Unicode can be processed (read, printed, analyzed) on every machine or software that is Unicode compliant. Omega is Unicode compliant, and the system described in this paper is designed in such a way that Unicode-encoded texts can be processed just as well as texts encoded in Velthuis's transcription.
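To make the independent/dependent distinction concrete, the two Unicode encodings of the vowel 'aa' can be inspected with a few lines of Python (an illustration only, not part of the package):

```python
import unicodedata

# Unicode gives the independent and the dependent ("matra") form of each
# vowel its own code point in the Devanagari block U+0900-U+097F.
independent_aa = "\u0906"   # used word-initially
dependent_aa = "\u093E"     # used after a consonant

print(unicodedata.name(independent_aa))   # DEVANAGARI LETTER AA
print(unicodedata.name(dependent_aa))     # DEVANAGARI VOWEL SIGN AA
```

The encoder (or, in our case, the OTP chain) is the one that must decide which of the two forms to emit, based on context.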
Installation and Usage

The Omega low-level support for Devanāgarī consists of eight OTPs (Omega Translation Processes) and a small file of macros:

velthuis2unicode.otp
hindi-uni2cuni.otp
hindi-uni2cuni2.otp
hindi-cuni2font.otp
hindi-cuni2font2.otp
hindi-cuni2font3.otp
sanskrit-uni2cuni.otp
sanskrit-cuni2font.otp
odev.sty

The OTP files have to be converted to binary form (*.ocp) and placed in a directory where Omega expects to find them. To typeset text in Devanāgarī, use the command \hindi or \sanskrit (depending on the language of your choice) inside a group,[2] and keyboard the text in Velthuis's transcription (see Table 1, taken from Velthuis's devnag documentation[3]). For example,

{\hindi kulluu, acaanak, \sanskrit kulluu, acaanak}

will produce the corresponding words set in Devanāgarī. [Rendered Devanāgarī output not reproduced here.]

Description of the OTPs

This description is somewhat technical and demands some knowledge both of Omega and of the Devanāgarī script. The reader can find more information on the former on the Omega Web site[4] and on the latter in books about the Devanāgarī script; in particular, there is a very nice introduction to the contextual features of the script in the Unicode book (Section 9.1).[5]

[2] We call the support "low-level" because there is no standard LaTeX3-compliant high-level language support interface yet. We do not know how languages and their properties will be managed in LaTeX3, and therefore do not attempt to introduce yet another syntax for switching to Hindi, Sanskrit or Marathi. Instead we (temporarily) use a devnag-like syntax: simple commands \hindi and \sanskrit which have to be placed inside groups, as in the good old days of plain TeX. . .
[3] To be found on CTAN, language/devanagari/distrib/manual.tex.
[4] http://omega.enstb.org
[5] The Unicode Standard, Version 3.0, Addison-Wesley, Reading, Massachusetts, 2000.
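A complete minimal document might look as follows. This is a sketch under two assumptions not spelled out in the text: that odev.sty is loaded with a plain \usepackage, and that the file is compiled with Lambda (Omega's LaTeX format):

```latex
% Sketch of a minimal Lambda (Omega + LaTeX) document.
% Assumes odev.sty and the compiled *.ocp files are installed
% where Omega can find them.
\documentclass{article}
\usepackage{odev}  % the macro file distributed with the OTPs
\begin{document}
{\hindi kulluu, acaanak}      % Hindi: word-final virama removed
{\sanskrit kulluu, acaanak}   % Sanskrit: word-final virama kept
\end{document}
```

Note that \hindi and \sanskrit are declarations scoped by the surrounding group, not commands taking an argument.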
velthuis2unicode.otp

In this OTP we convert Velthuis's input transcription into Unicode. It is a fairly short OTP (about 80 lines), with lines of the type

`z' => @"095B @"094D ;
`a'`a' => @"0906 ;

On the second line, the pair of letters aa of Velthuis's transcription is mapped to the Unicode character @"0906 (independent vowel "aa"). On the first line, the letter z is mapped to the Unicode characters @"095B (letter "za") and @"094D (virama). This may seem strange, but the plan is to convert independent vowels into dependent ones at a later step, and to use the virama to determine whether a given consonant is part of a consonant cluster. This is done in the following OTPs.

(sanskrit|hindi)-uni2cuni.otp

In this OTP we deal with the virama and with dependent vowels. First of all, in the case of Hindi, we remove the (possible) final virama of a word:

{CONSONANT} {VIRAMA} end: => \1 ;
{CONSONANT} {VIRAMA} {NONHINDI} => \1 <= \3 ;

These two lines remove a virama standing either at the end of the input buffer or before a non-Hindi character; in the latter case, the non-Hindi character is put back into the stream. The code above is for Hindi. In the case of Sanskrit, we instead add a (fake) Unicode character which internally represents the final virama:

{CONSONANT} {VIRAMA} end: => \1 @"097F ;
{CONSONANT} {VIRAMA} {NONHINDI} => \1 @"097F <= \3 ;

Then follow lines of the type

{CONSONANT} {VIRAMA} {INITA} => \1 @"097D ;
{CONSONANT} {VIRAMA} {INITAA} => \1 @"093E @"097D ;

Indeed, by systematically placing a virama after each consonant, we have also introduced viramas between consonants and vowels, which makes no sense. On the first line, the 'short a' vowel is removed together with the (spurious) virama. On the second line, the 'long a' vowel is replaced by the Unicode character @"093E, the dependent version of 'long a', and the virama is removed. There is such a line for each vowel.
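As a functional illustration of these steps, here is a Python sketch. All helper names are invented for this sketch, the mapping table is a tiny excerpt, and the real rules are written in Omega's OTP language, not Python. It shows longest-match transliteration with a provisional virama after each consonant, removal of that virama before vowels with substitution of the dependent vowel form, and the 'short i' reordering that the chain performs later:

```python
# Python sketch (invented names) of what the OTP chain does; the real
# rules live in velthuis2unicode.otp and (sanskrit|hindi)-uni2cuni.otp.
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
I_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I (displayed before its cluster)

# Tiny excerpt of the mapping; the real OTP has about 80 such rules.
VELTHUIS = {
    "aa": "\u0906",           # independent vowel AA
    "a":  "\u0905",           # independent vowel A
    "i":  "\u0907",           # independent vowel I
    "k":  "\u0915" + VIRAMA,  # consonant KA, provisionally with virama
    "t":  "\u0924" + VIRAMA,  # consonant TA, provisionally with virama
}

# Independent vowel -> dependent (matra) form, used after a consonant.
DEPENDENT = {"\u0906": "\u093E", "\u0905": "", "\u0907": I_MATRA}

def velthuis2unicode(text):
    """Longest-match transliteration, as in velthuis2unicode.otp."""
    out, i = "", 0
    while i < len(text):
        for n in (2, 1):                    # try the longer match first
            if text[i:i+n] in VELTHUIS:
                out += VELTHUIS[text[i:i+n]]
                i += n
                break
        else:
            out += text[i]                  # pass unknown characters through
            i += 1
    return out

def uni2cuni(text):
    """Drop the spurious virama before a vowel; make the vowel dependent."""
    out = []
    for ch in text:
        if out and out[-1] == VIRAMA and ch in DEPENDENT:
            out.pop()                       # consonant + vowel: no virama
            out.append(DEPENDENT[ch])       # dependent form ("" for short a)
        else:
            out.append(ch)
    return "".join(out)

def reorder_short_i(text):
    """Move the i-matra before the whole consonant cluster it follows."""
    out = []
    for ch in text:
        if ch == I_MATRA and out:
            j = len(out) - 1                # final consonant of the cluster
            while j >= 2 and out[j - 1] == VIRAMA:
                j -= 2                      # skip a consonant + virama pair
            out.insert(j, I_MATRA)
        else:
            out.append(ch)
    return "".join(out)

word = reorder_short_i(uni2cuni(velthuis2unicode("kti")))
# word == "\u093F\u0915\u094D\u0924": i-matra, KA, virama, TA
```

The word-final virama handling (removed for Hindi, flagged with @"097F for Sanskrit) is omitted here for brevity.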
Notice the presence of the "fake" Unicode character @"097D. This character will be replaced by a soft hyphen at the very last step of our OTP chain.

A special case is the vowel 'short i', whose glyph has to be placed in front of the consonant cluster. This is done by lines of the type

{CONSONANT} {VIRAMA} {INITI} => @"093F \1 \2 @"097D ;

where a consonant and virama are followed by a 'short i' vowel. In this case we place the Unicode character @"093F before the consonant. In similar lines we have n-tuples (n ≤ 7) of consonants and viramas followed by a 'short i' vowel; we replace them by @"093F followed by the group of consonants and viramas, except for the last virama.

hindi-uni2cuni2.otp

One thing that has not