
Latest developments in Ω

The Ω system has been available since early 1995 and has been used experimentally at several sites around the world. We gather here some conclusions from these experiments and explain which aspects will be included in version 1.3 of Ω, which should be the first large-scale release of the system. Not only will the portability and performance of Ω be improved substantially, but new features, including smart fonts and multidirectional support, will be included.

When Ω was first conceived, the primary objectives were to remove the 8-bit restrictions imposed by the original design of TeX (number of characters, fonts, registers of each kind, etc.), as well as to offer the means necessary for multilingual typesetting, no matter how complex the script. The 8-bit restrictions were removed quite easily by simply doubling the size of all data structures in the TeX program and by introducing a variant of the .tfm file, called the .xfm file, in which fonts of up to 65,536 characters could be built. For typesetting complex scripts, such as classical Arabic or Hebrew, a series of finite state automata, called Ω Translation Processes (ΩTPs), can be successively applied to the input character stream to do arbitrarily complex manipulations. After each application of an ΩTP, the macro-expansion facilities of TeX are reinvoked, which means that the full power of TeX is available every time an ΩTP is used.

Performance and portability

The current version of Ω resides on the ftp.ens.fr server and has been used experimentally by several different groups in different countries. From their responses, we now understand what must be done for Ω to be a realistic replacement for TeX.

First, Ω is too big! A typical run of Ω uses about 14MB, which is just fine when you are sitting in front of a 500MHz machine with 512MB, but certainly not on a typical portable.
This tremendous size comes from the TeX program structure, in which static arrays are allocated to handle primitives such as \catcode or \delcode, to store register values and font information, etc. However, the average user will never need 65,536 fonts of 65,536 characters each, nor 65,536 mu-registers, etc. Most of these huge tables are full of zeros, and it seems silly to have to go out and buy RAM and see your system slow down just so you can have lots of empty tables in your program.

This problem will be solved in the next version through the introduction of several primitives of the form \MaxActiveCharacter, \MaxRegister, \MaxFont or \MaxWrite. These primitives correspond to compile-time constants, which should be overridable upon loading Ω. In this way, a single binary can be used, whatever the resources needed. According to Benjamin Bayart (École Supérieure d'Ingénieurs en Électrotechnique et Électronique in Paris), these primitives also make it possible for macro packages to determine whether the existing system has the required resources or whether a new run should be undertaken with larger tables.

Second, Ω does not run correctly on little-endian machines. This problem was solved by Benjamin Bayart, and the fix will be incorporated in version 1.3. As a result, there should be working versions of Ω for Intel boxes running DOS, Windows and Linux.

Finally, performance is unsatisfactory for ΩTPs that are being used for multilingual applications. Because the macro-expansion facilities are applied with every use of an ΩTP, the use of two successive ΩTPs can slow down Ω so that it runs only 40% as fast as the original TeX. This may be acceptable for limited applications where specialized effects are wanted, but it is certainly unacceptable in a production setting where thousands of pages are being generated every day.

TUGboat, Volume 0 (2001), No. 0, Proceedings of the 2001 Annual Meeting, 1001. John Plaice and Yannis Haralambous

Support for complex scripts

It turns out, however, that for any given language, very few TeX primitives are required for typesetting high-quality output. As a result, we are looking for more efficient techniques that can be used for the standard cases. In particular, one author (J.P.) has worked with ArborText, Inc. (Ann Arbor, Michigan), a supplier of SGML authoring and composition software and services, in developing typesetting support for ideogram scripts from East Asia. ArborText uses a TeX-based engine for printing SGML documents, and this engine was modified so that it could output Japanese text.

The fundamental understanding of a font in TeX is that each character has a width, height, depth and italic correction. Characters are placed in order on the baseline, and the exact choice of character and the exact horizontal positioning can be adjusted using the ligature/kerning table. This technique is reasonable for laying out alphabetic scripts in which the characters are printed separately, as is the common case for Latin, Greek, Cyrillic, Armenian and Georgian, among others.

Nevertheless, even for the Latin script, problems can arise: the Unicode standard provides for more than 900 precomposed characters. Building a complete ligature/kerning table for such a font would require inordinate amounts of memory. Furthermore, it is unlikely that, say, character 1EA9 (latin small letter a with circumflex and hook above), used in Vietnamese, would be found next to character 01C5 (latin capital letter d with small letter z with caron), which is a Croatian digraph: much of the table would be useless. In fact, many of the characters have similar attributes: the difference between `è' and `é' is unlikely to influence the ligature/kerning program, so there are many opportunities for compressing it.

When we pass on to the ideogram scripts found in East Asia, all the characters have the same dimension.
There are no ligatures and no kerning. However, if a fixed grid is not chosen, then glue must be placed between successive characters. Furthermore, line breaks cannot occur after left-bracket-like characters or before right-bracket-like characters. To handle such situations, penalties must be placed automatically in the appropriate places.

For vowelized Arabic, the requirements are different yet again. Not only must the correct presentation form (isolated, initial, medial or final) of each consonant be chosen, but the diacritics, including vowels and hamza, must be properly placed with respect to the consonants. Doing this requires additional parameters for each character, designating the horizontal and vertical placement of the different diacritics. Finally, keshidehs (straight lines or Bézier curves) must be placed between consonants to fill out lines.

For the Arabic script in Nastaliq style, as is normally used for Farsi or Urdu, typesetting becomes even more complicated, since successive characters are not placed on the baseline. Rather, the characters within a word are placed in a sort of staircase arrangement. The first character, to the right, is placed highest. The lowest is the last character, to the left. Once again, extra character parameters are required so that the successive characters can be displaced vertically by the right amount.

The work undertaken with ArborText involved designing another extension to the font metric files so that an arbitrary number of different kinds of parameters could be defined for the font as a whole or for each individual character. Currently, five sorts of parameter can be defined: integer, fixword, rule, glue and penalty. In addition, the ligature/kerning program has been modified to allow the automatic insertion of glue and penalties between characters, as is required for East Asian ideogram fonts. In addition to changes to the TeX driver, the pltotf and tftopl programs both had to be modified.
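
The automatic insertion of glue and penalties between ideographic characters can be illustrated with a short Python sketch. This is not the ArborText or Ω implementation: the `Glue` and `Penalty` classes, the bracket sets, and the numeric values are all assumptions chosen for illustration. The only TeX facts relied on are that a penalty of 10000 forbids a break and that glue directly after a penalty is not a legal breakpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glue:
    """Stretchable space, mirroring the 'glue' parameter kind (hypothetical fields)."""
    natural: float
    stretch: float = 0.0
    shrink: float = 0.0


@dataclass(frozen=True)
class Penalty:
    """Line-break penalty, mirroring the 'penalty' parameter kind."""
    value: int


# Illustrative character classes; a real font would carry these as per-character data.
OPENING = set("「『（〈《")
CLOSING = set("」』）〉》、。")

INTER_CHAR_GLUE = Glue(natural=0.0, stretch=0.1)  # assumed value
NO_BREAK = Penalty(10000)                         # TeX's "never break here"


def insert_glue_and_penalties(text):
    """Return a list of characters interleaved with glue and penalties.

    Glue goes between every pair of characters; a no-break penalty is put
    in front of that glue after an opening bracket or before a closing one.
    """
    items = []
    for i, ch in enumerate(text):
        items.append(ch)
        if i + 1 == len(text):
            break  # nothing follows the last character
        nxt = text[i + 1]
        if ch in OPENING or nxt in CLOSING:
            items.append(NO_BREAK)
        items.append(INTER_CHAR_GLUE)
    return items


example = insert_glue_and_penalties("「日本」語")
```

In this sketch the glue after `「` and the glue before `」` are each preceded by the no-break penalty, while the remaining inter-character glue stays available as a breakpoint.
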
Under an agreement with ArborText, we will be incorporating these ideas into Ω. In fact, Ω will support a generalization of these ideas: the ligature/kerning table will allow two-dimensional capabilities, thereby solving all of the difficulties in typesetting calligraphic scripts such as Arabic.
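
To make the idea of a two-dimensional kerning table concrete, here is a minimal Python sketch of the staircase placement described for Nastaliq. It is purely illustrative and does not reflect any actual Ω file format: the function name, the `widths` mapping, and the `kern2d` table keyed by character pairs are all hypothetical, as are the glyph names and values.

```python
def place_right_to_left(glyphs, widths, kern2d):
    """Place glyphs from right to left with a per-pair (dx, dy) adjustment.

    glyphs : glyph names in reading order (first glyph is rightmost)
    widths : hypothetical mapping from glyph name to advance width
    kern2d : hypothetical mapping from (previous, current) to (dx, dy)

    Returns a list of (glyph, x, y) giving the left edge and vertical
    offset of each glyph, relative to a pen starting at (0, 0).
    """
    x, y = 0.0, 0.0
    placed = []
    prev = None
    for g in glyphs:
        # Pairs missing from the table get no adjustment.
        dx, dy = kern2d.get((prev, g), (0.0, 0.0))
        x += dx - widths[g]  # move left by the glyph width, corrected by dx
        y += dy              # a negative dy steps down the staircase
        placed.append((g, x, y))
        prev = g
    return placed


# Made-up glyph names and metrics, for illustration only.
widths = {"a": 2.0, "b": 1.0, "c": 1.5}
kern2d = {("a", "b"): (0.25, -0.5), ("b", "c"): (0.0, -0.25)}
positions = place_right_to_left(["a", "b", "c"], widths, kern2d)
```

With these values the first glyph sits highest and each later glyph is placed further left and lower, which is the staircase behaviour that a one-dimensional kern cannot express.
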

John Plaice
