A framework for multilingual information processing

Recent and continuing rapid increases in computing power now enable more of humankind's written communication to be represented as digital data. The most recent and obvious change in multilingual information processing has been the introduction of larger character sets encompassing more writing systems. Yet the very richness of these larger collections of characters has made the interpretation and processing of text more difficult. The many competing motivations for standardizing character sets (satisfying the needs of linguists, computer scientists, and typographers) threaten the purpose of information processing: accurate and facile manipulation of data. Existing character sets are constructed without a consistent strategy or architecture. Complex algorithms and reports are now necessary to understand raw streams of characters representing multilingual text. We assert that information processing is an architectural problem and not just a character set problem. We analyze several multilingual information processing algorithms (e.g., bidirectional reordering and character normalization) and conclude that they are more dangerous than beneficial. Their many unexpected interactions suggest the lack of a coherent architecture. We introduce abstractions and novel mechanisms, and take the first steps towards organizing them into a new architecture for multilingual information processing. We propose a multilayered architecture, which we call Metacode, in which character sets appear in lower layers and protocols and algorithms appear in higher layers. We recast bidirectional reordering and character normalization in the Metacode framework.
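
To make concrete why raw character streams need further algorithms before they can be interpreted, the following minimal sketch uses Python's standard `unicodedata` module. It is only an illustration and is not part of Metacode or of the analysis described above; the helper name `canonically_equal` and the sample strings are assumptions chosen for the example. It shows two code point sequences for the same visible text that compare unequal until normalized, and the per-character directional classes that bidirectional reordering takes as input.

```python
import unicodedata

# Two encodings of the same visible text "é" (sample data for illustration).
composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # base letter followed by a combining acute accent


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A raw code point comparison treats the two sequences as different text.
raw_equal = composed == decomposed

# After normalization the two sequences compare equal.
normalized_equal = canonically_equal(composed, decomposed)

# Directional classes of a Latin letter, a Hebrew letter, and a digit.
# Bidirectional reordering works from these per-character classes.
bidi_classes = [unicodedata.bidirectional(ch) for ch in ("a", "\u05d0", "1")]
```

Even in this small case, a consumer of the text cannot compare or display it correctly from the character codes alone, which is the kind of dependency on higher-level algorithms that the abstract describes.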


[21]  Timo Honkela,et al.  A Framework for Global Software , 1995 .

[22]  Joseph D. Becker Arabic word processing , 1987, CACM.

[23]  David Flanagan,et al.  Java in a Nutshell , 1996 .

[24]  Matt Belge,et al.  The next step in software internationalization , 1995, INTR.

[25]  Mark Davis,et al.  International text in JDK 1.2 , 2000 .

[26]  John Hughes,et al.  Why Functional Programming Matters , 1989, Comput. J..

[27]  Scott Jones,et al.  Digital Guide to Developing International User Information , 1991 .

[28]  Harald Tveit Alvestrand IETF Policy on Character Sets and Languages , 1998, RFC.

[29]  R. Stansifer,et al.  Implementations of Bidirectional Reordering Algorithms , 2022 .

[30]  Michael Morrison,et al.  Xml Unleashed , 1999 .

[31]  Tom Madell,et al.  Developing and Localizing International Software , 1994 .

[32]  David J. Taylor,et al.  Internationalization: Developing Software For Global Markets , 1995 .

[33]  Richard M. Stallman,et al.  Gnu Emacs Manual , 1996 .

[34]  David A. Schmitt International Programming for Microsoft Windows , 2000 .

[35]  Hiroyoshi Ohara,et al.  Internalized Text Manipulation Covering Perso-Arabic Enhanced for Mongolian Scripts , 1998, EP.

[36]  Bill Tuthill,et al.  Creating Worldwide Software: Solaris International Developer's Guide , 1997 .

[37]  Glenn Adams Internationalization and character set standards , 1993, STAN.

[38]  F. Abed,et al.  Cultural Influences on Visual Scanning Patterns , 1991 .

[39]  Tony Fernandes,et al.  Global interface design , 1994, CHI Conference Companion.

[40]  Mark R. Crispin,et al.  The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996 , 1997, RFC.

[41]  W. Neville Holmes Toward Decent Text Encoding , 1998, Computer.

[42]  Glenn Searfoss JIS Kanji Character Recognition Methods , 1994 .

[43]  Jack Grimes,et al.  Creating Global Software: Text Handling and Localization in Taligent's CommonPoint Application System , 1996, IBM Syst. J..