
How Perl Added Unicode Support 10 Years Ago Without You Noticing It

Imagine that your new German customer sends you a ton of text files that you have to add to your document database. You write a Perl script which neatly imports all the data into your shiny new PostgreSQL database. As you tell this to your DBA, she wonders what has happened to all the German ä, ö, ü umlauts and ß characters in the process. You had not suspected that there might be a problem, but, as you look, all is well, even though the database is UTF-8 encoded while the German text files were seemingly normal text files. Another shining example of Perl doing exactly what you want even when you don't know what you are doing.

All seems well, that is, until someone from accounting notices that all the Euro symbols (€) have been turned into e symbols. That's when you start digging into how this really works with Perl and character encodings and Unicode.

Back in the '60s, the American Standard Code for Information Interchange (aka ASCII) had become the lingua franca for encoding English text for electronic processing outside the IBM mainframe world. As the use of computers spread to other languages, the whole encoding business became a jumbled mess. The vendors, as well as some international standardization bodies, fell over each other to come up with sensible ways of encoding all the extra characters found in non-English languages. Each language or group of similar languages got one or several encodings.

In Western Europe, the Latin1, or ISO-8859-1, encoding became popular in the '80s and '90s. It sported all the characters required to write in the Western European languages. For work in a single language, this was fine, but as soon as multiple languages were in play, it all became quite confusing; data had to be converted from one encoding to another, often losing information as some symbols from encoding A could not be represented in encoding B.
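
The information loss described above can be sketched in a few lines of Perl. This is only an illustration and not code from the article: the sample string and the variable names are made up, and it assumes the core Encode module with its default settings, under which a character that has no slot in the target encoding is replaced by a substitution character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample text ("Größe: 5 €"), written with escapes so that
# the script itself stays pure ASCII.
my $text = "Gr\x{F6}\x{DF}e: 5 \x{20AC}";

# Latin1 (ISO-8859-1) has no Euro sign. With Encode's default settings the
# unmappable character is replaced by a substitution character.
my $latin1 = encode('ISO-8859-1', $text);

# Latin9 (ISO-8859-15) does have a Euro sign, stored as the single byte 0xA4.
my $latin9 = encode('ISO-8859-15', $text);

# Decode both byte strings again and compare with the original characters.
my $latin1_roundtrip = decode('ISO-8859-1',  $latin1);
my $latin9_roundtrip = decode('ISO-8859-15', $latin9);

printf "Latin1 round trip: %s\n",
    $latin1_roundtrip eq $text ? 'lossless' : 'lost information';
printf "Latin9 round trip: %s\n",
    $latin9_roundtrip eq $text ? 'lossless' : 'lost information';
```

Run as is, the Latin1 round trip should report lost information, because the Euro sign cannot be represented there, while the Latin9 round trip should come back unchanged. The umlauts and the ß survive in both cases, since the two encodings place them at the same byte values.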

In the late '80s, work began on a single universal encoding, capable of encoding text from all the world's languages in a unified manner. In 1991 the Unicode Consortium was incorporated, and it published its first standard later that year. The current version of the standard is Unicode 6.0, published in October 2010. It covers more than 109,000 symbols from 93 different scripts. Each symbol is listed with a visual reference, as well as a name made up from …

Tobias Oetiker