
Character Sets and Their Mysteries. . .

André Guyon
(Language Update, Volume 5, Number 4, 2008, page 33)

The first time I had issues with character sets was in 1984 or 1985, and the trouble began with the very definition of a character set, particularly from a computer science standpoint.

English and French are written using 26 letters, 10 digits and a number of accents and punctuation marks. Early computing used character sets limited to upper-case letters, digits and certain punctuation marks.

In 1972, I worked for CNCP Telegraph and noted that telegrams used a character set that contained no accents or capital letters: the telex set, based on Baudot code.¹

A character set is basically a convention or standard recognized by a certain number of users. Over the years, more or less complete character sets have emerged. One of the best known was the ASCII code, which consisted of 128 characters,² including upper- and lower-case letters, but no accents or characters like the œ in the British English spelling of fœtus.

Computers assign a numeric code to each character to represent these abstract symbols.

A character can be represented in various ways by various fonts, and various attributes (bold, italic, etc.) of the font. Sometimes fonts do not contain all characters in a given set.

Allow me to cut to the chase with an unusual statement: computers store numerical values that represent the various characters that they manipulate.³ For example, in ASCII, the "A" is represented by the code 65, "B" by 66 and so on.
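In Python, for instance, the built-in ord() and chr() functions expose those numeric codes directly:

    # Every character is stored as a number; ord() and chr() convert between them.
    print(ord("A"))   # 65
    print(ord("a"))   # 97: for the computer, a different character from "A"
    print(chr(66))    # B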

However, each new platform (combination of operating system and hardware) has more or less its own coding, usually limited to 256 characters.⁴ A quick aside: for computers, "A" and "a" are different characters.

In the ’80s, I used the PC (IBM), the Apple II and Mac (Apple), the Amiga (Amiga), the Vic 20 and C64 (Commodore) and observed the operation of other less popular models. Most used the ASCII code and added other characters to it for accented letters or graphic symbols for drawings.

Obviously, the codes assigned to accented letters were not the same from one manufacturer to the next, and there was no standard. For example, on a PC, code [alt] 130 gives an "é" while the same key combination on a Mac gives a "Ç."
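Python still ships those legacy code tables, so the divergence is easy to verify:

    # The same byte, 130 (0x82), under two manufacturers' extensions of ASCII.
    byte_130 = bytes([130])
    print(byte_130.decode("cp437"))      # é  (IBM PC code page 437)
    print(byte_130.decode("mac_roman"))  # Ç  (classic Mac character set)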

Not only did each computer manufacturer have its own ASCII code extension, but each printer manufacturer had its own as well. To make things even more complicated, printer designers went one step further by providing users with alternatives that were obtained by configuring tiny DIP switches. One first had to find these switches, which were generally concealed under the print head or in some other equally difficult-to-access location. The documentation was no better.

In 1987-1988, soon after I joined the Translation Bureau, I witnessed the arrival of the first Ogivars (microcomputers). The poor technician in charge of configuring the printers, one Roger Racine,⁵ asked me for a hand.

At that time, a PC was a lot cheaper than a Mac, but did not yet feature accented upper-case letters, so the font had to be modified. Software did the work for the displayed text, but it was Mr. Racine who wrote the code for the printer. Rumour has it that a certain bearded man helped him out.

We diverted codes assigned to graphic or Greek characters to our accented characters.

[Table of graphic and Greek characters]

I believe that Roger Racine has fabulous memories of the experience. To draw a character, you needed to know the number of pixel lines per character for a given printer (8 to 24 depending on printer quality).

To represent the shape of a character to the printer, the computer used the values of a series of superimposed bytes.⁶ Each byte is made up of 8 bits (each a 0 or a 1) and can hold a value between 0 and 255.

In binary mode, zero is written 00000000, and 255 is written 11111111. For display purposes, each 0 or 1 represents a pixel that will be printed when it is a 1.

To help visualize the concept, below is a picture of a magnified number and its representation via twelve bytes. The zeros would not show up on the printout.

0 0 0 0 1 1 1 1
0 0 0 1 1 1 1 1
0 0 0 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 1 1 1 1 1 0
0 0 0 0 0 1 1 1
0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 1 1 0 0 0 1 0
0 1 1 1 1 1 0 0

[Picture representing the number 5 in pixels, magnified]

        1 1 1 1
      1 1 1 1 1
      1
      1 1
    1 1 1 1 1
          1 1 1
            1 1
              1
              1
              1
  1 1       1
  1 1 1 1 1

[Picture representing the number 5 in pixels, magnified]

The second time, I replaced the zeros with spaces.
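For readers who like to tinker, here is a minimal Python sketch of the same idea: each byte is one row of pixels, and each 1 bit becomes a printed dot (the byte values are simply the twelve rows pictured above):

    # Twelve bytes, one per pixel row, copied from the picture above.
    glyph = [0b00001111, 0b00011111, 0b00010000, 0b00011000,
             0b00111110, 0b00000111, 0b00000011, 0b00000001,
             0b00000001, 0b00000001, 0b01100010, 0b01111100]

    for row in glyph:
        # format(row, "08b") writes the byte as 8 binary digits; every 1
        # becomes a printed pixel and every 0 a blank, as on the printout.
        print(format(row, "08b").replace("0", " ").replace("1", "#"))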

To create an accented upper-case letter, we reduced its size by at least two lines, leaving space for the desired accent:

Acute accent   Grave accent   Circumflex accent   Diaeresis
00001000       00010000       00001000            00100100
00010000       00001000       00010100            00000000

All that was left to do was upload our new font to the printer.
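A hypothetical Python sketch of the trick; the byte values below are illustrative stand-ins, not Mr. Racine's actual font data:

    # Two rows for the acute accent (from the table above), followed by a
    # capital E squeezed into the six remaining rows. Values are illustrative.
    acute     = [0b00001000, 0b00010000]
    capital_e = [0b01111110, 0b01000000, 0b01111100,
                 0b01000000, 0b01000000, 0b01111110]

    e_acute = acute + capital_e  # eight rows: accent on top, letter below

    for row in e_acute:
        print(format(row, "08b").replace("0", " ").replace("1", "#"))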

These days

We may be led to believe that problems with accented characters affect only Francophone readers, but a linguist called upon to translate a text in which all the accented characters have been distorted or deleted is equally hard put.

Believe it or not, some parts of the Internet still recognize only ASCII characters (128 characters). You can see what the salutation looks like in an e-mail I received from an organization that deals mainly with localization:

Dear Andr,
[. . .]

Picture a translation request via an e-mail like the one above, where all the accented words are cut off. The famous Être ou ne pas être becomes tre ou ne pas tre, têtu becomes ttu, été becomes t, etc.
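You can reproduce the damage in Python: forcing a string through 7-bit ASCII and discarding everything that does not fit yields exactly the mutilation above:

    phrase = "Être ou ne pas être"
    # Encode to 7-bit ASCII, silently dropping every accented character,
    # which is roughly what an ASCII-only mail gateway does to the text.
    print(phrase.encode("ascii", errors="ignore").decode("ascii"))
    # Output: tre ou ne pas tre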

When this happens, suggest that the client copy the entire document into a Word or WordPerfect file. Unlike text, attachments are not interpreted by Internet gateways, and so they usually arrive safe and sound. A Web page or a plain text file, however, can suffer the horrors of the 128-character set.

Another frequent problem with character sets and fonts can affect you even if you don’t have to translate or read French.

When someone sends you a text written in a font that doesn't exist on your computer, the automatic substitution sometimes selects a font like Wingdings: [graphical representation of the word Wingdings written in the Wingdings font].

If you can read the message. . . you’re a mutant.

It happens to me all the time when someone sends me a message in WordPerfect (which I no longer have on my home computer). I can open the document in Word, but if the sender chose a WordPerfect font instead of a Windows font (Arial, Times, etc.), the message looks something like this:

[Graphical representation of text written in a font that isn't recognized by the word processor]

If this happens to you, don’t panic: just highlight the text and choose another font.

[Screen shot showing where to change the font in the word processor]

[Screen shot of the text in Times New Roman]

Unicode is the answer! But. . .

For English and French, 256 codes are all very well. For languages that use thousands of symbols, however, this was far too limited, which led to the appearance of encodings using more than one byte. Encoding on two bytes (16 bits) allows for 65,536 codes, and some double-byte character sets (DBCS) were indeed used for languages with thousands of symbols.
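A quick Python illustration: the same character that fits in one byte in a Western code page takes two bytes in UTF-16, whose 16 bits are what make room for tens of thousands of codes:

    # One byte per character in Latin-1; two bytes per character in UTF-16.
    print("é".encode("latin-1"))     # b'\xe9'            (one byte)
    print("é".encode("utf-16-be"))   # b'\x00\xe9'        (two bytes)
    print("中".encode("utf-16-be"))  # b'N-', i.e. 0x4E2D (two bytes)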

To have just one code representing all characters in all human languages (even artificial ones), researchers developed Unicode.

Unicode makes life easier for programmers and users. We are moving increasingly toward Unicode, which is the default value for HTML, which integrates perfectly into XML, etc.

However, the transition to Unicode does sometimes play nasty tricks. Documents can contain both Unicode and Windows (ANSI) characters, and sometimes also characters from extended ASCII. When an HTML page does not specify which character set was used, the browser uses UTF-8, one of the "hybrids" of Unicode,⁷ by default.
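The classic symptom is easy to reproduce in Python: UTF-8 bytes interpreted as Windows (ANSI) characters turn every accented letter into a two-character blot:

    text = "été"
    utf8_bytes = text.encode("utf-8")         # b'\xc3\xa9t\xc3\xa9'
    # A browser or editor guessing the wrong character set does this:
    print(utf8_bytes.decode("windows-1252"))  # Ã©tÃ©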

You can also experience problems with characters in fonts that do not exist on your computer, such as Chinese or Japanese (of course, if you can read these languages, you will have these fonts).

Accented characters appear as illegible text. This problem is easy to correct. Go to the View menu and select a different encoding:

[Screen shot showing where to change the character encoding in Netscape]

As you can see, the names can vary somewhat from one browser to another.

[Screen shot showing where to change the character encoding in Internet Explorer]

If Unicode is selected, switch the encoding to 8859-1 (ISO) or Windows; if one of those two is selected, switch the encoding to Unicode. Ninety-nine percent of the time, the display will be corrected.

If automatic selection (or detection) is activated, the problem occurs less frequently. I therefore suggest that you activate this option if it is not activated already.
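Automatic detection can also be scripted. A small sketch, assuming the third-party chardet package is installed (it is not part of Python itself):

    import chardet  # assumption: installed with "pip install chardet"

    sample = "Être ou ne pas être".encode("windows-1252")
    guess = chardet.detect(sample)  # e.g. {'encoding': ..., 'confidence': ...}
    print(guess["encoding"], guess["confidence"])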

As I mentioned above, files can also contain more than one character set. Last year, I had the pleasure of seeing a file that contained both characters in UTF-8 (Unicode) AND characters in extended ASCII.

A text file in UTF-8 includes information on its type. Opened in Notepad, the whole file is therefore interpreted as UTF-8, which hides the accented characters in extended ASCII.
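In the Windows files of that era, that type information is a three-byte signature at the very start of the file, the byte order mark. A quick way to check for it ("mystery.txt" is a stand-in name):

    # A UTF-8 file saved by Windows tools of that era begins with the
    # three-byte signature EF BB BF, called the byte order mark (BOM).
    with open("mystery.txt", "rb") as f:  # stand-in file name
        has_bom = f.read(3) == b"\xef\xbb\xbf"
    print("UTF-8 BOM found" if has_bom else "no BOM")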

The same file opened in Word displays these characters, but interprets them based on the Windows character set (different from extended ASCII).

The problem here is that in this file identified as UTF-8, a process or human action introduced extended ASCII characters.

If I were to open the file in Windows Notepad, I would get something like this:

[Screen shot showing how accented characters in extended ASCII disappear in Notepad]

The line of accented characters in extended ASCII disappears completely.

Here is what the same text looks like in DOS with the Edit editor:

[Screen shot showing how accented characters in UTF-8 are massacred in DOS]

This time, UTF-8 is massacred.

Here is what it looks like in Word:

[Screen shot showing how accented characters in extended ASCII and UTF-8 appear in Word]

I must confess that finding a solution was not a barrel of laughs. But the solution was actually quite simple.

Because I could see the extended ASCII characters in DOS and the others in Notepad, I just needed to make a few global replacements. Not character for character (that would have been too simple), but tag for tag.

In DOS:

É will be replaced by Emajaigu, é by eaigu.
È will be replaced by Egrave, è by egrave, and so on.

I continued the process for all characters that cause problems: accented characters, quotation marks and one or two others.

I performed the opposite process in Notepad and the solution was complete.
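Today the same two-editor dance can be scripted. Here is a hedged Python sketch that walks through the raw bytes, keeps whatever decodes as valid UTF-8, and assumes anything left over is extended ASCII (code page 850, a guess that would need checking against the actual file):

    def repair_mixed(data: bytes) -> str:
        """Decode a file that mixes UTF-8 with extended ASCII (assumed cp850)."""
        out = []
        i = 0
        while i < len(data):
            # UTF-8 sequences are 1 to 4 bytes long; try each width in turn.
            for width in (1, 2, 3, 4):
                try:
                    out.append(data[i:i + width].decode("utf-8"))
                    i += width
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # Nothing decoded as UTF-8: treat the byte as extended ASCII.
                out.append(data[i:i + 1].decode("cp850"))
                i += 1
        return "".join(out)

    # Hypothetical usage; "mixed.txt" is a stand-in file name:
    # print(repair_mixed(open("mixed.txt", "rb").read()))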

I truly hope that this never happens to you. If it does, maybe you’ll remember this story.

But I know it will happen to you one day, and then you'll know that I didn't tell this story just for fun. If it does happen and your client is a technician, could you please send me a photo of the look on his or her face when you come up with the solution?

Notes

1. "Old-timers" will remember baud rather than bits per second from the days when we used modems; it was the number of characters per second. Then kbps (kilobits per second) appeared.
2. The 256-character codes were not ASCII, and each company had its own version.
3. In fact, they only store 1s and 0s.
4. Computers manipulated bytes (8 bits) that represented values between 0 and 255.
5. Today Director of Technology Management at the Translation Bureau.
6. Again, based on the number of pixel lines per character.
7. Unicode exists in several versions, but this is the most common.