Character Encodings & Unicode

Let’s talk about the history of character encodings and how Unicode came and saved the day.

ASCII

ASCII is a character encoding standard first published in 1963. It encodes 128 characters as integers using 7 bits.

Each number (code) represents a character. Codes 32 - 126 represent printable characters used in the English language plus various symbols, while codes below 32 and code 127 are control characters. Control characters are unprintable; they were intended for signalling rather than representing a symbol.
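As a quick illustration, here's a small Python sketch using the built-in ord() and chr() (nothing specific to ASCII tooling, just a way to poke at the codes):

    print(ord('A'))        # 65  -- a printable character in the 32-126 range
    print(ord('~'))        # 126 -- the last printable ASCII character
    print(repr(chr(10)))   # '\n'   -- code 10 (line feed) is a control character
    print(repr(chr(127)))  # '\x7f' -- code 127 (DEL) is a control character too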

ANSI (ASCII Extended)

Because a byte is 8 bits and ASCII only uses 7, a lot of people realised that they had codes 128-255 to themselves.

Hardware manufacturers had various ideas and interesting ways to use the additional 128 codes and soon began creating their own character encodings, called code pages.

Most code pages were ways to encode different languages, although some included other characters, like code page 437 for the IBM PC, which had a bunch of ╚═╗ box-drawing characters ╔═╝.

For the most part, this worked great. Manufacturers simply shipped computers with different code pages depending on where they were being sold.

ANSI Issues

However, issues soon arose whenever two computers using two different code pages attempted to communicate. If I sent an email from my machine using a Spanish code page, a receiver on, let's say, a Japanese machine wouldn't see the accented characters, as those codes mapped to Japanese characters there.

Problems even arose between computers using the same language. If I sent a message from a Japanese Macintosh, a receiver using an IBM, HP, DOS or Windows product wouldn't be using code page 10001 (Apple Japanese) and therefore potentially wouldn't be able to read it!
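Here's a small sketch of the problem using Python's bundled codecs (I've picked a Greek code page rather than a Japanese one purely to keep the example single-byte):

    # The same byte means different things under different code pages.
    data = 'mañana'.encode('cp1252')   # Spanish text on a Western European code page
    print(data)                        # b'ma\xf1ana'
    print(data.decode('cp1252'))       # mañana -- correct on the sender's machine
    print(data.decode('cp1253'))       # maρana -- byte 0xF1 is ρ on a Greek code page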

On top of which, despite the additional characters, there still wasn't enough space for some alphabets, especially Asian alphabets with thousands of characters. This resulted in the DBCS (double byte character set), a character encoding that stored some characters in a single byte and others in two. Without a fixed character width, simply stepping forwards or backwards through a string became awkward, which led to workarounds like Windows' AnsiNext and AnsiPrev functions.
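To get a feel for those mixed widths, here's a sketch using Python's Shift JIS codec (my choice of example encoding for illustration, not something tied to the original systems):

    # In a double byte character set, characters have mixed widths.
    text = 'Aカナ'                # one ASCII letter, two full-width katakana
    data = text.encode('shift_jis')
    print(len(text))             # 3 characters
    print(len(data))             # 5 bytes: 1 for 'A', 2 each for カ and ナ
    # You can't just move a pointer forward one byte to reach the next
    # character, hence helpers like AnsiNext and AnsiPrev.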

Not to mention that if you wanted to work in more than one language besides English, your options were basically limited to programs that rendered text as horrible bitmapped graphics rather than proper native characters.

Unicode to the Rescue

Unicode is a computing industry standard that was created in 1991 as a way to unify all the different and varying character sets across the world. The most recent version to date, 12.1, contains a total of 137,929 characters.

The Unicode group mapped each character, across languages, to a code point. This came with debates: for example, is the German ß a real character or just a fancier way of writing 'ss'? Brilliantly, the Unicode group is also responsible for standardising emojis, which in 2017 brought a debate over whether the poop emoji needed emotions.

One thing that they did agree on, however, was to start with the ASCII standard as a base. This was a great decision as a way to ramp up adoption from the Western world. As a result, in Unicode ‘abc’ is U+0061 U+0062 U+0063.

These code points are only theoretical however. They do not specify how to store this in memory or how they are represented. For that, we need an encoding.
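As a quick sketch, Python exposes code points through the built-in ord(), so you can see the mapping for yourself:

    # Code points are just numbers; ord() gives the number for a character.
    for ch in 'abc':
        print(f'U+{ord(ch):04X}')    # U+0061, U+0062, U+0063
    print(f'U+{ord("あ"):04X}')       # U+3042 -- Hiragana A, same scheme, bigger number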

UCS-2

UCS-2 was the first attempt at encoding Unicode, and it took a very simple approach.

The approach was to store each character as 2 bytes, which at the time covered all defined characters. The problem was that characters which previously fit in a single ASCII byte now took two, so anyone working only with English had to embrace twice the memory footprint for little gain.
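Here's a sketch of that 2-bytes-per-character idea (Python has no UCS-2 codec, but for these characters the UTF-16 codec produces identical bytes):

    # Every character becomes 2 bytes, even plain ASCII.
    data = 'abc'.encode('utf-16-le')   # little-endian, no byte order mark
    print(data)                        # b'a\x00b\x00c\x00'
    print(len(data))                   # 6 bytes for 3 characters -- double the ASCII cost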

UTF-8

UTF-8 resolves this problem by having a variable width. It stores ASCII characters in a single byte, with anything greater being stored in 2 to 4 bytes.

This has the effect that not only are the code points of ASCII characters the same in Unicode, but their representation in memory is the same too!

The size of each character in this encoding generally looks like this: 1 byte for ASCII (up to U+007F), 2 bytes up to U+07FF, 3 bytes up to U+FFFF and 4 bytes for everything above.
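You can check those widths with a quick Python sketch:

    # How many bytes UTF-8 uses for characters from different ranges.
    for ch in ('a', 'é', 'あ', '😀'):
        print(ch, len(ch.encode('utf-8')))    # 1, 2, 3 and 4 bytes respectively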

UTF-16

UTF-16 is also a variable-width encoding, with the difference being that each code point uses a minimum of 2 bytes. It is also often found as a replacement for UCS-2, like in Java.

This isn’t great for English text, but it works well for Asian text: most of those characters need 3 bytes in UTF-8 but only 2 in UTF-16, so it can be more memory efficient.

The size of each character in this encoding generally looks like this: 2 bytes for anything in the Basic Multilingual Plane (up to U+FFFF) and 4 bytes, as a surrogate pair, for everything above.
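The same Python sketch as before, with UTF-16 instead:

    # How many bytes UTF-16 uses for the same characters.
    for ch in ('a', 'é', 'あ', '😀'):
        print(ch, len(ch.encode('utf-16-le')))    # 2, 2, 2 and 4 bytes respectively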

UTF-32

You may have also heard of UTF-32. This encoding has a fixed width of 4 bytes per character and is rarely used, as it is quite a memory hog.
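One last Python sketch to round things off:

    # UTF-32: always 4 bytes per character, no matter what.
    for ch in ('a', 'あ', '😀'):
        print(ch, len(ch.encode('utf-32-le')))    # 4, 4 and 4 bytes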