Character Encoding Explained
How computers turn characters into bytes, why there are so many encoding standards, and how to avoid the dreaded mojibake.
What is Character Encoding?
A character encoding is a mapping between characters (letters, digits, symbols) and the bytes used to store them in memory or transmit them over a network. When you type the letter "A", your computer stores the byte 0x41. When another computer reads that byte and knows it is ASCII (or UTF-8), it displays "A". If it assumes a different encoding, you get garbage: mojibake.
The fundamental problem: bytes are just numbers. The byte 0xC9 could be "É" in ISO 8859-1, "Й" in Windows-1251, or the first byte of a two-byte UTF-8 sequence. Without knowing the encoding, bytes are meaningless.
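This ambiguity is easy to demonstrate; here is a quick sketch using Python's built-in codecs:

```python
# The same byte means different things under different encodings.
b = bytes([0xC9])

print(b.decode("iso-8859-1"))    # É
print(b.decode("windows-1251"))  # Й

# As UTF-8, 0xC9 (binary 11001001) is only the *start* of a
# two-byte sequence, so on its own it is invalid:
try:
    b.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```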
ASCII: The Foundation
ASCII (1963) assigned codes 0-127 to 128 characters. It uses 7 bits per character, fitting in a single byte. ASCII is simple, universal for English text, and forms the foundation of every modern encoding.
The limitation is obvious: 128 characters cannot represent the world's writing systems. The 8th bit of each byte went unused, which led to "extended ASCII" standards that used codes 128-255 for additional characters.
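The 7-bit limit is visible in code: every ASCII byte has its high bit clear, and anything outside the 128-character repertoire simply cannot be encoded. A small Python illustration:

```python
# ASCII fits in 7 bits: every code is 0-127, so the high bit is always 0.
encoded = "Hello".encode("ascii")
print([hex(b) for b in encoded])      # ['0x48', '0x65', '0x6c', '0x6c', '0x6f']
assert all(b < 0x80 for b in encoded)

# Characters outside ASCII cannot be encoded at all:
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    print("é is not ASCII")
```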
ISO 8859 and the Code Page Era
The ISO 8859 family of standards used the upper half (128-255) for region-specific characters:
- ISO 8859-1 (Latin-1): Western European languages (French, German, Spanish)
- ISO 8859-2 (Latin-2): Central European languages (Polish, Czech, Hungarian)
- ISO 8859-5: Cyrillic (Russian, Bulgarian)
- ISO 8859-6: Arabic
- ISO 8859-15 (Latin-9): Updated Latin-1 with Euro sign (€)
Microsoft added its own variants, most notably Windows-1252, which is almost identical to ISO 8859-1 but uses codes 128-159 for characters like curly quotes and em dashes instead of control characters. This subtle difference has caused countless encoding bugs.
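The difference shows up in the 0x80-0x9F range, where Windows-1252 has printable characters and ISO 8859-1 has invisible control codes. A quick Python comparison:

```python
# Bytes 0x80-0x9F: printable in Windows-1252, control characters in ISO 8859-1.
b = b"\x93smart quotes\x94"

print(b.decode("windows-1252"))        # “smart quotes”
print(repr(b.decode("iso-8859-1")))    # '\x93smart quotes\x94' (controls)
```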
UTF-8: The Universal Encoding
UTF-8 (designed by Ken Thompson and Rob Pike in 1992) is a variable-length encoding for Unicode. It uses 1 to 4 bytes per character, with the number of bytes determined by the leading bits of the first byte:
| Codepoint Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx | A = 41 |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx | é = C3 A9 |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 世 = E4 B8 96 |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😀 = F0 9F 98 80 |
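The byte patterns in the table can be verified directly with any UTF-8 encoder; for example, in Python:

```python
# Verify the example column of the table above.
for ch in ["A", "é", "世", "😀"]:
    print(ch, "->", ch.encode("utf-8").hex(" ").upper())
# A -> 41
# é -> C3 A9
# 世 -> E4 B8 96
# 😀 -> F0 9F 98 80
```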
UTF-8's key properties make it ideal for the modern web:
- ASCII compatible: ASCII bytes (0x00-0x7F) are valid single-byte UTF-8
- Self-synchronizing: You can always find the start of a character by looking for a byte that does not start with the bits 10
- No null bytes: UTF-8 never produces the byte 0x00 except for U+0000 itself, making it safe for C-style null-terminated strings
- Byte-order independent: No BOM needed, unlike UTF-16
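The self-synchronizing property can be sketched in a few lines: continuation bytes match the pattern 10xxxxxx, i.e. `(b & 0xC0) == 0x80`, so scanning backward past continuation bytes finds a character boundary from any offset. (`char_start` is an illustrative helper name, not a standard function.)

```python
data = "aé世".encode("utf-8")   # 61 C3 A9 E4 B8 96

def char_start(data: bytes, i: int) -> int:
    """Back up from byte offset i to the start of its character."""
    while (data[i] & 0xC0) == 0x80:   # 10xxxxxx = continuation byte
        i -= 1
    return i

assert char_start(data, 2) == 1   # C3 A9 (é) starts at offset 1
assert char_start(data, 5) == 3   # E4 B8 96 (世) starts at offset 3
```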
UTF-16: JavaScript and Windows
UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000 through U+FFFF) are encoded as a single 16-bit code unit. Characters above U+FFFF use a surrogate pair: two 16-bit code units that together represent the codepoint.
UTF-16 is the internal string encoding of JavaScript, Java, C#, and Windows. This is why "😀".length returns 2 in JavaScript - it counts 16-bit code units, and a supplementary character like U+1F600 requires two of them.
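The surrogate-pair arithmetic is simple enough to sketch; this follows the standard algorithm (subtract 0x10000, split the remaining 20 bits across two code units):

```python
# Compute the UTF-16 surrogate pair for a supplementary codepoint.
def surrogate_pair(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF
    v = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (v >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

# U+1F600 (😀) becomes two code units:
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```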
UTF-16 also has a byte-order issue: the two bytes of each code unit can be in big-endian (UTF-16BE) or little-endian (UTF-16LE) order. A Byte Order Mark (BOM, U+FEFF) at the start of the file indicates the byte order.
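Python's codecs make the BOM behavior visible: the generic "utf-16" codec prepends a BOM, while the explicit LE/BE variants do not (the generic codec's byte order follows the platform, typically little-endian):

```python
s = "A"
print(s.encode("utf-16").hex(" "))     # ff fe 41 00 on little-endian platforms
print(s.encode("utf-16-le").hex(" "))  # 41 00
print(s.encode("utf-16-be").hex(" "))  # 00 41
```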
UTF-32: Simple but Wasteful
UTF-32 uses exactly 4 bytes per character, regardless of the codepoint. This makes string indexing trivial (character N is at byte offset N*4) but wastes space for text that is mostly ASCII. A file of English text that is 100 KB in UTF-8 would be 400 KB in UTF-32.
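Both the 4x size cost and the trivial indexing are easy to confirm (using the BOM-free `utf-32-le` codec so every character is exactly 4 bytes):

```python
text = "hello world" * 100

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

print(len(utf8), len(utf32))        # 1100 4400
assert len(utf32) == 4 * len(utf8)  # exactly 4x for pure ASCII

# Trivial indexing: character N occupies bytes [4*N, 4*N + 4)
assert utf32[4 * 6 : 4 * 7] == "w".encode("utf-32-le")
```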
Debugging Encoding Issues
Common symptoms and their causes:
- Ã© instead of é: UTF-8 bytes interpreted as ISO 8859-1 or Windows-1252, the classic symptom of mis-decoding or double encoding
- � replacement characters: Invalid byte sequences in UTF-8 decoding
- Chinese characters in a European text: Bytes interpreted with wrong encoding entirely
- Question marks (?): Characters that cannot be represented in the target encoding
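The first symptom above can be reproduced in a few lines of Python:

```python
# Reproduce the classic mojibake: UTF-8 bytes read as Latin-1.
original = "é"                             # U+00E9
utf8_bytes = original.encode("utf-8")      # C3 A9

misread = utf8_bytes.decode("iso-8859-1")
print(misread)                             # Ã©

# Double encoding: the misread text gets saved back as UTF-8,
# so the original 2 bytes balloon to 4.
double = misread.encode("utf-8")
print(double.hex(" ").upper())             # C3 83 C2 A9
```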
To debug encoding issues, examine the raw bytes. Use our Encoding Comparison tool to see exactly how a character is encoded across UTF-8, UTF-16, and ISO 8859-1, or use the Character Inspector to reveal hidden or unexpected characters in your text.
Best Practices
- Use UTF-8 everywhere: For files, APIs, databases, and HTML. There is rarely a good reason to use anything else in new projects.
- Declare your encoding: In HTML (<meta charset="UTF-8">), in HTTP headers (Content-Type: text/html; charset=utf-8), and in database connections.
- Never assume string length equals byte length: Always use encoding-aware functions for string manipulation.
- Test with non-ASCII data: Include accented characters, CJK text, RTL scripts, and emoji in your test data.
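The length pitfall is worth seeing concretely - "length" depends on whether you count codepoints, code units, or bytes:

```python
s = "héllo 😀"

print(len(s))                           # 7 codepoints (Python counts codepoints)
print(len(s.encode("utf-8")))           # 11 bytes
print(len(s.encode("utf-16-le")) // 2)  # 8 UTF-16 code units (😀 takes two)
```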
Further Reading
- UTF-8 — Wikipedia
Technical details and history of the UTF-8 encoding.
- Unicode FAQ — unicode.org
Official Unicode FAQ covering codepoints, encodings, and standards.
- What Every Programmer Should Know About Encodings
Practical guide to understanding and debugging character encoding issues.