Character Encoding Explained
How computers turn characters into bytes, why there are so many encoding standards, and how to avoid the dreaded mojibake.
What is Character Encoding?
A character encoding is a mapping between characters (letters, digits, symbols) and the bytes used to store them in memory or transmit them over a network. When you type the letter "A", your computer stores the byte 0x41. When another computer reads that byte and knows it is ASCII (or UTF-8), it displays "A". If it assumes a different encoding, you get garbage: mojibake.
The fundamental problem: bytes are just numbers. The byte 0xC9 could be "É" in ISO 8859-1, "Й" in Windows-1251, or the first byte of a two-byte UTF-8 sequence. Without knowing the encoding, bytes are meaningless.
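This ambiguity is easy to demonstrate; here is a quick sketch using Python's built-in codecs:

```python
# The same byte means different things under different encodings.
b = bytes([0xC9])

print(b.decode("iso-8859-1"))    # É
print(b.decode("windows-1251"))  # Й

# As UTF-8, 0xC9 (binary 11001001) is only the *start* of a
# two-byte sequence, so on its own it is invalid:
try:
    b.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```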
ASCII: The Foundation
ASCII (1963) assigned codes 0-127 to 128 characters. It uses 7 bits per character, fitting in a single byte. ASCII is simple, universal for English text, and forms the foundation of every modern encoding.
The limitation is obvious: 128 characters cannot represent the world's writing systems. The 8th bit of each byte went unused, which led to "extended ASCII" standards that used codes 128-255 for additional characters.
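The 7-bit limit is visible in code: every ASCII byte has its high bit clear, and anything outside the 128-character repertoire simply cannot be encoded. A small Python illustration:

```python
# ASCII fits in 7 bits: every code is 0-127, so the high bit is always 0.
encoded = "Hello".encode("ascii")
print([hex(b) for b in encoded])      # ['0x48', '0x65', '0x6c', '0x6c', '0x6f']
assert all(b < 0x80 for b in encoded)

# Characters outside ASCII cannot be encoded at all:
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    print("é is not ASCII")
```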
ISO 8859 and the Code Page Era
The ISO 8859 family of standards used the upper half (128-255) for region-specific characters:
- ISO 8859-1 (Latin-1): Western European languages (French, German, Spanish)
- ISO 8859-2 (Latin-2): Central European languages (Polish, Czech, Hungarian)
- ISO 8859-5: Cyrillic (Russian, Bulgarian)
- ISO 8859-6: Arabic
- ISO 8859-15 (Latin-9): Updated Latin-1 with Euro sign (€)
Microsoft added its own variants, most notably Windows-1252, which is almost identical to ISO 8859-1 but uses codes 128-159 for characters like curly quotes and em dashes instead of control characters. This subtle difference has caused countless encoding bugs.
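The difference shows up in the 0x80-0x9F range, where Windows-1252 has printable characters and ISO 8859-1 has invisible control codes. A quick Python comparison:

```python
# Bytes 0x80-0x9F: printable in Windows-1252, control characters in ISO 8859-1.
b = b"\x93smart quotes\x94"

print(b.decode("windows-1252"))        # “smart quotes”
print(repr(b.decode("iso-8859-1")))    # '\x93smart quotes\x94' (controls)
```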
UTF-8: The Universal Encoding
UTF-8 (designed by Ken Thompson and Rob Pike in 1992) is a variable-length encoding for Unicode. It uses 1 to 4 bytes per character, with the number of bytes determined by the leading bits of the first byte:
| Codepoint Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx | A = 41 |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx | é = C3 A9 |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 世 = E4 B8 96 |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😀 = F0 9F 98 80 |
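The byte patterns in the table can be verified directly with any UTF-8 encoder; for example, in Python:

```python
# Verify the example column of the table above.
for ch in ["A", "é", "世", "😀"]:
    print(ch, "->", ch.encode("utf-8").hex(" ").upper())
# A -> 41
# é -> C3 A9
# 世 -> E4 B8 96
# 😀 -> F0 9F 98 80
```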
UTF-8's key properties make it ideal for the modern web:
- ASCII compatible: ASCII bytes (0x00-0x7F) are valid single-byte UTF-8
- Self-synchronizing: You can always find the start of a character by looking for a byte that does not start with the bits 10
- No null bytes: UTF-8 never produces the byte 0x00 except for U+0000 itself, making it safe for C-style null-terminated strings
- Byte-order independent: No BOM needed, unlike UTF-16
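The self-synchronizing property can be sketched in a few lines: continuation bytes match the pattern 10xxxxxx, i.e. `(b & 0xC0) == 0x80`, so scanning backward past continuation bytes finds a character boundary from any offset. (`char_start` is an illustrative helper name, not a standard function.)

```python
data = "aé世".encode("utf-8")   # 61 C3 A9 E4 B8 96

def char_start(data: bytes, i: int) -> int:
    """Back up from byte offset i to the start of its character."""
    while (data[i] & 0xC0) == 0x80:   # 10xxxxxx = continuation byte
        i -= 1
    return i

assert char_start(data, 2) == 1   # C3 A9 (é) starts at offset 1
assert char_start(data, 5) == 3   # E4 B8 96 (世) starts at offset 3
```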
UTF-16: JavaScript and Windows
UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000 through U+FFFF) are encoded as a single 16-bit code unit. Characters above U+FFFF use a surrogate pair: two 16-bit code units that together represent the codepoint.
UTF-16 is the internal string encoding of JavaScript, Java, C#, and Windows. This is why "😀".length returns 2 in JavaScript - it counts 16-bit code units, and a supplementary character like U+1F600 requires two of them.
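The surrogate-pair arithmetic is simple enough to sketch; this follows the standard algorithm (subtract 0x10000, split the remaining 20 bits across two code units):

```python
# Compute the UTF-16 surrogate pair for a supplementary codepoint.
def surrogate_pair(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF
    v = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (v >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

# U+1F600 (😀) becomes two code units:
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```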
UTF-16 also has a byte-order issue: the two bytes of each code unit can be in big-endian (UTF-16BE) or little-endian (UTF-16LE) order. A Byte Order Mark (BOM, U+FEFF) at the start of the file indicates the byte order.
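Python's codecs make the BOM behavior visible: the generic "utf-16" codec prepends a BOM, while the explicit LE/BE variants do not (the generic codec's byte order follows the platform, typically little-endian):

```python
s = "A"
print(s.encode("utf-16").hex(" "))     # ff fe 41 00 on little-endian platforms
print(s.encode("utf-16-le").hex(" "))  # 41 00
print(s.encode("utf-16-be").hex(" "))  # 00 41
```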
UTF-32: Simple but Wasteful
UTF-32 uses exactly 4 bytes per character, regardless of the codepoint. This makes string indexing trivial (character N is at byte offset N*4) but wastes space for text that is mostly ASCII. A file of English text that is 100 KB in UTF-8 would be 400 KB in UTF-32.
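Both the 4x size cost and the trivial indexing are easy to confirm (using the BOM-free `utf-32-le` codec so every character is exactly 4 bytes):

```python
text = "hello world" * 100

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

print(len(utf8), len(utf32))        # 1100 4400
assert len(utf32) == 4 * len(utf8)  # exactly 4x for pure ASCII

# Trivial indexing: character N occupies bytes [4*N, 4*N + 4)
assert utf32[4 * 6 : 4 * 7] == "w".encode("utf-32-le")
```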
Debugging Encoding Issues
Common symptoms and their causes:
- Ã© instead of é: UTF-8 bytes interpreted as ISO 8859-1 or Windows-1252, the classic symptom of mis-decoding or double encoding
- � replacement characters: Invalid byte sequences in UTF-8 decoding
- Chinese characters in a European text: Bytes interpreted with wrong encoding entirely
- Question marks (?): Characters that cannot be represented in the target encoding
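The first symptom above can be reproduced in a few lines of Python:

```python
# Reproduce the classic mojibake: UTF-8 bytes read as Latin-1.
original = "é"                             # U+00E9
utf8_bytes = original.encode("utf-8")      # C3 A9

misread = utf8_bytes.decode("iso-8859-1")
print(misread)                             # Ã©

# Double encoding: the misread text gets saved back as UTF-8,
# so the original 2 bytes balloon to 4.
double = misread.encode("utf-8")
print(double.hex(" ").upper())             # C3 83 C2 A9
```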
To debug encoding issues, examine the raw bytes. Use our Encoding Comparison tool to see exactly how a character is encoded across UTF-8, UTF-16, and ISO 8859-1, or use the Character Inspector to reveal hidden or unexpected characters in your text.
Best Practices
- Use UTF-8 everywhere: For files, APIs, databases, and HTML. There is rarely a good reason to use anything else in new projects.
- Declare your encoding: In HTML (<meta charset="UTF-8">), in HTTP headers (Content-Type: text/html; charset=utf-8), and in database connections.
- Never assume string length equals byte length: Always use encoding-aware functions for string manipulation.
- Test with non-ASCII data: Include accented characters, CJK text, RTL scripts, and emoji in your test data.
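The length pitfall is worth seeing concretely - "length" depends on whether you count codepoints, code units, or bytes:

```python
s = "héllo 😀"

print(len(s))                           # 7 codepoints (Python counts codepoints)
print(len(s.encode("utf-8")))           # 11 bytes
print(len(s.encode("utf-16-le")) // 2)  # 8 UTF-16 code units (😀 takes two)
```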
Further Reading
- UTF-8 — Wikipedia
Technical details and history of the UTF-8 encoding.
- Unicode FAQ — unicode.org
Official Unicode FAQ covering codepoints, encodings, and standards.
- What Every Programmer Should Know About Encodings
Practical guide to understanding and debugging character encoding issues.