
Character Encoding Explained

How computers turn characters into bytes, why there are so many encoding standards, and how to avoid the dreaded mojibake.

What is Character Encoding?

A character encoding is a mapping between characters (letters, digits, symbols) and the bytes used to store them in memory or transmit them over a network. When you type the letter "A", your computer stores the byte 0x41. When another computer reads that byte and knows it is ASCII (or UTF-8), it displays "A". If it assumes a different encoding, you get garbage: mojibake.

The fundamental problem: bytes are just numbers. The byte 0xC9 could be "É" in ISO 8859-1, "Й" in Windows-1251, or the first byte of a two-byte UTF-8 sequence. Without knowing the encoding, bytes are meaningless.
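
This ambiguity is easy to reproduce in Python, where the same bytes can be decoded with different codecs (a minimal sketch using only the standard library):

```python
# One byte, several readings: 0xC9 means different things per encoding.
raw = bytes([0xC9])

print(raw.decode("iso-8859-1"))    # É (Latin capital E with acute)
print(raw.decode("windows-1251"))  # Й (Cyrillic capital short I)

# In UTF-8, 0xC9 is only a lead byte; it needs a continuation byte.
print(bytes([0xC9, 0xA9]).decode("utf-8"))  # ɩ (U+0269)
```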

ASCII: The Foundation

ASCII (1963) assigned codes 0-127 to 128 characters. It uses 7 bits per character, fitting in a single byte. ASCII is simple, universal for English text, and forms the foundation of every modern encoding.

The limitation is obvious: 128 characters cannot represent the world's writing systems. The 8th bit of each byte went unused, which led to "extended ASCII" standards that used codes 128-255 for additional characters.
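
Both the 7-bit mapping and its limit can be verified directly in Python (a small sketch using only the standard library):

```python
# Each ASCII character is one 7-bit code that fits in a single byte.
assert ord("A") == 0x41
assert "A".encode("ascii") == b"\x41"

# Anything outside 0-127 is not representable in ASCII.
try:
    "é".encode("ascii")
except UnicodeEncodeError:
    print("é has no ASCII encoding")
```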

ISO 8859 and the Code Page Era

The ISO 8859 family of standards used the upper half (128-255) for region-specific characters:

  • ISO 8859-1 (Latin-1): Western European languages (French, German, Spanish)
  • ISO 8859-2 (Latin-2): Central European languages (Polish, Czech, Hungarian)
  • ISO 8859-5: Cyrillic (Russian, Bulgarian)
  • ISO 8859-6: Arabic
  • ISO 8859-15 (Latin-9): Updated Latin-1 with Euro sign (€)

Microsoft added its own variants, most notably Windows-1252, which is almost identical to ISO 8859-1 but uses codes 128-159 for characters like curly quotes and em dashes instead of control characters. This subtle difference has caused countless encoding bugs.
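
The difference shows up as soon as you decode a byte in the 0x80-0x9F range with each codec (a sketch; 0x93 is Windows-1252's left curly quote):

```python
smart_quote = bytes([0x93])

# Windows-1252 maps 0x93 to a printable character...
assert smart_quote.decode("windows-1252") == "\u201c"  # “ (left curly quote)

# ...while ISO 8859-1 maps it to an invisible C1 control character.
assert smart_quote.decode("iso-8859-1") == "\x93"
```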

UTF-8: The Universal Encoding

UTF-8 (designed by Ken Thompson and Rob Pike in 1992) is a variable-length encoding for Unicode. It uses 1 to 4 bytes per character, with the number of bytes determined by the leading bits of the first byte:

Codepoint Range      Bytes   Byte Pattern                          Example
U+0000 - U+007F      1       0xxxxxxx                              A = 41
U+0080 - U+07FF      2       110xxxxx 10xxxxxx                     é = C3 A9
U+0800 - U+FFFF      3       1110xxxx 10xxxxxx 10xxxxxx            世 = E4 B8 96
U+10000 - U+10FFFF   4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   😀 = F0 9F 98 80
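
These bit patterns can be applied by hand. The encoder below is an illustrative sketch, not a replacement for Python's built-in `str.encode`, but it produces identical bytes:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode codepoint using the patterns in the table."""
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# Matches the built-in codec for every row of the table.
for ch in "Aé世😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```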

UTF-8's key properties make it ideal for the modern web:

  • ASCII compatible: ASCII bytes (0x00-0x7F) are valid single-byte UTF-8
  • Self-synchronizing: You can always find the start of a character by scanning for a byte that does not begin with the bit pattern 10
  • No embedded null bytes: Only U+0000 itself encodes as a 0x00 byte, so UTF-8 text is safe for C-style null-terminated strings
  • Byte-order independent: No BOM needed, unlike UTF-16
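
The self-synchronizing property means you can count characters, or resynchronize after truncation, just by skipping continuation bytes, which all match 10xxxxxx (a minimal sketch):

```python
def count_chars(data: bytes) -> int:
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if b & 0xC0 != 0x80)

text = "Aé世😀"
assert count_chars(text.encode("utf-8")) == len(text)  # 4 characters, 10 bytes
```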

UTF-16: JavaScript and Windows

UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000 through U+FFFF) are encoded as a single 16-bit code unit. Characters above U+FFFF use a surrogate pair: two 16-bit code units that together represent the codepoint.

UTF-16 is the internal string encoding of JavaScript, Java, C#, and Windows. This is why "😀".length returns 2 in JavaScript - it counts 16-bit code units, and a supplementary character requires two of them.
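
Surrogate pairs can be computed directly from the codepoint: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A Python sketch of the arithmetic, using U+1F600 (😀) as the example:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary codepoint (above U+FFFF) into UTF-16 surrogates."""
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)   # low (trail) surrogate
    return high, low

assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)

# The same two code units appear in the UTF-16BE byte stream:
assert "😀".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```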

UTF-16 also has a byte-order issue: the two bytes of each code unit can be in big-endian (UTF-16BE) or little-endian (UTF-16LE) order. A Byte Order Mark (BOM, U+FEFF) at the start of the file indicates the byte order.
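
Python's BOM-aware utf-16 codec illustrates this: it writes a BOM on encode and uses it to choose the byte order on decode (a small sketch):

```python
data = "hi".encode("utf-16")                    # BOM followed by code units
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")   # little- or big-endian BOM

# The same codec decodes either byte order by reading the BOM:
assert b"\xff\xfe\x68\x00\x69\x00".decode("utf-16") == "hi"  # UTF-16LE
assert b"\xfe\xff\x00\x68\x00\x69".decode("utf-16") == "hi"  # UTF-16BE
```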

UTF-32: Simple but Wasteful

UTF-32 uses exactly 4 bytes per character, regardless of the codepoint. This makes string indexing trivial (character N is at byte offset N*4) but wastes space for text that is mostly ASCII. A file of English text that is 100 KB in UTF-8 would be 400 KB in UTF-32.
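
Both the size cost and the trivial indexing are easy to demonstrate (a sketch; the -be variant is used to avoid the codec's 4-byte BOM):

```python
text = "Hello, world!"                 # pure ASCII
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")

assert len(utf8) == len(text)          # 1 byte per ASCII character
assert len(utf32) == 4 * len(text)     # fixed 4 bytes per character

# Fixed width makes indexing trivial: character n starts at byte n*4.
assert utf32[4 * 4 : 5 * 4].decode("utf-32-be") == text[4]  # 'o'
```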

Debugging Encoding Issues

Common symptoms and their causes:

  • Ã© instead of é: UTF-8 bytes interpreted as ISO 8859-1 (double encoding)
  • � replacement characters: Invalid byte sequences in UTF-8 decoding
  • Chinese characters in a European text: Bytes interpreted with wrong encoding entirely
  • Question marks (?): Characters that cannot be represented in the target encoding
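
The first symptom above can be reproduced, and sometimes reversed, by re-applying the mistaken decode in the opposite direction (a sketch; this repair only works when no bytes were lost along the way):

```python
# Produce the classic mojibake: UTF-8 bytes read as ISO 8859-1.
mojibake = "é".encode("utf-8").decode("iso-8859-1")
print(mojibake)  # Ã©

# Undo it by reversing the mistake.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
assert repaired == "é"
```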

To debug encoding issues, examine the raw bytes. Use our Encoding Comparison tool to see exactly how a character is encoded across UTF-8, UTF-16, and ISO 8859-1, or use the Character Inspector to reveal hidden or unexpected characters in your text.

Best Practices

  • Use UTF-8 everywhere: For files, APIs, databases, and HTML. There is rarely a good reason to use anything else in new projects.
  • Declare your encoding: In HTML (<meta charset="UTF-8">), in HTTP headers (Content-Type: text/html; charset=utf-8), and in database connections.
  • Never assume string length equals byte length: Always use encoding-aware functions for string manipulation.
  • Test with non-ASCII data: Include accented characters, CJK text, RTL scripts, and emoji in your test data.
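
The length pitfall from the list is concrete in Python, which counts codepoints, while JavaScript counts UTF-16 code units (a small sketch):

```python
s = "café 😀"

assert len(s) == 6                           # Python: codepoints
assert len(s.encode("utf-8")) == 10          # UTF-8 bytes on the wire
assert len(s.encode("utf-16-le")) // 2 == 7  # UTF-16 code units (JS s.length)
```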

Further Reading