
Invisible Unicode Characters

The hidden characters that can break your code, compromise security, and corrupt data - and how to detect them.

The Problem

Unicode includes dozens of characters that produce no visible output. They are invisible in most editors, terminals, and browsers, but they are very much present in the byte stream. These characters can cause string comparisons to fail, make parsed keys and identifiers silently mismatch, enable security attacks, and produce subtle bugs that are extremely difficult to diagnose.

Two strings that look identical on screen can be completely different at the byte level. The word "hello" and "he​llo" (with a zero-width space after "e") appear the same visually but are different strings that will fail equality checks.
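A minimal Python sketch of that comparison follows. The variable names are illustrative, and the zero-width space is written as an escape so that it stays visible in the source:

```python
plain = "hello"
hidden = "he\u200bllo"  # zero-width space (U+200B) after "e"

# The strings differ even though most renderers draw them identically.
same = plain == hidden                 # False
lengths = (len(plain), len(hidden))    # (5, 6)

# In UTF-8 the zero-width space occupies three bytes: E2 80 8B.
encoded = hidden.encode("utf-8")       # b'he\xe2\x80\x8bllo'

# ascii() escapes every non-ASCII character, which exposes the hidden one.
print(ascii(hidden))                   # 'he\u200bllo'
```

Printing with ascii() (or inspecting the encoded bytes) is a quick way to confirm a suspicion when two values refuse to compare equal.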

Common Invisible Characters

Zero-Width Characters

These characters occupy no horizontal space and are completely invisible in most renderings:

  • U+200B Zero Width Space (ZWSP): Suggests a line-break opportunity without adding visible space. Often introduced by web browsers and rich-text editors when copying text.
  • U+200C Zero Width Non-Joiner (ZWNJ): Prevents two characters from being joined in scripts like Arabic and Persian where characters normally connect.
  • U+200D Zero Width Joiner (ZWJ): Forces two characters to join. Used extensively in emoji sequences (e.g., family emoji are composed of person emoji joined by ZWJ characters).
  • U+FEFF Byte Order Mark (BOM): Signals byte order and encoding at the start of a UTF-16, UTF-32, or UTF-8 stream. Anywhere else it acts as a zero-width no-break space (its formal Unicode name), and it frequently ends up in the middle of text when files are concatenated or converted.
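As a rough illustration, the sketch below prints the category and formal name of each character in the list above using Python's standard unicodedata module, then shows how a leading BOM survives a plain UTF-8 decode. The sample bytes are made up for the example:

```python
import unicodedata

ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\ufeff"]

# All four belong to the "Cf" (format) general category.
for ch in ZERO_WIDTH:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

# Hypothetical file content: a UTF-8 BOM followed by a CSV header.
raw = b"\xef\xbb\xbfid,name"

kept = raw.decode("utf-8")         # BOM stays as the first character
dropped = raw.decode("utf-8-sig")  # BOM is consumed by the codec

first_header = kept.split(",")[0]  # '\ufeffid', which is not equal to 'id'
```

Decoding with utf-8-sig is one common way to avoid a first column name that silently carries the BOM.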

Formatting Characters

  • U+00AD Soft Hyphen: Marks a potential hyphenation point. Invisible unless the word is broken across a line, where it renders as a hyphen.
  • U+00A0 No-Break Space: Looks like a regular space but prevents line-breaking. Common in text pasted from word processors.
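  • U+2060 Word Joiner: Prevents a line break at that position without adding any width. It is the opposite of ZWSP (which permits a break) and the recommended replacement for using U+FEFF as a zero-width no-break space.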
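These three characters do not share a Unicode general category, which matters for any filter built on categories. A small sketch, assuming Python's standard unicodedata module and an invented sample string:

```python
import unicodedata

soft_hyphen, nbsp, word_joiner = "\u00ad", "\u00a0", "\u2060"

categories = {
    "U+00AD": unicodedata.category(soft_hyphen),  # Cf (format)
    "U+00A0": unicodedata.category(nbsp),         # Zs (space separator)
    "U+2060": unicodedata.category(word_joiner),  # Cf (format)
}

# A no-break space is not equal to an ordinary space.
nbsp_equals_space = "10\u00a0kg" == "10 kg"       # False

# NFKC compatibility normalization folds U+00A0 into U+0020.
folded = unicodedata.normalize("NFKC", "10\u00a0kg")
```

Because the no-break space is a separator rather than a format character, a filter that only removes category Cf will leave it in place; normalization or an explicit replacement is needed for it.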

Bidirectional Control Characters

These characters control text direction in mixed left-to-right/right-to-left text:

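  • U+200E Left-to-Right Mark (LRM): An invisible character with strong left-to-right direction. It influences how adjacent neutral characters (spaces, punctuation) are ordered, without overriding the direction of other text.
  • U+200F Right-to-Left Mark (RLM): The right-to-left counterpart of LRM, an invisible character with strong right-to-left direction.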
  • U+202A-U+202E: Embedding and override controls for complex bidirectional layouts
  • U+2066-U+2069: Isolate controls (newer, more robust than embedding)
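The following sketch locates the characters listed above in a string and reports the bidirectional class that Python's unicodedata module assigns to each. The helper name bidi_report and the sample text are assumptions made for illustration:

```python
import unicodedata

# LRM, RLM, the embedding/override range, and the isolate range.
BIDI_CONTROLS = {
    "\u200e", "\u200f",
    *map(chr, range(0x202A, 0x202F)),  # U+202A..U+202E
    *map(chr, range(0x2066, 0x206A)),  # U+2066..U+2069
}

def bidi_report(text):
    """Return (index, code point, bidi class) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

sample = "abc\u200f \u202edef\u202c \u2066ghi\u2069"
report = bidi_report(sample)
for entry in report:
    print(entry)
```

The class names (for example RLO for the right-to-left override, LRI and PDI for the isolate pair) come from the Unicode Bidirectional Algorithm.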

Security Risks

Trojan Source Attack

In November 2021, researchers disclosed the "Trojan Source" attack (CVE-2021-42574), which uses bidirectional control characters to make source code display in a different order from the one the compiler actually processes. For example:

// What the developer sees in their editor:
if (isAdmin) {
  // grant access
}

// What the compiler sees (with hidden bidi overrides):
if (isAdmin‮ ⁦) {⁩ ⁦
  // grant access
}

The bidirectional override characters cause the code to render misleadingly while the compiler processes the actual (different) logical order. This can trick code reviewers into approving malicious changes.
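One way to catch this in review tooling is to scan source text for embedding, override, and isolate controls before it is merged. The sketch below is a simplified illustration, not a complete defense; the function names and the sample snippet (modeled loosely on the example above) are hypothetical:

```python
# Embedding/override controls and isolate controls.
BIDI_RANGES = [(0x202A, 0x202E), (0x2066, 0x2069)]

def is_bidi_control(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BIDI_RANGES)

def scan_source(source):
    """Return (line number, column, code point) for each bidi control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if is_bidi_control(ch):
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings

snippet = "if (isAdmin\u202e \u2066) {\u2069 \u2066\n  // grant access\n}\n"
findings = scan_source(snippet)
for line_no, col, cp in findings:
    print(f"line {line_no}, column {col}: {cp}")
```

A check like this could run as a pre-commit hook or CI step; legitimate right-to-left text in string literals or comments would need an allow-list, which this sketch does not attempt.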

Homoglyph Attacks

While not invisible, visually similar characters from different Unicode scripts can be used for phishing. The Cyrillic "а" (U+0430) looks identical to the Latin "a" (U+0061) but is a different character. An attacker could register a domain like аpple.com (with Cyrillic "а") that looks identical to apple.com in most fonts.
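A rough heuristic for spotting such spoofs is to check whether a label mixes scripts. Python's standard library does not expose the Unicode Script property, so the sketch below approximates it from the first word of each character's name; the function name scripts_used is an assumption for illustration:

```python
import unicodedata

def scripts_used(label):
    """Approximate the scripts in a label from the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

genuine = "apple"
spoofed = "\u0430pple"  # Cyrillic small letter a followed by Latin "pple"

print(scripts_used(genuine))  # only LATIN
print(scripts_used(spoofed))  # CYRILLIC and LATIN
```

A label that reports more than one script is worth flagging, though this name-based shortcut is only an approximation and would not catch a spoof written entirely in one look-alike script.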

Data Integrity Issues

Invisible characters in form inputs, database values, or API payloads can cause:

  • Duplicate records that slip past unique constraints (two strings that look identical but differ by a hidden character)
  • Login failures (username with hidden character)
  • Search not finding results (query contains invisible char)
  • JSON failures (a leading BOM rejected by a strict parser, or a key that parses but no longer matches the expected name)
  • CSV parsing issues (invisible char interpreted as field content)
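Two of these failure modes can be reproduced with Python's standard json module. The payloads below are invented for the example:

```python
import json

# A zero-width space inside a key parses cleanly, but the key no longer matches.
payload = json.loads('{"user\u200bname": "alice"}')
key_found = "username" in payload          # False
actual_keys = [ascii(k) for k in payload]  # shows the hidden U+200B

# A BOM at the start of an already-decoded string is rejected outright.
try:
    json.loads('\ufeff{"ok": true}')
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# The same effect breaks lookups such as a login check.
known_users = {"alice"}
login_ok = "alice\u200b" in known_users    # False
```

The first case is the harder one to diagnose, because nothing raises an error: the lookup simply comes back empty.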

Detection Strategies

Manual Inspection

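Most code editors can reveal invisible characters. In VS Code, set editor.renderWhitespace to all, editor.renderControlCharacters to true, and editor.unicodeHighlight.invisibleCharacters to true; editor.unicodeHighlight.ambiguousCharacters: true additionally flags look-alike characters such as homoglyphs. In vim, :set list shows tabs and line endings, and the listchars option can also mark no-break spaces and trailing spaces.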

Programmatic Detection

// JavaScript: detect common invisible characters
function hasInvisibleChars(str) {
  // Zero-width characters, bidi controls, BOM, soft hyphen, word joiner
  const invisible = /[\u200B\u200C\u200D\u200E\u200F\uFEFF\u00AD\u2060\u202A-\u202E\u2066-\u2069]/;
  return invisible.test(str);
}

# Python: strip format (Cf) and control (Cc) characters,
# keeping newlines, carriage returns, and tabs
import unicodedata
def strip_invisible(text):
    return ''.join(
        c for c in text
        if unicodedata.category(c) not in ('Cf', 'Cc')
        or c in ('\n', '\r', '\t')
    )
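A short usage sketch of the Python filter above, repeated here so that the snippet runs on its own. It also shows two side effects worth knowing about; the sample strings are illustrative:

```python
import unicodedata

def strip_invisible(text):
    return "".join(
        c for c in text
        if unicodedata.category(c) not in ("Cf", "Cc") or c in "\n\r\t"
    )

# Zero-width space and BOM are both category Cf, so both are removed.
cleaned = strip_invisible("he\u200bllo\ufeff")

# The no-break space is category Zs, so this filter leaves it in place.
nbsp_kept = strip_invisible("a\u00a0b")

# An emoji family sequence: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
split_family = strip_invisible(family)  # the joiners are gone: three separate emoji
```

The last case is why a blanket filter is a poor fit for text that may legitimately contain emoji or joining scripts.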

Sanitization

For user inputs where invisible characters are never legitimate (usernames, email addresses, search queries), strip them on input. For rich text where some invisible characters are valid (ZWJ in emoji, ZWNJ in Persian text), sanitize selectively based on context.
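A minimal sketch of context-dependent sanitization, assuming a hypothetical sanitize helper with a rich_text flag. The choice of which characters to allow is an assumption for illustration and would need tuning per application:

```python
import unicodedata

# Assumption: only ZWNJ and ZWJ are considered legitimate in rich text.
ALLOWED_IN_RICH_TEXT = {"\u200c", "\u200d"}

def sanitize(text, rich_text=False):
    """Drop format (Cf) and control (Cc) characters, with optional exceptions."""
    keep = ALLOWED_IN_RICH_TEXT if rich_text else set()
    out = []
    for ch in text:
        if ch in keep or ch in "\n\r\t":
            out.append(ch)
        elif unicodedata.category(ch) not in ("Cf", "Cc"):
            out.append(ch)
    return "".join(out)

username = sanitize("ali\u200bce")               # strict: hidden character removed

# Persian word written with a ZWNJ between two parts.
persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
rich = sanitize(persian, rich_text=True)         # ZWNJ preserved
strict = sanitize(persian)                       # ZWNJ removed, spelling altered
```

Keeping the allow-list explicit makes the policy easy to audit: anything not named there is removed in both modes.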

Try the Inspector

Paste any text into our Character Inspector to instantly identify invisible characters, view their Unicode codepoints, and clean them with one click. The inspector categorizes characters by danger level and highlights potential Trojan Source attacks.
