Invisible Unicode Characters
The hidden characters that can break your code, compromise security, and corrupt data - and how to detect them.
The Problem
Unicode includes dozens of characters that produce no visible output. They are invisible in editors, terminals, and browsers, but they are very much present in the byte stream. These invisible characters can cause string comparisons to fail, break JSON parsing, trigger security vulnerabilities, and produce subtle bugs that are extremely difficult to diagnose.
Two strings that look identical on screen can be completely different at the byte level. The word "hello" and "hello" (with a zero-width space after "e") appear the same visually but are different strings that will fail equality checks.
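A minimal Python sketch of this mismatch follows. The zero-width space is written as an explicit escape so it stays visible in the source, and the variable names are purely illustrative:

```python
# Two strings that render the same but differ by one code point.
plain = "hello"
tainted = "he\u200bllo"  # U+200B ZERO WIDTH SPACE after "e"

print(plain == tainted)           # False
print(len(plain), len(tainted))   # 5 6
print(tainted.encode("utf-8"))    # b'he\xe2\x80\x8bllo'
```

The extra three bytes (E2 80 8B) are the UTF-8 encoding of U+200B, which is why the two values never compare equal.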
Common Invisible Characters
Zero-Width Characters
These characters occupy no horizontal space and are completely invisible in most renderings:
- U+200B Zero Width Space (ZWSP): Suggests a line-break opportunity without adding visible space. Often introduced by web browsers and rich-text editors when copying text.
- U+200C Zero Width Non-Joiner (ZWNJ): Prevents two characters from being joined in scripts like Arabic and Persian where characters normally connect.
- U+200D Zero Width Joiner (ZWJ): Forces two characters to join. Used extensively in emoji sequences (e.g., family emoji are composed of person emoji joined by ZWJ characters).
- U+FEFF Byte Order Mark (BOM): Intended as a signature at the very start of a file or stream, where it indicates byte order in UTF-16 and UTF-32. It frequently ends up in the middle of text as an invisible character when files are concatenated or converted.
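One way to see what a string really contains is to list its code points by name. The sketch below uses Python's standard `unicodedata` module; `describe` is a hypothetical helper name, not part of any library:

```python
import unicodedata

def describe(text):
    """Return (code point, name) pairs for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# A family emoji: man, ZWJ, woman, ZWJ, girl (five code points, one glyph).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
for code, name in describe(family):
    print(code, name)
```

Running this on text copied from a web page or word processor is a quick way to confirm whether a zero-width character came along with it.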
Formatting Characters
- U+00AD Soft Hyphen: Marks a potential hyphenation point. Invisible unless the word is broken across a line, where it renders as a hyphen.
- U+00A0 No-Break Space: Looks like a regular space but prevents line-breaking. Common in text pasted from word processors.
- U+2060 Word Joiner: Prevents a line break at that position. It behaves like a zero-width no-break space and carries none of the joining semantics of ZWJ or ZWNJ.
Bidirectional Control Characters
These characters control text direction in mixed left-to-right/right-to-left text:
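- U+200E Left-to-Right Mark (LRM): An invisible character with strong left-to-right direction. It influences how adjacent direction-neutral characters, such as punctuation, are ordered.
- U+200F Right-to-Left Mark (RLM): The right-to-left counterpart of LRM, with the same effect on adjacent neutral characters.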
- U+202A-U+202E: Embedding and override controls for complex bidirectional layouts
- U+2066-U+2069: Isolate controls (newer, more robust than embedding)
Security Risks
Trojan Source Attack
In November 2021, researchers disclosed the "Trojan Source" attack (CVE-2021-42574), which uses bidirectional override characters to make source code render differently from the logic the compiler actually processes. For example:
    // What the developer sees in their editor:
    if (isAdmin) {
        // grant access
    }

    // What the compiler sees (with hidden bidi overrides):
    if (isAdmin ) {
        // grant access
    }

The bidirectional override characters cause the code to render misleadingly while the compiler processes the actual (different) logical order. This can trick code reviewers into approving malicious changes.
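A simple defence is to scan source files for the embedding, override, and isolate controls before review. The following is a rough sketch under that assumption. `find_bidi_controls` and the `sample` string are illustrative only, and the sample is not a working exploit:

```python
import unicodedata

# Embedding/override controls (U+202A-U+202E) and isolates (U+2066-U+2069).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}

def find_bidi_controls(source):
    """Yield (line, column, code point, name) for each bidi control found."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch)

# Illustrative input: a comment containing an override and its terminator.
sample = 'if (isAdmin) {\n    // grant \u202Eaccess\u202C\n}\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

A check like this could run in a pre-commit hook or CI step. Whether to reject every hit or only those in code (as opposed to localized strings) is a policy choice the sketch leaves open.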
Homoglyph Attacks
While not invisible, visually similar characters from different Unicode scripts can be used for phishing. The Cyrillic "а" (U+0430) looks identical to the Latin "a" (U+0061) but is a different character. An attacker could register a domain like аpple.com (with Cyrillic "а") that looks identical to apple.com in most fonts.
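Mixed-script labels like this can often be flagged with a rough heuristic. The sketch below treats the first word of each letter's Unicode name as a stand-in for its script. That is an approximation chosen for brevity, not the real Unicode Script property, and `scripts_used` is a hypothetical helper:

```python
import unicodedata

def scripts_used(label):
    """Return the set of script names (first word of each letter's Unicode name)."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            scripts.add(name.split()[0])
    return scripts

print(scripts_used("apple"))        # {'LATIN'}
print(scripts_used("\u0430pple"))   # contains both 'CYRILLIC' and 'LATIN'
```

A label that mixes more than one script is not necessarily malicious, but it is a reasonable trigger for closer inspection.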
Data Integrity Issues
Invisible characters in form inputs, database values, or API payloads can cause:
- Visually identical duplicates that slip past unique constraints (two "identical" strings that differ by a hidden character)
- Login failures (username with hidden character)
- Search not finding results (query contains invisible char)
- JSON parse errors (a BOM or other invisible char outside a string literal) and missed key lookups (invisible char inside a key)
- CSV parsing issues (invisible char interpreted as field content)
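Several of these failure modes can be reproduced with the Python standard library alone. The payloads and usernames below are made up for illustration:

```python
import json

# 1. A leading BOM makes an otherwise valid JSON document fail to parse.
try:
    json.loads('\ufeff{"name": "alice"}')
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True
print(bom_rejected)  # True

# 2. A zero-width space inside a key parses fine but breaks lookups.
data = json.loads('{"na\u200bme": "alice"}')
print("name" in data)  # False

# 3. Visually identical usernames are distinct set members / dict keys.
users = {"alice", "ali\u200bce"}
print(len(users))  # 2
```

The second and third cases are the harder ones to diagnose, because nothing raises an error: the data is simply stored or looked up under a key that nobody can see is different.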
Detection Strategies
Manual Inspection
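Most code editors can reveal invisible characters. In VS Code, set editor.unicodeHighlight.invisibleCharacters: true and editor.unicodeHighlight.ambiguousCharacters: true; editor.renderWhitespace: all additionally shows ordinary spaces and tabs. In vim, :set list makes tabs and line endings visible.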
Programmatic Detection
    // JavaScript: detect common invisible characters
    function hasInvisibleChars(str) {
      // Zero-width characters, bidi marks and controls, BOM, soft hyphen, word joiner
      const invisible = /[\u200B\u200C\u200D\u200E\u200F\uFEFF\u00AD\u2060\u202A-\u202E\u2066-\u2069]/;
      return invisible.test(str);
    }
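    # Python: strip invisible characters
    import unicodedata

    def strip_invisible(text):
        # Drop format (Cf) and control (Cc) characters, but keep
        # newline, carriage return, and tab.
        # Note: this also removes ZWJ and ZWNJ, which are category Cf.
        return ''.join(
            c for c in text
            if unicodedata.category(c) not in ('Cf', 'Cc')
            or c in ('\n', '\r', '\t')
        )

Sanitization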
For user inputs where invisible characters are never legitimate (usernames, email addresses, search queries), strip them on input. For rich text where some invisible characters are valid (ZWJ in emoji, ZWNJ in Persian text), sanitize selectively based on context.
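One way to express this split is a single sanitizer with an optional allow-list. This is a minimal sketch, assuming that "invisible" means Unicode category Cf and that ZWJ and ZWNJ are the only characters worth preserving. `sanitize` and `RICH_TEXT_ALLOWED` are hypothetical names:

```python
import unicodedata

# Code points that are legitimate in some rich text (assumed allow-list).
RICH_TEXT_ALLOWED = frozenset({"\u200c", "\u200d"})  # ZWNJ, ZWJ

def sanitize(text, allowed=frozenset()):
    """Drop format characters (category Cf) unless explicitly allowed."""
    return "".join(
        ch for ch in text
        if ch in allowed or unicodedata.category(ch) != "Cf"
    )

# Strict mode for identifiers: every format character is removed.
username = sanitize("ali\u200bce")

# Selective mode for rich text: the ZWSP goes, the emoji's ZWJ stays.
message = sanitize("\u200bhi \U0001F468\u200D\U0001F469", RICH_TEXT_ALLOWED)

print(username == "alice")     # True
print("\u200d" in message)     # True
```

Keeping the allow-list explicit makes the policy for each input field visible in code, so it does not end up buried in a regular expression.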
Try the Inspector
Paste any text into our Character Inspector to instantly identify invisible characters, view their Unicode codepoints, and clean them with one click. The inspector categorizes characters by danger level and highlights potential Trojan Source attacks.
Further Reading
- Trojan Source (Cambridge research): Original research on bidirectional text attacks in source code.
- Unicode Security Considerations (unicode.org): Official Unicode Technical Report on security issues with Unicode text.
- Confusable Characters (unicode.org): Unicode Consortium tool for identifying visually similar (confusable) characters.