URL Encoding Guide

Why URLs need encoding, how percent-encoding works, and the rules that govern valid URL characters under RFC 3986.

What is URL Encoding?

URL encoding, formally known as percent-encoding, is a mechanism for converting characters into a format that can be safely included in a Uniform Resource Locator (URL). Because URLs can only contain a limited set of characters from the US-ASCII character set, any character outside this set -- or any character that has a special meaning within URL syntax -- must be encoded before it can appear in a URL.

The encoding process replaces each unsafe character with its UTF-8 bytes, writing every byte as a percent sign (%) followed by two hexadecimal digits. For instance, a space character (which is not permitted in a URL) becomes %20, because a space is the single byte 0x20 in both ASCII and UTF-8.

This encoding scheme is defined in RFC 3986 (Uniform Resource Identifier: Generic Syntax), published in January 2005 and still the authoritative standard for generic URI syntax. Earlier standards like RFC 2396 and RFC 1738 laid the groundwork, but RFC 3986 is the current reference that browsers, servers, and libraries follow.

Why URLs Cannot Contain All Characters

URLs were designed in the early 1990s when the internet was built primarily around ASCII text. Tim Berners-Lee's original specification for URLs intentionally limited the character set to ensure compatibility across diverse systems, networks, and software. There are several reasons why URLs restrict which characters are allowed:

Structural delimiters: Characters like /, ?, #, and & have special syntactic meaning in URLs. A forward slash separates path segments, a question mark introduces the query string, a hash introduces the fragment, and an ampersand separates query parameters. If these characters appeared literally in data values, parsers would misinterpret the URL structure.
Transport safety: URLs are transmitted across protocols (HTTP, email, FTP) that may modify certain characters. Spaces, for example, are problematic because different systems handle whitespace differently. Some protocols strip trailing spaces, others collapse multiple spaces into one.
Character set limitations: The original URL specification was based on a 7-bit ASCII subset. Characters from other scripts (Chinese, Arabic, Cyrillic) or even common symbols like curly braces were not part of this restricted set.
Gateway and proxy compatibility: URLs pass through many intermediaries -- proxies, firewalls, CDNs, and load balancers. Each intermediary must be able to parse the URL correctly. Restricting the character set ensures consistent parsing across all of these systems.

Reserved Characters

RFC 3986 defines a set of reserved characters that have special meaning in URI syntax. These characters are used as delimiters between different components of a URL:

:  /  ?  #  [  ]  @  !  $  &  '  (  )  *  +  ,  ;  =

The most commonly encountered reserved characters serve the following syntactic purposes:

: -- Separates the scheme from the authority (https:) and the host from the port (localhost:3000)
/ -- Separates path segments (/users/profile)
? -- Introduces the query string (?search=term)
# -- Introduces the fragment identifier (#section-2)
[ and ] -- Enclose IPv6 addresses ([::1])
@ -- Separates user info from the host (user@host)
& -- Separates query parameters (key1=val1&key2=val2)
= -- Separates query parameter keys from values (key=value)
+ -- Historically represents a space in form data

When these characters appear as part of data rather than as delimiters, they must be percent-encoded. For example, if a search query contains an ampersand (like "Tom & Jerry"), the ampersand must be encoded as %26 to prevent it from being interpreted as a query parameter separator.
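
As a rough illustration, the "Tom & Jerry" case can be reproduced with Python's standard urllib.parse module. The parameter name q is made up for this sketch, and safe="" is passed so that no character is exempted from encoding:

```python
from urllib.parse import parse_qs, quote

value = "Tom & Jerry"

# Unencoded: the parser treats & as a separator and splits the value in two.
raw_query = "q=" + value
broken = parse_qs(raw_query, keep_blank_values=True)

# Encoded: & becomes %26, so the value survives parsing intact.
safe_query = "q=" + quote(value, safe="")
intact = parse_qs(safe_query)

print(safe_query)  # q=Tom%20%26%20Jerry
print(broken)      # {'q': ['Tom '], ' Jerry': ['']}
print(intact)      # {'q': ['Tom & Jerry']}
```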

Unreserved Characters

Unreserved characters are those that can appear in a URL without encoding. RFC 3986 defines the following unreserved characters:

A-Z  a-z  0-9  -  .  _  ~

These 66 characters (26 uppercase letters, 26 lowercase letters, 10 digits, and 4 symbols) are guaranteed to be safe in any position within a URL. They carry no special meaning in URI syntax and are passed through without modification by all compliant implementations.

While unreserved characters may be percent-encoded (for example, the letter "A" could be written as %41), RFC 3986 explicitly recommends against encoding them unnecessarily. Encoded and unencoded forms of unreserved characters are considered equivalent, but encoding them adds unnecessary length and reduces readability.
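
A quick check of both claims, sketched with Python's urllib.parse (this assumes Python 3.7 or later, where the tilde is treated as unreserved):

```python
import string
from urllib.parse import quote, unquote

# The unreserved set: letters, digits, and the four symbols - . _ ~
unreserved = string.ascii_letters + string.digits + "-._~"
print(len(unreserved))  # 66

# Even with safe="" (no extra exemptions), unreserved characters pass through unchanged.
print(quote(unreserved, safe="") == unreserved)  # True

# The encoded form %41 decodes to the same letter as the plain form.
print(unquote("%41"))  # A
```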

How Percent-Encoding Works

The percent-encoding algorithm is straightforward:

Take the character to be encoded.
Convert it to its byte representation in UTF-8 encoding.
For each byte, write a percent sign (%) followed by the two-digit hexadecimal value of the byte (using uppercase letters A-F).
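
The three steps above can be sketched in a few lines of Python. The function name percent_encode is a hypothetical helper for illustration, and this version encodes everything outside the unreserved set; in real code a library function such as urllib.parse.quote is the usual choice:

```python
# The unreserved set from RFC 3986: letters, digits, and - . _ ~
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text: str) -> str:
    """Percent-encode every character that is not unreserved (illustrative helper)."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Convert to UTF-8 bytes, then write %XX per byte with uppercase hex.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode("a b"))  # a%20b
print(percent_encode("50%"))  # 50%25
print(percent_encode("é"))    # %C3%A9
```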

Here are examples for common characters:

Character  UTF-8 Bytes   Percent-Encoded
---------  -----------   ---------------
Space      0x20          %20
!          0x21          %21
#          0x23          %23
$          0x24          %24
&          0x26          %26
+          0x2B          %2B
/          0x2F          %2F
=          0x3D          %3D
?          0x3F          %3F
@          0x40          %40

For multi-byte characters (like those from non-Latin scripts), each byte is encoded separately. For example, the Euro sign (€) is encoded in UTF-8 as three bytes (0xE2 0x82 0xAC), resulting in the percent-encoded form %E2%82%AC.

Character  UTF-8 Bytes         Percent-Encoded
---------  ----------------    ---------------
é          0xC3 0xA9           %C3%A9
ü          0xC3 0xBC           %C3%BC
€          0xE2 0x82 0xAC      %E2%82%AC
世         0xE4 0xB8 0x96      %E4%B8%96
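
The rows of this table can be regenerated with Python's standard library, which also confirms that decoding restores the original character. This is only a sketch; safe="" is passed so that no character is exempted from encoding:

```python
from urllib.parse import quote, unquote

rows = []
for ch in ["é", "ü", "€", "世"]:
    # List the UTF-8 bytes, then let the library percent-encode the character.
    utf8_hex = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    encoded = quote(ch, safe="")
    # Decoding the percent-encoded bytes as UTF-8 gives the character back.
    assert unquote(encoded) == ch
    rows.append((ch, utf8_hex, encoded))
    print(ch, utf8_hex, encoded)
```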

Spaces: %20 vs + (Plus Sign)

One of the most common sources of confusion in URL encoding is how spaces are represented. There are two conventions, and which one is correct depends on the context:

%20 -- The RFC 3986 Standard

According to RFC 3986, spaces should always be encoded as %20. This is the correct encoding for spaces in URL paths, fragment identifiers, and anywhere that follows the generic URI syntax. For example:

https://example.com/my%20documents/file%20name.pdf

+ (Plus Sign) -- The HTML Form Convention

The HTML specification defines a different encoding for form data submitted via HTTP. When a form uses the application/x-www-form-urlencoded content type (the default for HTML forms), spaces are encoded as + instead of %20. This convention dates back to the earliest days of the web and is defined in the HTML specification, not in RFC 3986.

// Form submission: space becomes +
https://example.com/search?q=hello+world

// Path: space becomes %20
https://example.com/hello%20world/page

In practice, most modern web servers accept both forms in query strings. In a URL path, however, + is a literal plus sign rather than a space, so using it to stand for a space there is incorrect and can cause issues with strict parsers. The safest approach is to use %20 everywhere except when specifically implementing form encoding.
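
Python's urllib.parse exposes both conventions side by side, which makes the difference easy to see. In this sketch, quote and unquote follow the generic URI rules, while quote_plus and unquote_plus follow the form convention:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "hello world"
print(quote(text))       # hello%20world
print(quote_plus(text))  # hello+world

# Generic decoding keeps + as a literal plus; form decoding turns it into a space.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b

# Under the form convention, a real plus sign has to be written as %2B.
print(quote_plus("1+1"))    # 1%2B1
```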

Common Encoding Mistakes

Even experienced developers make URL encoding mistakes. Here are the most frequent pitfalls:

Double Encoding

Double encoding occurs when an already-encoded string is encoded again. For example, the space in "hello world" becomes %20 after the first encoding, but if encoded again, %20 becomes %2520 (because % is encoded as %25). Servers that decode only once will see %20 instead of a space, breaking the original intent.

Original:        hello world
First encoding:  hello%20world       (correct)
Double encoding: hello%2520world     (broken!)
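
The same sequence can be reproduced in Python as a quick sanity check (a sketch using urllib.parse.quote; any encoder that escapes % behaves the same way):

```python
from urllib.parse import quote, unquote

original = "hello world"
once = quote(original)
twice = quote(once)  # the % from the first pass is itself encoded as %25

print(once)   # hello%20world
print(twice)  # hello%2520world

# A single decode of the double-encoded string still leaves %20 in the text.
print(unquote(twice))           # hello%20world
print(unquote(unquote(twice)))  # hello world
```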

Encoding the Entire URL

Another common mistake is encoding an entire URL, including its structural characters. The scheme (https://), path separators (/), and query delimiter (?) should not be encoded. Only data values within the URL need encoding. In JavaScript, use encodeURI() for full URLs and encodeURIComponent() for individual values.
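
One way to avoid this in Python is to encode each data value on its own and only then assemble the URL. The host, path, and parameter names below are made up for the sketch. Note that urlencode follows the form convention, so spaces in query values come out as +:

```python
from urllib.parse import quote, urlencode, urlunsplit

# Encode only the data values, then assemble the URL from its components.
path = "/docs/" + quote("annual report.pdf", safe="")
query = urlencode({"q": "Tom & Jerry", "page": "2"})
url = urlunsplit(("https", "example.com", path, query, ""))
print(url)
# https://example.com/docs/annual%20report.pdf?q=Tom+%26+Jerry&page=2

# Encoding the whole URL as if it were one value destroys its structure.
whole = quote("https://example.com/docs/annual report.pdf", safe="")
print(whole)
# https%3A%2F%2Fexample.com%2Fdocs%2Fannual%20report.pdf
```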

Forgetting to Encode

Not encoding data values is equally problematic. An unencoded ampersand in a query value will be interpreted as a parameter separator, splitting one parameter into two. An unencoded hash symbol will be interpreted as a fragment delimiter, truncating the URL.
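
The hash case is easy to demonstrate with Python's urlsplit. The search URL and the value "C#basics" are hypothetical and chosen only to show the truncation:

```python
from urllib.parse import quote, urlsplit

# Unencoded: everything after # is parsed as the fragment, not as query data.
raw = "https://example.com/search?q=C#basics"
parts = urlsplit(raw)
print(parts.query)     # q=C
print(parts.fragment)  # basics

# Encoded: # becomes %23 and stays inside the query value.
fixed = "https://example.com/search?q=" + quote("C#basics", safe="")
fixed_parts = urlsplit(fixed)
print(fixed_parts.query)     # q=C%23basics
print(fixed_parts.fragment)  # (empty string)
```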

URL Encoding in Practice

Every major programming language provides built-in functions for URL encoding. Here is a quick overview:

// JavaScript
encodeURI('https://example.com/hello world')
// "https://example.com/hello%20world"

encodeURIComponent('hello world & goodbye')
// "hello%20world%20%26%20goodbye"

# Python
from urllib.parse import quote, quote_plus
quote('hello world')        # 'hello%20world'
quote_plus('hello world')   # 'hello+world'

// Go
import "net/url"
url.PathEscape("hello world")   // "hello%20world"
url.QueryEscape("hello world")  // "hello+world"

// PHP
rawurlencode('hello world')  // "hello%20world"
urlencode('hello world')     // "hello+world"

Internationalized URLs (IRIs)

As the internet became global, the need to support non-ASCII characters in URLs grew. Internationalized Resource Identifiers (IRIs), defined in RFC 3987, extend URIs to allow Unicode characters. Modern browsers display IRIs with their original characters in the address bar for readability, but convert them to percent-encoded URIs when making HTTP requests.

For example, the Japanese Wikipedia page for "Tokyo" displays as https://ja.wikipedia.org/wiki/東京 in the address bar, but the actual request uses the percent-encoded form of those characters.
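
The conversion for that path can be sketched with Python's urllib.parse.quote, which by default leaves / untouched and encodes the UTF-8 bytes of everything else that is not unreserved:

```python
from urllib.parse import quote, unquote

path = "/wiki/東京"

# Each of the two characters is three bytes in UTF-8, so six %XX groups appear.
encoded = quote(path)
print("https://ja.wikipedia.org" + encoded)
# https://ja.wikipedia.org/wiki/%E6%9D%B1%E4%BA%AC

# Decoding restores the readable form shown in the address bar.
print(unquote(encoded))  # /wiki/東京
```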

Try It Yourself

Understanding URL encoding is essential for web development, API design, and debugging network issues. Use our URL Encoder/Decoder tool to experiment with encoding and decoding in real time. Paste any text or URL and instantly see its encoded form, or decode a percent-encoded string back to its original characters.