X Xerobit

URL Canonicalization — Normalize URLs for APIs, Caching, and SEO

URL canonicalization normalizes URLs to a consistent form, removing redundant parameters, encoding differences, and trailing slash inconsistencies. Learn how to canonicalize...

Mian Ali Khalid · · 4 min read
Use the tool
URL Encoder / Decoder
Percent-encode and decode URLs per RFC 3986.
Open URL Encoder / Decoder →

URL canonicalization produces a consistent “canonical” form of a URL, making comparison, caching, and deduplication reliable. Without it, example.com/path, example.com/path/, and EXAMPLE.COM/Path are treated as different URLs.

Encode and decode URL components with the URL Encoder.

What canonicalization normalizes

Input URL:  HTTP://Example.COM:80/path/../blog/./Post%20One?b=2&a=1&b=2#section
Canonical:  https://example.com/blog/post%20one?a=1&b=2#section

Changes made:
1. Lowercase scheme: HTTP → https
2. Lowercase host: Example.COM → example.com
3. Remove default port: :80 removed (default for HTTP)
4. Resolve path: /path/../blog/./Post%20One → /blog/Post%20One
5. Lowercase percent-encoded chars: %20 stays lowercase
6. Sort query parameters: b=2&a=1 → a=1&b=2
7. Remove duplicate parameters: b=2&b=2 → b=2
8. Preserve fragment (context-dependent)

Canonical URL in JavaScript

import { URL } from 'url';

function canonicalizeURL(rawUrl, options = {}) {
  const {
    scheme = 'https',
    sortParams = true,
    removeFragment = false,
    removeTrailingSlash = true,
  } = options;

  const url = new URL(rawUrl);

  // Normalize scheme:
  url.protocol = scheme;

  // Lowercase hostname:
  url.hostname = url.hostname.toLowerCase();

  // Remove default ports:
  if ((url.protocol === 'https:' && url.port === '443') ||
      (url.protocol === 'http:'  && url.port === '80')) {
    url.port = '';
  }

  // Resolve dots in path:
  url.pathname = url.pathname
    .split('/')
    .reduce((acc, part) => {
      if (part === '..') acc.pop();
      else if (part !== '.') acc.push(part);
      return acc;
    }, [])
    .join('/') || '/';

  // Remove trailing slash (except root):
  if (removeTrailingSlash && url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }

  // Sort and deduplicate query parameters:
  if (sortParams && url.search) {
    const params = new URLSearchParams(url.search);
    const sorted = new URLSearchParams([...new Map(params)].sort());
    url.search = sorted.toString() ? '?' + sorted.toString() : '';
  }

  // Remove fragment:
  if (removeFragment) url.hash = '';

  return url.toString();
}

// Examples:
canonicalizeURL('HTTP://EXAMPLE.COM/path/../blog/post?b=2&a=1');
// "https://example.com/blog/post?a=1&b=2"

canonicalizeURL('https://example.com/path/');
// "https://example.com/path"

Python URL normalization

from urllib.parse import urlparse, urlunparse, urlencode, parse_qs

def canonicalize_url(url: str) -> str:
    """Normalize a URL to canonical form."""
    parsed = urlparse(url)
    
    # Normalize scheme and host:
    scheme = parsed.scheme.lower() or 'https'
    netloc = parsed.netloc.lower()
    
    # Remove default ports:
    if netloc.endswith(':80') and scheme == 'http':
        netloc = netloc[:-3]
    elif netloc.endswith(':443') and scheme == 'https':
        netloc = netloc[:-4]
    
    # Normalize path (resolve . and ..):
    path_parts = []
    for part in parsed.path.split('/'):
        if part == '..':
            if path_parts:
                path_parts.pop()
        elif part != '.':
            path_parts.append(part)
    path = '/'.join(path_parts) or '/'
    
    # Remove trailing slash (except root):
    if len(path) > 1 and path.endswith('/'):
        path = path.rstrip('/')
    
    # Sort query parameters:
    if parsed.query:
        params = parse_qs(parsed.query, keep_blank_values=True)
        sorted_params = sorted(params.items())
        query = urlencode(sorted_params, doseq=True)
    else:
        query = ''
    
    return urlunparse((scheme, netloc, path, '', query, ''))

SEO canonical URLs

For SEO, canonicalization tells search engines the “preferred” URL for duplicate content:

<!-- In HTML: declare canonical URL -->
<link rel="canonical" href="https://example.com/blog/post">

<!-- With trailing slash inconsistency: always redirect to one form -->
# nginx: redirect trailing slash to non-trailing slash:
rewrite ^(/.+)/$ $1 permanent;

# Or prefer trailing slash:
rewrite ^([^.]+[^/])$ $1/ permanent;
// Express: canonicalize before handling:
app.use((req, res, next) => {
  const canonical = req.path.replace(/\/+$/, '') || '/';
  if (req.path !== canonical) {
    return res.redirect(301, canonical + (req.search || ''));
  }
  next();
});

URL comparison (canonicalize before comparing)

// Without canonicalization: same page, different URLs appear unequal
const url1 = 'HTTP://Example.com/path/';
const url2 = 'https://example.com/path';
url1 === url2;  // false

// With canonicalization:
canonicalizeURL(url1) === canonicalizeURL(url2);  // true: both → "https://example.com/path"

// Cache key deduplication:
const cache = new Map();
function fetchWithCache(url) {
  const key = canonicalizeURL(url);
  if (cache.has(key)) return cache.get(key);
  
  const result = fetch(url).then(r => r.json());
  cache.set(key, result);
  return result;
}

Related posts

Related tool

URL Encoder / Decoder

Percent-encode and decode URLs per RFC 3986.

Written by Mian Ali Khalid. Part of the Dev Productivity pillar.