
Extract URLs from Text with Regex — Email and URL Extraction Patterns

Extract URLs and email addresses from plain text using regex. Learn battle-tested patterns for finding HTTP/HTTPS URLs, bare domains, and emails in log files, HTML, and user...

Mian Ali Khalid · 4 min read
Use the tool
Email & URL Extractor
Extract every email address and URL from a block of text. Regex-based, case-insensitive, deduplicated, sorted output.
Open Email & URL Extractor →

Extracting URLs from text is harder than it looks — URLs contain special characters, and the boundaries (where a URL ends) are ambiguous in natural language. Use battle-tested patterns rather than writing your own.
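The boundary problem is easy to demonstrate: a naive "match until whitespace" pattern swallows sentence punctuation along with the URL.

```javascript
// A naive pattern grabs trailing sentence punctuation along with the URL:
const naive = /https?:\/\/\S+/g;
const sentence = 'Read the docs at https://example.com/guide.';

sentence.match(naive);
// ['https://example.com/guide.']  <- trailing period is not part of the URL
```

The patterns below either exclude likely punctuation from the character class or strip it in a cleanup pass.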

Extract emails and URLs automatically with the Email & URL Extractor.

URL regex pattern

// Battle-tested URL regex (handles most real-world URLs):
const URL_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&//=]*)/gi;

// Usage:
const text = 'Visit https://example.com/path?q=1 or http://sub.domain.co.uk/page#section';
const urls = text.match(URL_REGEX);
// ['https://example.com/path?q=1', 'http://sub.domain.co.uk/page#section']
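To get the deduplicated, sorted output the extractor tool produces, wrap the matches in a Set and sort; a minimal sketch using the simpler whitespace-bounded pattern:

```javascript
// Deduplicate and sort extracted URLs:
const URL_RE = /https?:\/\/[^\s<>"{}|\\^`[\]]+/gi;

function uniqueSortedURLs(text) {
  // Set removes duplicates; sort() gives stable lexicographic order
  return [...new Set(text.match(URL_RE) || [])].sort();
}

uniqueSortedURLs('See https://b.example and https://a.example and https://b.example');
// ['https://a.example', 'https://b.example']
```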

Using the URL constructor for validation

// After extracting, validate with URL constructor:
function extractValidURLs(text) {
  const pattern = /https?:\/\/[^\s<>"{}|\\^`[\]]+/gi;
  const candidates = text.match(pattern) || [];
  
  return candidates
    .map(url => url.replace(/[.,;:!?)]+$/, ''))  // Strip sentence punctuation once, up front
    .filter(url => {
      try {
        new URL(url);
        return true;
      } catch {
        return false;
      }
    });
}
}

const text = 'Check out https://example.com/page! And also http://test.org.';
extractValidURLs(text);
// ['https://example.com/page', 'http://test.org']

Email extraction regex

// Email regex (RFC 5322 simplified — good enough for extraction):
const EMAIL_REGEX = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

function extractEmails(text) {
  return [...new Set(text.match(EMAIL_REGEX) || [])];  // Unique emails
}

const text = `
Contact alice@example.com or bob+test@sub.domain.co.uk.
Also: admin@localhost (no dot after the @, so this regex skips it)
Don't email: @noemail or broken@
`;

extractEmails(text);
// ['alice@example.com', 'bob+test@sub.domain.co.uk']
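Because the domain part of an email is case-insensitive, it is common to lowercase matches before deduplicating. A sketch (lowercasing the local part is technically lossy per the RFC, but fine for extraction in practice):

```javascript
const EMAIL_RE = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

function extractEmailsNormalized(text) {
  const matches = text.match(EMAIL_RE) || [];
  // Lowercase first so 'Alice@Example.com' and 'alice@example.com' dedupe together
  return [...new Set(matches.map(e => e.toLowerCase()))];
}

extractEmailsNormalized('Alice@Example.com and alice@example.com');
// ['alice@example.com']
```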

Python: extract URLs and emails

import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(
    r'https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}'
    r'\b(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)',
    re.IGNORECASE
)

EMAIL_PATTERN = re.compile(
    r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}',
    re.IGNORECASE
)

def extract_from_text(text: str) -> dict:
    urls = URL_PATTERN.findall(text)
    emails = EMAIL_PATTERN.findall(text)
    
    # Clean trailing punctuation:
    clean_urls = [url.rstrip('.,;:!?)') for url in urls]
    
    return {
        'urls': list(set(clean_urls)),
        'emails': list(set(emails)),
    }

text = """
Visit https://example.com/page?q=test or mailto:admin@example.com.
Report issues at https://github.com/org/repo/issues.
"""
result = extract_from_text(text)
# {'urls': ['https://example.com/page?q=test', 'https://github.com/org/repo/issues'],
#  'emails': ['admin@example.com']}
# Note: set order is not guaranteed — sort the lists if you need stable output.

Extract from HTML (don’t use regex)

For HTML, parse the DOM — don’t use regex on HTML:

// Browser DOM extraction (most reliable):
function extractLinksFromPage() {
  return [...document.querySelectorAll('a[href]')]
    .map(a => a.href)  // Already absolute
    .filter(href => href.startsWith('http'));
}

function extractEmailsFromPage() {
  return [...document.querySelectorAll('a[href^="mailto:"]')]
    .map(a => a.href.replace('mailto:', '').split('?')[0]);
}

// Node.js with cheerio:
import * as cheerio from 'cheerio';  // npm install cheerio
import { readFileSync } from 'fs';

function extractFromHTML(html) {
  const $ = cheerio.load(html);
  const urls = new Set();
  const emails = new Set();
  
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href');
    if (href.startsWith('http')) urls.add(href);
    if (href.startsWith('mailto:')) emails.add(href.slice(7).split('?')[0]);
  });
  
  return { urls: [...urls], emails: [...emails] };
}

// Usage with a local file:
const { urls, emails } = extractFromHTML(readFileSync('page.html', 'utf8'));

Extract from log files

// Extract URLs from Apache/nginx access logs:
const logLine = '192.168.1.1 - - [12/May/2024:10:22:31] "GET /api/data?id=123 HTTP/1.1" 200 512';

function extractFromLog(logLine) {
  // Match quoted request portion:
  const requestMatch = logLine.match(/"(\S+)\s+(\S+)\s+HTTP/);
  if (requestMatch) {
    return requestMatch[2];  // The path
  }
  return null;
}

extractFromLog(logLine);  // '/api/data?id=123'

// Extract from referrer fields in logs:
const refererLog = '- https://google.com/search?q=test - "Mozilla/5.0"';
const urlFromReferer = refererLog.match(/https?:\/\/[^\s"]+/)?.[0];
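Request lines in access logs contain only paths, not full URLs. To reconstruct absolute URLs, resolve each path against the site's origin with the URL constructor (the origin here is a placeholder — take it from your server config or the Host header):

```javascript
// Resolve a request path from a log line against a base origin:
function pathToURL(path, origin = 'https://example.com') {  // origin is hypothetical
  // URL handles joining, query strings, and normalization for us
  return new URL(path, origin).href;
}

pathToURL('/api/data?id=123');
// 'https://example.com/api/data?id=123'
```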

Handle URLs in social media text

// Social media text has shortened URLs, @mentions, and #hashtags:
function extractSocialURLs(text) {
  // Match t.co, bit.ly, etc.:
  const pattern = /https?:\/\/[^\s)>\]"']+/gi;
  return (text.match(pattern) || [])
    .map(url => url.replace(/[.,!?:;]+$/, ''))  // Strip trailing punctuation
    .filter(url => {
      try { new URL(url); return true; }
      catch { return false; }
    });
}

// Filter out bare @mentions and #hashtags:
const tweet = 'Check out https://t.co/abc123 from @username #trending!';
extractSocialURLs(tweet);  // ['https://t.co/abc123']


Written by Mian Ali Khalid. Part of the Dev Productivity pillar.