Extract URLs from Text with Regex — Email and URL Extraction Patterns
Use the tool
Email & URL Extractor
Extract every email address and URL from a block of text. Regex-based, case-insensitive, deduplicated, sorted output.
Extracting URLs from text is harder than it looks: URLs contain special characters, and in natural language it is ambiguous where a URL ends (a trailing period or closing parenthesis may belong to the sentence rather than the URL). Start from well-tested patterns rather than writing your own from scratch.
Extract emails and URLs automatically with the Email & URL Extractor.
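To make the boundary problem concrete, here is a minimal sketch of what a naive "everything up to the next space" pattern returns. The names NAIVE_URL and sample are illustrative only and are not part of the tool.

```javascript
// Naive pattern: treat every non-whitespace character after the scheme as part of the URL.
const NAIVE_URL = /https?:\/\/\S+/g;

const sample = 'Docs are at https://example.com/docs. (Mirror: https://example.org/mirror)';
const naiveMatches = sample.match(NAIVE_URL) || [];

console.log(naiveMatches);
// [ 'https://example.com/docs.', 'https://example.org/mirror)' ]
```

Both matches carry a character that belongs to the sentence (the period and the closing parenthesis), which is why the patterns in the following sections add a cleanup step.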
URL regex pattern
// Widely used URL regex (handles most real-world URLs; the {1,6} limit
// misses TLDs longer than six characters, such as .photography):
const URL_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&\/=]*)/gi;
// Usage:
const text = 'Visit https://example.com/path?q=1 or http://sub.domain.co.uk/page#section';
const urls = text.match(URL_REGEX);
// ['https://example.com/path?q=1', 'http://sub.domain.co.uk/page#section']
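The tool description mentions deduplicated, sorted output. A minimal sketch of that post-processing on top of the pattern above might look like the following; extractUniqueSortedURLs is a hypothetical helper name, and the plain lexicographic sort is an assumption about what "sorted" means.

```javascript
// Same pattern as above, repeated so this snippet runs on its own.
const URL_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&\/=]*)/gi;

// Hypothetical helper: collect matches, drop duplicates, sort lexicographically.
function extractUniqueSortedURLs(text) {
  const matches = text.match(URL_REGEX) || [];
  return [...new Set(matches)].sort();
}

const input = 'See https://b.example.com/x, then https://a.example.com/y, then https://b.example.com/x again.';
console.log(extractUniqueSortedURLs(input));
// [ 'https://a.example.com/y', 'https://b.example.com/x' ]
```

A Set compares strings exactly, so URLs that differ only in case or in a trailing slash still count as distinct here.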
Using the URL constructor for validation
// After extracting, validate with URL constructor:
function extractValidURLs(text) {
  const pattern = /https?:\/\/[^\s<>"{}|\\^`[\]]+/gi;
  const candidates = text.match(pattern) || [];
  return candidates
    // Remove common trailing punctuation (sentence endings) once, up front:
    .map(url => url.replace(/[.,;:!?)]+$/, ''))
    // Keep only candidates the URL constructor can parse:
    .filter(url => {
      try {
        new URL(url);
        return true;
      } catch {
        return false;
      }
    });
}
const text = 'Check out https://example.com/page! And also http://test.org.';
extractValidURLs(text);
// ['https://example.com/page', 'http://test.org']
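One caveat with stripping every trailing ")" is that some URLs legitimately end in a parenthesis, as many Wikipedia article links do. The sketch below is one possible refinement, not part of the original function: it removes a trailing ")" only when the URL contains more closing than opening parentheses. trimTrailingPunctuation is a hypothetical helper name.

```javascript
// Hypothetical helper: strip sentence punctuation, but keep a ")" that closes a "(" inside the URL.
function trimTrailingPunctuation(url) {
  let end = url.length;
  while (end > 0) {
    const ch = url[end - 1];
    if ('.,;:!?'.includes(ch)) {
      end--; // sentence punctuation is always dropped
    } else if (ch === ')') {
      const head = url.slice(0, end);
      const opens = (head.match(/\(/g) || []).length;
      const closes = (head.match(/\)/g) || []).length;
      if (closes > opens) {
        end--; // unbalanced ")" most likely belongs to the surrounding text
      } else {
        break; // balanced: treat it as part of the URL
      }
    } else {
      break;
    }
  }
  return url.slice(0, end);
}

console.log(trimTrailingPunctuation('https://en.wikipedia.org/wiki/Regex_(disambiguation)'));
// https://en.wikipedia.org/wiki/Regex_(disambiguation)
console.log(trimTrailingPunctuation('https://example.com/page).'));
// https://example.com/page
```

This is a heuristic: it assumes parentheses inside a URL come in pairs, which is common but not guaranteed.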
Email extraction regex
// Email regex (RFC 5322 simplified — good enough for extraction):
const EMAIL_REGEX = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
function extractEmails(text) {
  return [...new Set(text.match(EMAIL_REGEX) || [])]; // Unique emails
}
const text = `
Contact alice@example.com or bob+test@sub.domain.co.uk.
Also: admin@localhost (no dot in the domain, so this regex skips it)
Don't email: @noemail or broken@
`;
extractEmails(text);
// ['alice@example.com', 'bob+test@sub.domain.co.uk']
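A Set only removes exact duplicates, so the same mailbox written with different domain casing is kept more than once. The sketch below lowercases only the domain before deduplicating, on the assumption that domain names are case-insensitive while the local part may technically be case-sensitive. extractNormalizedEmails is a hypothetical helper name.

```javascript
// Same pattern as above, repeated so this snippet runs on its own.
const EMAIL_REGEX = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

// Hypothetical helper: lowercase the domain, keep the local part as written, then dedupe and sort.
function extractNormalizedEmails(text) {
  const unique = new Set();
  for (const email of text.match(EMAIL_REGEX) || []) {
    const at = email.lastIndexOf('@');
    unique.add(email.slice(0, at) + '@' + email.slice(at + 1).toLowerCase());
  }
  return [...unique].sort();
}

const mailText = 'Mail Alice@Example.COM, alice@example.com, or Alice@example.com.';
console.log(extractNormalizedEmails(mailText));
// [ 'Alice@example.com', 'alice@example.com' ]
```

If you would rather treat the whole address as case-insensitive, as most mail providers do in practice, lowercase the full string instead.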
Python: extract URLs and emails
import re
from urllib.parse import urlparse
URL_PATTERN = re.compile(
r'https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}'
r'\b(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)',
re.IGNORECASE
)
EMAIL_PATTERN = re.compile(
r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}',
re.IGNORECASE
)
def extract_from_text(text: str) -> dict:
    urls = URL_PATTERN.findall(text)
    emails = EMAIL_PATTERN.findall(text)
    # Clean trailing punctuation:
    clean_urls = [url.rstrip('.,;:!?)') for url in urls]
    # Sort after deduplicating so the output order is deterministic:
    return {
        'urls': sorted(set(clean_urls)),
        'emails': sorted(set(emails)),
    }
text = """
Visit https://example.com/page?q=test or mailto:admin@example.com.
Report issues at https://github.com/org/repo/issues.
"""
result = extract_from_text(text)
# {'urls': ['https://example.com/page?q=test', 'https://github.com/org/repo/issues'],
# 'emails': ['admin@example.com']}
Extract from HTML (don’t use regex)
For HTML, parse the DOM and read the link attributes directly instead of running regex over the markup:
// Browser DOM extraction (most reliable):
function extractLinksFromPage() {
  return [...document.querySelectorAll('a[href]')]
    .map(a => a.href) // Already absolute
    .filter(href => href.startsWith('http'));
}
function extractEmailsFromPage() {
  return [...document.querySelectorAll('a[href^="mailto:"]')]
    .map(a => a.href.replace('mailto:', '').split('?')[0]);
}
// Node.js with cheerio:
import * as cheerio from 'cheerio'; // npm install cheerio
function extractFromHTML(html) {
  const $ = cheerio.load(html);
  const urls = new Set();
  const emails = new Set();
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href');
    // Unlike a.href in the browser, attr('href') returns the raw attribute,
    // so relative links are not resolved and are skipped here.
    if (href.startsWith('http')) urls.add(href);
    if (href.startsWith('mailto:')) emails.add(href.slice(7).split('?')[0]);
  });
  return { urls: [...urls], emails: [...emails] };
}
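Outside the browser, href attributes come back as written in the markup, so relative links need to be resolved by hand. A minimal sketch using the standard URL constructor follows; resolveLinks is a hypothetical helper, and the base URL is assumed to be the address the page was fetched from.

```javascript
// Hypothetical helper: resolve raw href values against the page URL and keep only http(s) links.
function resolveLinks(hrefs, baseUrl) {
  const urls = new Set();
  for (const href of hrefs) {
    try {
      const resolved = new URL(href, baseUrl); // absolute hrefs ignore the base
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        urls.add(resolved.href);
      }
    } catch {
      // Unparseable href: skip it.
    }
  }
  return [...urls];
}

const rawHrefs = ['/about', 'team.html', 'https://example.org/x', 'mailto:hi@example.com'];
console.log(resolveLinks(rawHrefs, 'https://example.com/docs/index.html'));
// [ 'https://example.com/about',
//   'https://example.com/docs/team.html',
//   'https://example.org/x' ]
```

Checking the parsed protocol is stricter than startsWith('http'), and it also filters out mailto: and javascript: links.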
Extract from log files
// Extract URLs from Apache/nginx access logs:
const logLine = '192.168.1.1 - - [12/May/2024:10:22:31] "GET /api/data?id=123 HTTP/1.1" 200 512';
function extractFromLog(logLine) {
  // Match quoted request portion:
  const requestMatch = logLine.match(/"(\S+)\s+(\S+)\s+HTTP/);
  if (requestMatch) {
    return requestMatch[2]; // The path
  }
  return null;
}
// Extract from referrer fields in logs:
const refererLog = '- https://google.com/search?q=test - "Mozilla/5.0"';
const urlFromReferer = refererLog.match(/https?:\/\/[^\s"]+/)?.[0];
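When the log uses the common "combined" layout (request, status, size, referrer, user agent), both the request path and the referrer can be pulled out in a single pass. The sketch below assumes that layout. parseCombinedLogLine and the assumedOrigin parameter are hypothetical, and the origin has to be supplied by you because access logs record only the path.

```javascript
// Assumed layout: "METHOD PATH HTTP/x.y" STATUS SIZE "REFERER" "USER-AGENT"
const COMBINED_LOG = /"(\S+)\s+(\S+)\s+HTTP\/[\d.]+"\s+(\d{3})\s+(\S+)\s+"([^"]*)"\s+"([^"]*)"/;

// Hypothetical helper: returns null when the line does not follow the assumed layout.
function parseCombinedLogLine(line, assumedOrigin) {
  const m = line.match(COMBINED_LOG);
  if (!m) return null;
  const [, method, path, status, , referer] = m;
  return {
    method,
    status: Number(status),
    requestUrl: new URL(path, assumedOrigin).href, // rebuild an absolute URL
    referer: referer === '-' ? null : referer,     // "-" means no referrer was sent
  };
}

const combinedLine = '192.168.1.1 - - [12/May/2024:10:22:31 +0000] "GET /api/data?id=123 HTTP/1.1" 200 512 "https://google.com/search?q=test" "Mozilla/5.0"';
console.log(parseCombinedLogLine(combinedLine, 'https://example.com'));
// { method: 'GET', status: 200,
//   requestUrl: 'https://example.com/api/data?id=123',
//   referer: 'https://google.com/search?q=test' }
```

Lines in other formats return null, so the simpler single-field patterns above remain a useful fallback.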
Handle URLs in social media text
// Social media text has shortened URLs, @mentions, and #hashtags:
function extractSocialURLs(text) {
  // Match t.co, bit.ly, etc.:
  const pattern = /https?:\/\/[^\s)>\]"']+/gi;
  return (text.match(pattern) || [])
    .map(url => url.replace(/[.,!?:;]+$/, '')) // Strip trailing punctuation
    .filter(url => {
      try { new URL(url); return true; }
      catch { return false; }
    });
}
// Bare @mentions and #hashtags never match, because the pattern requires http(s)://
const tweet = 'Check out https://t.co/abc123 from @username #trending!';
extractSocialURLs(tweet); // ['https://t.co/abc123']
Related tools
- Email & URL Extractor — extract URLs and emails from text
- Regex Tester — test regex patterns
- URL Encoder — encode extracted URLs
Written by Mian Ali Khalid. Part of the Dev Productivity pillar.