Extract URLs from Text — Regex and Libraries for URL Detection
Extracting URLs from plain text requires a regex that handles http, https, and various URL formats. Here's how to extract URLs using regex in JavaScript and Python, handle edge cases, and validate the results.
Use the tool
Email & URL Extractor
Extract every email address and URL from a block of text. Regex-based, case-insensitive, deduplicated, sorted output.
Extracting URLs from text is harder than it looks — URLs can contain special characters, appear with or without protocols, and be interrupted by punctuation. Here’s a reliable approach for URL extraction.
Use the Email & URL Extractor to extract URLs from any text online.
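To see why plain string splitting falls short, compare it with the regex approach covered below (sample text invented for illustration):

```javascript
const text = 'Docs at https://example.com/guide, mirror: https://mirror.example.com.';
// Splitting on whitespace leaves trailing punctuation attached to each URL:
const tokens = text.split(/\s+/).filter(t => t.startsWith('http'));
// → ['https://example.com/guide,', 'https://mirror.example.com.']
```

The comma and period are part of the sentence, not the URL, which is why the extraction patterns below pair a match step with a punctuation-cleanup step.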
Basic URL regex
// Match http/https URLs:
const urlRegex = /https?:\/\/[^\s<>"{}|\\^`[\]]+/g;
const text = 'Visit https://example.com or http://docs.example.com/guide for more info.';
const urls = text.match(urlRegex);
// ['https://example.com', 'http://docs.example.com/guide']
More comprehensive URL regex
// Breaks the URL into explicit parts: protocol, optional www, domain, TLD, path, query, fragment:
const urlRegex = new RegExp(
  '(?:https?:\\/\\/)' +       // protocol
  '(?:www\\.)?' +             // optional www
  '(?:[a-zA-Z0-9\\-]+\\.)+' + // domain
  '[a-zA-Z]{2,}' +            // TLD
  '(?:\\/[^\\s<>"\']*)?' +    // optional path
  '(?:\\?[^\\s<>"\']*)?' +    // optional query
  '(?:#[^\\s<>"\']*)?',       // optional fragment
  'gi'
);
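A quick usage sketch of the assembled pattern (the same regex written as a single literal; the sample text is invented):

```javascript
// Same pattern as above, inlined so the example is self-contained:
const urlRegex = /https?:\/\/(?:www\.)?(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?(?:#[^\s<>"']*)?/gi;
const sample = 'Read https://www.example.com/guide and https://api.example.com/v2?page=1 today.';
const found = sample.match(urlRegex);
// → ['https://www.example.com/guide', 'https://api.example.com/v2?page=1']
```

Note that the path class still admits characters like `)` and `.`, so a trailing-punctuation cleanup step (shown in the next section) is still worth applying.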
JavaScript URL extraction with validation
function extractUrls(text) {
  const pattern = /https?:\/\/(?:[-\w]+\.)+\w+(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?(?:#[^\s<>"']*)?/gi;
  const matches = text.match(pattern) || [];
  // Clean up trailing punctuation:
  return matches.map(url => url.replace(/[.,;!?:'")\]]+$/, ''));
}
const text = `
Check out https://example.com/docs, also see
https://github.com/user/repo#readme for the source.
Contact us at https://example.com/contact?from=email.
`;
extractUrls(text);
// ['https://example.com/docs',
// 'https://github.com/user/repo#readme',
// 'https://example.com/contact?from=email']
Validate extracted URLs
function isValidUrl(string) {
  try {
    const url = new URL(string);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}
function extractValidUrls(text) {
  const pattern = /https?:\/\/[^\s<>"']+/gi;
  const candidates = text.match(pattern) || [];
  return candidates
    .map(url => url.replace(/[.,;!?:'")\]]+$/, ''))
    .filter(isValidUrl);
}
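Used together, the cleanup step plus the URL constructor reject junk that the loose pattern lets through. A self-contained sketch (helpers repeated from above; the sample text is invented):

```javascript
function isValidUrl(string) {
  try {
    const url = new URL(string);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}
function extractValidUrls(text) {
  const pattern = /https?:\/\/[^\s<>"']+/gi;
  const candidates = text.match(pattern) || [];
  return candidates
    .map(url => url.replace(/[.,;!?:'")\]]+$/, ''))
    .filter(isValidUrl);
}

// A malformed candidate like "https://.." reduces to "https://" after cleanup
// and is rejected by the URL constructor:
const out = extractValidUrls('See https://example.com/docs. Broken: https://..');
// → ['https://example.com/docs']
```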
Python URL extraction
import re
from urllib.parse import urlparse
URL_PATTERN = re.compile(
    r'https?://'           # protocol
    r'(?:[-\w]+\.)+\w+'    # domain
    r'(?:/[^\s<>"\']*)?'   # path
    r'(?:\?[^\s<>"\']*)?'  # query
    r'(?:#[^\s<>"\']*)?',  # fragment
    re.IGNORECASE
)

def extract_urls(text: str) -> list[str]:
    matches = URL_PATTERN.findall(text)
    # Clean trailing punctuation:
    cleaned = [re.sub(r'[.,;!?:\'")\]]+$', '', url) for url in matches]
    return [url for url in cleaned if url]

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return result.scheme in ('http', 'https') and bool(result.netloc)
    except Exception:
        return False
text = """
Check https://example.com/docs (recommended) and
see https://github.com/user/repo for source code.
"""
urls = extract_urls(text)
valid = [url for url in urls if validate_url(url)]
Extracting URLs from HTML
// From HTML, better to parse the DOM:
function extractUrlsFromHtml(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, 'text/html');
  const urls = [];
  // Links:
  doc.querySelectorAll('a[href]').forEach(a => {
    const href = a.getAttribute('href');
    if (/^https?:\/\//i.test(href)) urls.push(href);
  });
  // Images:
  doc.querySelectorAll('img[src]').forEach(img => {
    const src = img.getAttribute('src');
    if (/^https?:\/\//i.test(src)) urls.push(src);
  });
  return [...new Set(urls)]; // deduplicate
}
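DOMParser is a browser API; it is not available in Node.js without a package such as jsdom. As a rough fallback for simple, well-formed markup, quoted href/src attribute values can be pulled with a regex (the function name and approach here are illustrative, not a library API):

```javascript
// Minimal Node-friendly fallback: pull quoted href/src values with a regex.
// Assumes well-formed, quoted attributes; a real parser (e.g. jsdom) is more robust.
function extractUrlsFromHtmlString(html) {
  const attrPattern = /\b(?:href|src)=["']([^"']+)["']/gi;
  const urls = [];
  let match;
  while ((match = attrPattern.exec(html)) !== null) {
    if (/^https?:\/\//i.test(match[1])) urls.push(match[1]);
  }
  return [...new Set(urls)]; // deduplicate
}
const html = '<a href="https://example.com/a">A</a> <img src="https://example.com/img.png">';
const fromHtml = extractUrlsFromHtmlString(html);
// → ['https://example.com/a', 'https://example.com/img.png']
```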
Node.js: extracting from files
import { readFileSync } from 'fs';
const urlPattern = /https?:\/\/(?:[-\w]+\.)+\w+(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?(?:#[^\s<>"']*)?/gi;
function extractUrlsFromFile(filePath) {
  const content = readFileSync(filePath, 'utf8');
  const matches = content.match(urlPattern) || [];
  return [...new Set(matches.map(url => url.replace(/[.,;!?:'")\]]+$/, '')))];
}
// Find all external links in a markdown file:
const urls = extractUrlsFromFile('./README.md');
console.log(`Found ${urls.length} unique URLs`);
urls.forEach(url => console.log(url));
Finding broken links
async function checkUrls(urls) {
  const results = await Promise.allSettled(
    urls.map(async url => {
      // Note: some servers reject HEAD; fall back to GET if you see false positives.
      const response = await fetch(url, { method: 'HEAD', redirect: 'follow' });
      return { url, status: response.status, ok: response.ok };
    })
  );
  return results.map((result, i) => {
    if (result.status === 'fulfilled') return result.value;
    return { url: urls[i], status: 0, ok: false, error: result.reason.message };
  });
}
const urls = extractUrlsFromFile('./docs/README.md');
const checks = await checkUrls(urls);
const broken = checks.filter(r => !r.ok);
broken.forEach(r => console.log(`BROKEN: ${r.url} (${r.status || r.error})`));
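Firing hundreds of HEAD requests at once can hit rate limits or exhaust sockets. One simple mitigation, sketched here (the batch size of 5 is an arbitrary assumption; tune it per target), is to run the checks in fixed-size batches:

```javascript
// Run an async worker over items in fixed-size batches, preserving order.
async function inBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...await Promise.all(batch.map(worker)));
  }
  return results;
}
// Usage with the checker above, 5 URLs at a time:
// const checks = await inBatches(urls, 5, url => checkUrls([url]).then(r => r[0]));
```

For heavier workloads, a proper concurrency-limiting queue library is a better fit than hand-rolled batching, but this keeps the example dependency-free.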
Related tools
- Email & URL Extractor — extract URLs and emails online
- Email Validation Regex — validating email addresses
- URL Encoding — percent-encoding URLs
Related posts
- Contact Information Extraction — Emails, Phones, and URLs from Text — Extract emails, phone numbers, URLs, and addresses from unstructured text using …
- Email Extractor — How to Pull Email Addresses from Text — An email extractor scans a block of text and finds all valid email addresses. He…
- Email Extractor — Extract Email Addresses from Text — An email extractor finds and pulls all email addresses from a block of text usin…
- URL Encode Decode — Encode and Decode URLs Online — URL encoding converts special characters to percent-encoded sequences (%20 for s…
- URL Encoding Special Characters — Which Characters to Encode and When — URL encoding converts special characters to percent-encoded sequences (%20, %26,…
Written by Mian Ali Khalid. Part of the Dev Productivity pillar.