
Extract URLs from Text — Regex and Libraries for URL Detection

Extracting URLs from plain text requires a regex that handles http, https, and various URL formats. Here's how to extract URLs using regex in JavaScript and Python, handle edge cases, and validate the results.

Mian Ali Khalid · 4 min read

Extracting URLs from text is harder than it looks — URLs can contain special characters, appear with or without protocols, and be interrupted by punctuation. Here’s a reliable approach for URL extraction.
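A quick demonstration of the punctuation problem, using the simplest possible pattern:

```javascript
// Grabbing "everything up to whitespace" drags sentence punctuation along:
const naive = /https?:\/\/\S+/g;
const sample = 'See https://example.com/docs, then continue.';
const hit = sample.match(naive)[0];
// hit is 'https://example.com/docs,' (the trailing comma rides along in \S+)
```

The patterns below narrow the character class and clean up trailing punctuation to avoid exactly this.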

Use the Email & URL Extractor to extract URLs from any text online.

Basic URL regex

// Match http/https URLs:
const urlRegex = /https?:\/\/[^\s<>"{}|\\^`[\]]+/g;

const text = 'Visit https://example.com or http://docs.example.com/guide for more info.';
const urls = text.match(urlRegex);
// ['https://example.com', 'http://docs.example.com/guide']

More comprehensive URL regex

// A stricter pattern that validates the domain and TLD shape:
const urlRegex = new RegExp(
  '(?:https?:\\/\\/)' +        // protocol
  '(?:www\\.)?' +              // optional www
  '(?:[a-zA-Z0-9\\-]+\\.)+' +  // domain labels
  '[a-zA-Z]{2,}' +             // TLD
  '(?:\\/[^\\s<>"\']*)?' +     // optional path
  '(?:\\?[^\\s<>"\']*)?' +     // optional query
  '(?:#[^\\s<>"\']*)?',        // optional fragment
  'gi'
);
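Applied to sample text (the same pattern written as a single regex literal). Note that a trailing period after a bare domain is excluded, because the TLD must end in letters:

```javascript
const urlRegex = /https?:\/\/(?:www\.)?(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?(?:#[^\s<>"']*)?/gi;

const sample = 'Read https://www.example.com/guide?v=2 and https://docs.example.com.';
const urls = sample.match(urlRegex);
// ['https://www.example.com/guide?v=2', 'https://docs.example.com']
```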

JavaScript URL extraction with cleanup

function extractUrls(text) {
  const pattern = /https?:\/\/(?:[-\w]+\.)+\w+(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?(?:#[^\s<>"']*)?/gi;
  
  const matches = text.match(pattern) || [];
  
  // Clean up trailing punctuation:
  return matches.map(url => url.replace(/[.,;!?:'")\]]+$/, ''));
}

const text = `
  Check out https://example.com/docs, also see
  https://github.com/user/repo#readme for the source.
  Contact us at https://example.com/contact?from=email.
`;

extractUrls(text);
// ['https://example.com/docs',
//  'https://github.com/user/repo#readme',
//  'https://example.com/contact?from=email']

Validate extracted URLs

function isValidUrl(string) {
  try {
    const url = new URL(string);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}
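The explicit protocol check matters because the `URL` constructor is a parser, not a validator: it accepts any scheme, and only throws on strings with no scheme at all. A quick behavior check:

```javascript
// new URL() parses any scheme, so the protocol check above is what
// restricts results to web URLs:
new URL('ftp://example.com').protocol;    // 'ftp:'
new URL('javascript:alert(1)').protocol;  // 'javascript:'

// Strings with no scheme throw, which the try/catch turns into false:
let relative;
try { relative = new URL('/just/a/path'); } catch { relative = null; }
// relative is null
```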

function extractValidUrls(text) {
  const pattern = /https?:\/\/[^\s<>"']+/gi;
  const candidates = text.match(pattern) || [];
  
  return candidates
    .map(url => url.replace(/[.,;!?:'")\]]+$/, ''))
    .filter(isValidUrl);
}

Python URL extraction

import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(
    r'https?://'              # protocol
    r'(?:[-\w]+\.)+\w+'       # domain
    r'(?:/[^\s<>"\']*)?'      # path
    r'(?:\?[^\s<>"\']*)?'     # query
    r'(?:#[^\s<>"\']*)?',     # fragment
    re.IGNORECASE
)

def extract_urls(text: str) -> list[str]:  # list[str] needs Python 3.9+
    matches = URL_PATTERN.findall(text)
    # Clean trailing punctuation:
    cleaned = [re.sub(r'[.,;!?:\'")\]]+$', '', url) for url in matches]
    return [url for url in cleaned if url]

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return result.scheme in ('http', 'https') and bool(result.netloc)
    except Exception:
        return False

text = """
Check https://example.com/docs (recommended) and
see https://github.com/user/repo for source code.
"""

urls = extract_urls(text)
valid = [url for url in urls if validate_url(url)]

Extracting URLs from HTML

// For HTML, parsing the DOM is more reliable than running a regex over markup:
function extractUrlsFromHtml(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, 'text/html');
  
  const urls = [];
  
  // Links:
  doc.querySelectorAll('a[href]').forEach(a => {
    const href = a.getAttribute('href');
    if (href.startsWith('http')) urls.push(href);
  });
  
  // Images:
  doc.querySelectorAll('img[src]').forEach(img => {
    const src = img.getAttribute('src');
    if (src.startsWith('http')) urls.push(src);
  });
  
  return [...new Set(urls)];  // deduplicate
}
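DOMParser only exists in browsers; in Node.js you would reach for an HTML parser such as jsdom or cheerio instead. As a rough, regex-based fallback (fragile against unquoted or malformed attributes), you can scan href/src attributes directly. `extractUrlsFromHtmlString` is an illustrative name, not a standard API:

```javascript
// Rough Node.js fallback: scan quoted href/src attributes for absolute URLs.
function extractUrlsFromHtmlString(html) {
  const attrPattern = /(?:href|src)\s*=\s*["'](https?:\/\/[^"']+)["']/gi;
  const urls = [];
  let match;
  while ((match = attrPattern.exec(html)) !== null) {
    urls.push(match[1]);  // capture group 1 is the URL itself
  }
  return [...new Set(urls)];  // deduplicate, as above
}

extractUrlsFromHtmlString(
  '<a href="https://example.com">x</a> <img src="https://cdn.example.com/a.png">'
);
// ['https://example.com', 'https://cdn.example.com/a.png']
```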

Node.js: extracting from files

import { readFileSync } from 'fs';

const urlPattern = /https?:\/\/(?:[-\w]+\.)+\w+(?:\/[^\s<>"']*)?(?:\?[^\s<>"']*)?/gi;

function extractUrlsFromFile(filePath) {
  const content = readFileSync(filePath, 'utf8');
  const matches = content.match(urlPattern) || [];
  return [...new Set(matches.map(url => url.replace(/[.,;!?:'")\]]+$/, '')))];
}

// Find all external links in a markdown file:
const urls = extractUrlsFromFile('./README.md');
console.log(`Found ${urls.length} unique URLs`);
urls.forEach(url => console.log(url));

Check extracted URLs for broken links

Once you have a list of URLs, you can verify that each one still responds (Node 18+ ships a global fetch):

async function checkUrls(urls) {
  const results = await Promise.allSettled(
    urls.map(async url => {
      const response = await fetch(url, { method: 'HEAD', redirect: 'follow' });
      return { url, status: response.status, ok: response.ok };
    })
  );
  
  return results.map((result, i) => {
    if (result.status === 'fulfilled') return result.value;
    return { url: urls[i], status: 0, ok: false, error: result.reason.message };
  });
}

const urls = extractUrlsFromFile('./docs/README.md');
const checks = await checkUrls(urls);
const broken = checks.filter(r => !r.ok);
broken.forEach(r => console.log(`BROKEN: ${r.url} (${r.status || r.error})`));
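One caveat: `Promise.allSettled` starts every request at once. For long link lists, a small concurrency cap is kinder to servers. `mapWithConcurrency` below is a hypothetical helper sketching one way to add that cap:

```javascript
// Run fn over items with at most `limit` calls in flight at a time.
// Results are stored by index, so output order matches input order.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is exhausted:
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// e.g. await mapWithConcurrency(urls, 5, url => fetch(url, { method: 'HEAD' }));
```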


Written by Mian Ali Khalid. Part of the Dev Productivity pillar.