Xerobit

Extract Emails from HTML — Parsing mailto Links and Text

Extract email addresses from HTML pages by scanning mailto: links, data attributes, and visible text. Includes JavaScript and Python code for client-side and server-side...

Mian Ali Khalid · 4 min read
Use the tool
Email & URL Extractor
Extract every email address and URL from a block of text. Regex-based, case-insensitive, deduplicated, sorted output.

HTML pages embed emails in multiple places: <a href="mailto:..."> links, visible text, data-email attributes, and sometimes JavaScript-generated or encoded content. Reliable extraction requires checking all of them.

Use the Email & URL Extractor to extract emails from any HTML or text input instantly.

Extract from mailto links (JavaScript)

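function extractMailtoEmails(doc = document) {
  // The "i" flag makes the attribute match case-insensitive (e.g. "MAILTO:").
  const links = doc.querySelectorAll('a[href^="mailto:" i]');
  return [...links].map(link => {
    // link.href is the resolved URL, so the scheme is already lowercase.
    // Drop the scheme and any ?subject=... query string.
    const email = link.href.slice('mailto:'.length).split('?')[0];
    return email.trim().toLowerCase();
  }).filter(Boolean);
}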

// Also capture the display text (some sites show the email as link text):
function extractEmailLinks(doc = document) {
  const results = [];
  doc.querySelectorAll('a').forEach(link => {
    const href = link.href; // empty string when the anchor has no href
    const text = link.textContent.trim();

    if (href.startsWith('mailto:')) {
      const email = href.slice('mailto:'.length).split('?')[0].trim().toLowerCase();
      // Skip bare "mailto:" links that carry no address.
      if (email) results.push({ email, source: 'mailto' });
    }

    // Email visible as link text
    if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(text)) {
      results.push({ email: text.toLowerCase(), source: 'link-text' });
    }
  });
  return results;
}
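
The helpers above keep only the text before the `?`, which is fine for simple links. A mailto: href can also carry several comma-separated recipients, percent-encoded characters, and extra addresses in `to`, `cc`, or `bcc` query fields. The following is a minimal sketch of a fuller parser; `parseMailto` is a hypothetical helper name, not part of any library, and it only checks for an `@` rather than validating each address.

```javascript
// Hypothetical helper: pull every address out of a single mailto: href.
function parseMailto(href) {
  if (!/^mailto:/i.test(href)) return [];

  const rest = href.slice('mailto:'.length);
  const queryStart = rest.indexOf('?');
  const path = queryStart === -1 ? rest : rest.slice(0, queryStart);
  const query = queryStart === -1 ? '' : rest.slice(queryStart + 1);

  // Recipients in the path, plus any to/cc/bcc fields in the query string.
  const parts = [path];
  for (const pair of query.split('&')) {
    const eq = pair.indexOf('=');
    if (eq === -1) continue;
    const name = pair.slice(0, eq).toLowerCase();
    if (name === 'to' || name === 'cc' || name === 'bcc') {
      parts.push(pair.slice(eq + 1));
    }
  }

  const emails = new Set();
  for (const part of parts) {
    // Split on literal commas first, then decode each address on its own.
    for (const raw of part.split(',')) {
      let decoded;
      try {
        decoded = decodeURIComponent(raw);
      } catch {
        decoded = raw; // malformed percent-encoding: keep the raw text
      }
      const email = decoded.trim().toLowerCase();
      if (email.includes('@')) emails.add(email);
    }
  }
  return [...emails];
}

console.log(parseMailto('mailto:a@example.com,b%40example.com?cc=C@example.com&subject=Hi'));
// [ 'a@example.com', 'b@example.com', 'c@example.com' ]
```

Splitting before decoding is deliberate: a comma that belongs inside an address has to be percent-encoded, so literal commas can be treated as separators.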

Extract from full page text (JavaScript)

const EMAIL_REGEX = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/g;

function extractAllEmails(doc = document) {
  const found = new Set();
  const addMatches = text =>
    (text.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  // From mailto links
  doc.querySelectorAll('a[href^="mailto:" i]').forEach(a => {
    const email = a.href.slice('mailto:'.length).split('?')[0].trim();
    if (email) found.add(email.toLowerCase());
  });

  // From visible text (fall back to textContent where innerText is unavailable)
  addMatches(doc.body?.innerText ?? doc.body?.textContent ?? '');

  // From HTML source (catches emails in comments, scripts, hidden inputs).
  // This pass can also match asset names such as logo@2x.png, so review the results.
  addMatches(doc.documentElement.innerHTML);

  return [...found];
}
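
Running the regex over raw HTML tends to pick up strings that only look like addresses, for example retina image names such as `logo@2x.png`. One way to reduce that noise is a post-filter like the sketch below. `looksLikeEmail` and the extension list are illustrative assumptions, not an exhaustive or authoritative rule set.

```javascript
// Hypothetical post-filter for regex matches taken from raw HTML.
// The extension list is an illustrative assumption, not an exhaustive one.
const ASSET_EXTENSIONS = new Set([
  'png', 'jpg', 'jpeg', 'gif', 'svg', 'webp', 'ico',
  'css', 'js', 'map', 'woff', 'woff2', 'ttf',
]);

function looksLikeEmail(candidate) {
  const at = candidate.lastIndexOf('@');
  if (at <= 0) return false;

  const local = candidate.slice(0, at);
  const domain = candidate.slice(at + 1).toLowerCase();
  const labels = domain.split('.');
  const tld = labels[labels.length - 1];

  if (ASSET_EXTENSIONS.has(tld)) return false; // e.g. logo@2x.png
  if (local.startsWith('.') || local.endsWith('.')) return false;
  if (local.includes('..') || domain.includes('..')) return false;
  // Every domain label must be non-empty and must not start or end with "-".
  return labels.every(l => l.length > 0 && !l.startsWith('-') && !l.endsWith('-'));
}

const candidates = ['sales@example.com', 'logo@2x.png', 'a..b@example.com', 'x@-bad.example.com'];
console.log(candidates.filter(looksLikeEmail)); // [ 'sales@example.com' ]
```

Applied to the output of `extractAllEmails`, this would look like `extractAllEmails().filter(looksLikeEmail)`.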

Extract from HTML string (Node.js)

import { parse } from 'node-html-parser';

const EMAIL_REGEX = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/gi;

function extractEmailsFromHtml(html) {
  const found = new Set();
  const root = parse(html);
  
  // Mailto links
  root.querySelectorAll('a[href]').forEach(a => {
    const href = (a.getAttribute('href') || '').trim();
    // URI schemes are case-insensitive, so accept "MAILTO:" as well
    if (href.toLowerCase().startsWith('mailto:')) {
      const email = href.slice(7).split('?')[0].trim();
      if (email) found.add(email.toLowerCase());
    }
  });
  
  // All text content
  const text = root.textContent;
  (text.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));
  
  // Raw HTML (catches hidden attributes)
  (html.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));
  
  return [...found];
}
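
The raw-HTML pass above finds a complete address stored in an attribute such as `data-email`. Some pages instead split the address across two attributes, which no single regex match can see. The sketch below assumes a convention of `data-user` and `data-domain` on the same tag; both attribute names are hypothetical, and the tag scan is a regex approximation that will miss tags whose attribute values contain `>`. With a parser available, reading the attributes through `getAttribute` is the more robust route.

```javascript
// Assumption: the page splits addresses across data-user / data-domain
// attributes on one tag. Both attribute names are hypothetical.
function extractSplitDataEmails(html) {
  const found = new Set();
  const tagPattern = /<[a-z][^>]*>/gi;

  for (const [tag] of html.matchAll(tagPattern)) {
    const user = readAttribute(tag, 'data-user');
    const domain = readAttribute(tag, 'data-domain');
    if (user && domain) found.add(`${user}@${domain}`.toLowerCase());
  }
  return [...found];
}

// Reads a double- or single-quoted attribute value from one tag's source text.
function readAttribute(tag, name) {
  const pattern = new RegExp(`\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, 'i');
  const match = tag.match(pattern);
  if (!match) return null;
  return (match[1] ?? match[2]).trim();
}

console.log(extractSplitDataEmails('<span data-user="info" data-domain="Example.com"></span>'));
// [ 'info@example.com' ]
```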

Handle obfuscated emails

Websites use several obfuscation tricks to foil naive scrapers:

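function extractObfuscatedEmails(html) {
  const found = new Set();
  const collect = text =>
    (text.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  // 1. HTML entities: &#64; = @  &#46; = .
  //    Decode every numeric entity, since some pages encode each character.
  const decodeCodePoint = (code, original) =>
    code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : original;
  const decoded = html
    .replace(/&#x([0-9a-f]+);/gi, (m, hex) => decodeCodePoint(parseInt(hex, 16), m))
    .replace(/&#(\d+);/g, (m, dec) => decodeCodePoint(parseInt(dec, 10), m));
  collect(decoded);

  // 2. [at] / (at) and [dot] / (dot) substitutions
  const normalized = decoded
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, '@')
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, '.')
    // Bare " at " is rewritten only when a "name dot tld" domain follows.
    // Replacing every " at " would turn ordinary prose into fake addresses.
    .replace(/\s+at\s+((?:[a-z0-9-]+\s+dot\s+)+[a-z]{2,})\b/gi,
      (_, domain) => '@' + domain.replace(/\s+dot\s+/gi, '.'));
  collect(normalized);

  // 3. Reversed text in CSS: direction: rtl + unicode-bidi: bidi-override
  // These need DOM access to detect visually; a cheap check is to look for
  // 'bidi-override' in the CSS.

  return [...found];
}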
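
The third trick, reversed text, stores the address backwards in the markup and relies on CSS to flip it for display. Once a candidate element's text is in hand, undoing the reversal is a string operation. The sketch below assumes the text has already been pulled from such an element; `unreverseEmail` and `EMAIL_PATTERN` are hypothetical names introduced for illustration.

```javascript
const EMAIL_PATTERN = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;

// Hypothetical helper: given the text of an element styled with
// "unicode-bidi: bidi-override; direction: rtl", undo the reversal.
// Returns the lowercase address, or null if the result is not email-shaped.
function unreverseEmail(text) {
  const reversed = [...text.trim()].reverse().join('');
  return EMAIL_PATTERN.test(reversed) ? reversed.toLowerCase() : null;
}

console.log(unreverseEmail('moc.elpmaxe@olleh')); // hello@example.com
console.log(unreverseEmail('not an email'));      // null
```

In a browser, candidate elements could be selected by checking `getComputedStyle(el).unicodeBidi === 'bidi-override'` together with `direction === 'rtl'`. A few strings are email-shaped in both directions, so results from this pass are best treated as lower-confidence than mailto matches.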

Cloudflare email protection

Cloudflare encodes emails to prevent scraping:

// Cloudflare encodes emails like: <span data-cfemail="...">
// They're decoded by a CF script at runtime

function decodeCloudflareEmail(encoded) {
  let email = '';
  // The first hex byte is the XOR key; each following byte is one character.
  const key = parseInt(encoded.substring(0, 2), 16);
  for (let n = 2; n < encoded.length; n += 2) {
    email += String.fromCharCode(parseInt(encoded.substring(n, n + 2), 16) ^ key);
  }
  return email;
}

// Find all CF-protected emails in a page:
function extractCloudflareEmails(doc = document) {
  const spans = doc.querySelectorAll('[data-cfemail]');
  return [...spans].map(span => {
    const encoded = span.getAttribute('data-cfemail');
    return decodeCloudflareEmail(encoded);
  });
}
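
When only the HTML string is available (no DOM), the same decoding can be driven by a regex. The sketch below assumes the markup Cloudflare typically emits: a `data-cfemail` attribute and link targets of the form `/cdn-cgi/l/email-protection#<hex>`. `decodeCfHex` repeats the XOR scheme from `decodeCloudflareEmail` so the block runs on its own, and `encodeCfHex` is a hypothetical inverse used only to build sample input. Both handle ASCII addresses only.

```javascript
// Same XOR scheme as decodeCloudflareEmail above, repeated so this runs alone.
function decodeCfHex(encoded) {
  const key = parseInt(encoded.slice(0, 2), 16);
  let email = '';
  for (let i = 2; i + 1 < encoded.length; i += 2) {
    email += String.fromCharCode(parseInt(encoded.slice(i, i + 2), 16) ^ key);
  }
  return email;
}

// Hypothetical inverse (ASCII only), used here just to build sample input.
function encodeCfHex(email, key) {
  const hex = n => n.toString(16).padStart(2, '0');
  return hex(key) + [...email].map(ch => hex(ch.charCodeAt(0) ^ key)).join('');
}

// Collect encoded strings from raw HTML: data-cfemail attributes and
// "/cdn-cgi/l/email-protection#<hex>" link targets.
function extractCloudflareEmailsFromHtml(html) {
  const pattern = /data-cfemail=["']([0-9a-f]+)["']|\/cdn-cgi\/l\/email-protection#([0-9a-f]+)/gi;
  const found = new Set();
  for (const match of html.matchAll(pattern)) {
    const decoded = decodeCfHex(match[1] || match[2]);
    if (decoded.includes('@')) found.add(decoded.toLowerCase());
  }
  return [...found];
}

const token = encodeCfHex('me@example.com', 0x5a);
const html = `<a href="/cdn-cgi/l/email-protection#${token}">` +
  `<span class="__cf_email__" data-cfemail="${token}">[email&#160;protected]</span></a>`;
console.log(extractCloudflareEmailsFromHtml(html)); // [ 'me@example.com' ]
```

The link and the span carry the same token here, so the `Set` collapses them into one result.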

Python: parse HTML for emails

from bs4 import BeautifulSoup
import re

EMAIL_PATTERN = r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'

def extract_emails_from_html(html: str) -> list[str]:
    found = set()
    soup = BeautifulSoup(html, 'html.parser')
    
    # mailto links
    for a in soup.find_all('a', href=True):
        href = a['href'].strip()
        # URI schemes are case-insensitive, so accept "MAILTO:" as well
        if href.lower().startswith('mailto:'):
            email = href[7:].split('?')[0].strip()
            if email:
                found.add(email.lower())
    
    # visible text + raw HTML
    for text in [soup.get_text(), html]:
        for email in re.findall(EMAIL_PATTERN, text, re.IGNORECASE):
            found.add(email.lower())
    
    return sorted(found)


Written by Mian Ali Khalid. Part of the Dev Productivity pillar.