Extract Emails from HTML — Parsing mailto Links and Text
HTML pages embed emails in multiple places: <a href="mailto:..."> links, visible text, data-email attributes, and sometimes JavaScript-generated or encoded content. Reliable extraction requires checking all of them.
Use the Email & URL Extractor to extract emails from any HTML or text input instantly.
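Of the locations listed above, data attributes are the one the snippets in this article do not handle on their own (they are only caught indirectly by the raw-HTML regex pass). The following is a minimal sketch, using only Python's standard library, that reads every data-* attribute and keeps values that look like emails. The names DataAttrEmailParser and extract_data_attr_emails are illustrative, not part of any library, and the regex is the same loose pattern used elsewhere in this article.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}")


class DataAttrEmailParser(HTMLParser):
    """Collects email-looking values from any data-* attribute."""

    def __init__(self):
        super().__init__()
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names arrive lowercased; value is None for bare attributes.
            if name.startswith("data-") and value:
                for match in EMAIL_RE.findall(value):
                    self.emails.add(match.lower())


def extract_data_attr_emails(html: str) -> list[str]:
    parser = DataAttrEmailParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.emails)
```

Calling extract_data_attr_emails('<span data-email="Sales@Example.com"></span>') should return the lowercased address. Attributes such as data-id that hold no email are ignored.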
Extract from mailto links (JavaScript, browser)
function extractMailtoEmails(doc = document) {
  const links = doc.querySelectorAll('a[href^="mailto:"]');
  return [...links].map(link => {
    const email = link.href.replace('mailto:', '').split('?')[0];
    return email.trim().toLowerCase();
  }).filter(Boolean);
}

// Also get display text (some sites show email as link text):
function extractEmailLinks(doc = document) {
  const results = [];
  doc.querySelectorAll('a').forEach(link => {
    const href = link.href;
    const text = link.textContent.trim();
    if (href.startsWith('mailto:')) {
      results.push({
        email: href.replace('mailto:', '').split('?')[0],
        source: 'mailto',
      });
    }
    // Email visible as link text
    if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(text)) {
      results.push({ email: text, source: 'link-text' });
    }
  });
  return results;
}
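The snippets above take everything before the ? as a single address. A mailto: URL can also carry several comma-separated recipients, percent-encoded characters, and extra addresses in to=, cc= and bcc= fields. Below is a rough sketch of a fuller parser in Python. The function name parse_mailto and the ADDRESS_FIELDS set are assumptions made for this example. Note that it deliberately avoids urllib.parse.parse_qs, which would turn a literal + in an address into a space.

```python
from urllib.parse import unquote, urlsplit

# Header fields in the query part that can hold addresses.
ADDRESS_FIELDS = {"to", "cc", "bcc"}


def parse_mailto(href: str) -> list[str]:
    """Return every address named in a mailto: URL, lowercased, in order."""
    parts = urlsplit(href.strip())
    if parts.scheme != "mailto":  # urlsplit lowercases the scheme
        return []

    # Recipients before '?' are comma-separated; split before decoding
    # so an encoded comma (%2C) is not treated as a separator.
    raw = parts.path.split(",")

    for field in parts.query.split("&"):
        name, _, value = field.partition("=")
        if name.lower() in ADDRESS_FIELDS:
            raw.extend(value.split(","))

    emails = []
    for item in raw:
        email = unquote(item).strip().lower()
        if email and email not in emails:
            emails.append(email)
    return emails
```

For example, parse_mailto("mailto:ann@example.com?subject=Hi") yields a one-element list, while a link with a cc= field yields the primary recipient followed by the copied address.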
Extract from full page text (JavaScript)
const EMAIL_REGEX = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/g;

function extractAllEmails(doc = document) {
  const found = new Set();

  // From mailto links
  doc.querySelectorAll('a[href^="mailto:"]').forEach(a => {
    const email = a.href.slice(7).split('?')[0];
    found.add(email.toLowerCase());
  });

  // From visible text
  const text = doc.body?.innerText || '';
  const matches = text.match(EMAIL_REGEX) || [];
  matches.forEach(e => found.add(e.toLowerCase()));

  // From HTML source (catches emails in comments, scripts, hidden inputs)
  const html = doc.documentElement.innerHTML;
  const htmlMatches = html.match(EMAIL_REGEX) || [];
  htmlMatches.forEach(e => found.add(e.toLowerCase()));

  return [...found];
}
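Scanning raw HTML with this regex also picks up strings that only look like emails, most commonly retina image names such as logo@2x.png. One possible mitigation, sketched here in Python, is to drop matches that end in a known asset extension. The ASSET_SUFFIXES list and the function name extract_plausible_emails are assumptions for illustration; tune the list to the pages you process.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b")

# Hypothetical deny-list: file extensions that often follow "@" in markup.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")


def extract_plausible_emails(raw_html: str) -> list[str]:
    """Regex-scan raw HTML, skipping matches that end like a file name."""
    found = set()
    for match in EMAIL_RE.findall(raw_html):
        email = match.lower()
        if email.endswith(ASSET_SUFFIXES):
            continue  # e.g. logo@2x.png is an image, not an address
        found.add(email)
    return sorted(found)
```

This is a heuristic, not validation: it removes the most frequent false positives but does not confirm that a remaining match is a deliverable address.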
Extract from HTML string (Node.js)
import { parse } from 'node-html-parser';

const EMAIL_REGEX = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/gi;

function extractEmailsFromHtml(html) {
  const found = new Set();
  const root = parse(html);

  // Mailto links
  root.querySelectorAll('a[href]').forEach(a => {
    const href = a.getAttribute('href') || '';
    if (href.startsWith('mailto:')) {
      const email = href.slice(7).split('?')[0].trim();
      if (email) found.add(email.toLowerCase());
    }
  });

  // All text content
  const text = root.textContent;
  (text.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  // Raw HTML (catches hidden attributes)
  (html.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  return [...found];
}
Handle obfuscated emails
Websites use several obfuscation tricks to foil naive scrapers:
// Uses EMAIL_REGEX as defined in the previous sections.
function extractObfuscatedEmails(html) {
  const found = new Set();

  // 1. HTML entities: &#64; / &#x40; = @, &#46; / &#x2e; = .
  const decoded = html
    .replace(/&#64;/g, '@')
    .replace(/&#x40;/gi, '@')
    .replace(/&#46;/g, '.')
    .replace(/&#x2e;/gi, '.');
  (decoded.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  // 2. [at] and [dot] substitutions
  // Note: the bare " at " rule also rewrites ordinary prose ("meet at noon"),
  // so only results that still match EMAIL_REGEX are kept.
  const normalized = html
    .replace(/\s*\[at\]\s*/gi, '@')
    .replace(/\s*\(at\)\s*/gi, '@')
    .replace(/\s+at\s+/gi, '@')
    .replace(/\s*\[dot\]\s*/gi, '.')
    .replace(/\s*\(dot\)\s*/gi, '.');
  (normalized.match(EMAIL_REGEX) || []).forEach(e => found.add(e.toLowerCase()));

  // 3. Reversed text in CSS: direction: rtl + unicode-bidi: bidi-override
  // Detecting this visually needs DOM access; a cheap check is to look for
  // 'bidi-override' in the CSS.

  return [...found];
}
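Trick 3 is only described in a comment above. As a rough illustration of how it could be handled server-side, the Python sketch below looks for elements whose inline style mentions bidi-override, collects their text, and reverses it before running the regex. It assumes the style is inline (rules in external stylesheets are not seen), and the names ReversedTextParser, VOID_TAGS and extract_reversed_emails are made up for this example.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}")
# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link", "wbr"}


class ReversedTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside a bidi-override element
        self.buffer = []   # text collected inside that element
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        if self.depth or "bidi-override" in style.lower():
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            visible = "".join(self.buffer)[::-1]  # undo the visual reversal
            self.emails.update(m.lower() for m in EMAIL_RE.findall(visible))
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_reversed_emails(html: str) -> list[str]:
    parser = ReversedTextParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.emails)
```

A span containing moc.elpmaxe@ofni with that inline style would be read back as info@example.com. Pages that apply the style through a class need a CSS-aware approach or a real browser.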
Cloudflare email protection
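Cloudflare's email protection replaces each address with a hex string in a data-cfemail attribute. The first byte of that string is an XOR key and the remaining bytes are the address XORed with it: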
// Cloudflare encodes emails like: <span data-cfemail="...">
// They're decoded by a CF script at runtime
function decodeCloudflareEmail(encoded) {
  let email = '';
  // The first byte (two hex digits) is the XOR key
  const key = parseInt(encoded.substring(0, 2), 16);
  for (let n = 2; n < encoded.length; n += 2) {
    email += String.fromCharCode(parseInt(encoded.substring(n, n + 2), 16) ^ key);
  }
  return email;
}

// Find all CF-protected emails in a page:
function extractCloudflareEmails(doc = document) {
  const spans = doc.querySelectorAll('[data-cfemail]');
  return [...spans].map(span => {
    const encoded = span.getAttribute('data-cfemail');
    return decodeCloudflareEmail(encoded);
  });
}
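To make the XOR scheme concrete, here is a small Python sketch of the same decoding, plus a regex pass over raw HTML for server-side use. With key 0x5a, the address a@b.co encodes to 5a3b1a38743935 (for instance, 0x61 XOR 0x5a = 0x3b for the letter a). The helper encode_cfemail is a hypothetical inverse written only to produce test input; it is not something Cloudflare exposes. Decoding the bytes as UTF-8 is also an assumption; the JavaScript above treats each byte as one character.

```python
import re

# An even number of hex digits inside a data-cfemail attribute.
CFEMAIL_RE = re.compile(r'data-cfemail="((?:[0-9a-fA-F]{2})+)"')


def decode_cfemail(encoded: str) -> str:
    """First byte is the XOR key; the rest is the address XORed with it."""
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8", errors="replace")


def encode_cfemail(email: str, key: int = 0x5A) -> str:
    """Hypothetical inverse of decode_cfemail, for building test fixtures."""
    return bytes([key] + [b ^ key for b in email.encode("utf-8")]).hex()


def extract_cf_emails(raw_html: str) -> list[str]:
    """Decode every data-cfemail attribute found in a raw HTML string."""
    return [decode_cfemail(h) for h in CFEMAIL_RE.findall(raw_html)]
```

Because XOR is its own inverse, decode_cfemail(encode_cfemail(x, k)) returns x for any single-byte key k, which makes the round trip an easy sanity check.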
Python: parse HTML for emails
from bs4 import BeautifulSoup
import re

EMAIL_PATTERN = r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'

def extract_emails_from_html(html: str) -> list[str]:
    found = set()
    soup = BeautifulSoup(html, 'html.parser')

    # mailto links
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('mailto:'):
            email = href[7:].split('?')[0].strip()
            if email:
                found.add(email.lower())

    # visible text + raw HTML
    for text in [soup.get_text(), html]:
        for email in re.findall(EMAIL_PATTERN, text, re.IGNORECASE):
            found.add(email.lower())

    return sorted(found)
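The Python function above does not cover the obfuscation tricks handled earlier in JavaScript. A possible Python counterpart is sketched below. It uses html.unescape from the standard library, which decodes all character references rather than only the four handled by hand earlier, and it rewrites bracketed [at] / (dot) forms. The bare " at " rule is left out on purpose to avoid rewriting ordinary prose. The function name extract_obfuscated_emails and the two helper patterns are choices made for this example.

```python
import html as html_lib
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b")
# Matches [at], (at), [ at ] and so on, with optional surrounding spaces.
AT_RE = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)


def extract_obfuscated_emails(raw_html: str) -> list[str]:
    """Decode entities and bracketed at/dot forms, then regex-scan."""
    text = html_lib.unescape(raw_html)   # &#64; -> @, &#x2e; -> ., etc.
    text = AT_RE.sub("@", text)
    text = DOT_RE.sub(".", text)
    return sorted({m.lower() for m in EMAIL_RE.findall(text)})
```

Combining its output with extract_emails_from_html is a matter of taking the union of the two result lists.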
Related tools
- Email & URL Extractor — extract emails from HTML or text online
- Email Extractor in Python — Python extraction guide
- Extract URLs from Text — URL extraction patterns
Written by Mian Ali Khalid. Part of the Dev Productivity pillar.