Email Extractor — How to Pull Email Addresses from Text
An email extractor finds all email addresses embedded in a block of text. Paste a webpage, document, or data dump and get back a clean list of email addresses. The hard part isn’t the concept — it’s the regex pattern that correctly handles the long tail of valid email formats.
Use the Email & URL Extractor to extract emails, URLs, or both from any text.
What makes an email address valid
Email addresses are formally specified in RFC 5321 (SMTP) and RFC 5322 (Internet Message Format). The full spec permits surprisingly complex email addresses:
Local part:
- A-Z, a-z, 0-9
- . ! # $ % & ' * + - / = ? ^ _ ` { | } ~
- Unicode characters in modern extensions (RFC 6531)
@ symbol
Domain part:
- Subdomains separated by dots
- Each label: letters, digits, hyphens (not at start/end)
- TLD: at least 2 characters, can be very long (.photography, .international)
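The domain rules above can be checked with a small helper. This is a minimal sketch, not a full RFC validator: the name is_plausible_domain is hypothetical, the 63-character label limit comes from DNS rather than from the list above, and the letters-only TLD check is a simplification that ignores internationalized (xn--) TLDs.

```python
import re

# One label: letters, digits, hyphens; no hyphen at start or end; max 63 chars.
LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?')
# TLD: letters only, at least 2 (simplified assumption).
TLD = re.compile(r'[A-Za-z]{2,}')

def is_plausible_domain(domain):
    """Check a domain against the label rules (hypothetical helper)."""
    labels = domain.split('.')
    if len(labels) < 2:
        return False
    *rest, tld = labels
    return bool(TLD.fullmatch(tld)) and all(LABEL.fullmatch(label) for label in rest)

print(is_plausible_domain('sub.example.co.uk'))  # True
print(is_plausible_domain('-example.com'))       # False: label starts with a hyphen
print(is_plausible_domain('example..com'))       # False: empty label
```

A check like this is useful as a post-filter, because a permissive extraction pattern can accept domains such as -example.com that break the label rules.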
Examples of valid email addresses:
user@example.com (standard)
user.name@example.com (dot in local)
user+tag@example.com (plus sign — used for tagging)
user@sub.example.com (subdomain)
"user name"@example.com (quoted local with space — technically valid)
user@[192.168.1.1] (IP address domain — valid per spec)
Most email extractors aim for “practical” validation: they catch the vast majority of real-world email addresses and ignore the arcane edge cases, such as quoted local parts and IP address domains.
The email extraction regex
A practical regex for extracting email addresses from text:
[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}
Breaking it down:
- [a-zA-Z0-9._%+\-]+ (local part): alphanumeric, dot, underscore, percent, plus, hyphen
- @ (separator): the @ symbol
- [a-zA-Z0-9.\-]+ (domain): alphanumeric, dot, hyphen
- \. (dot before TLD)
- [a-zA-Z]{2,} (TLD): at least 2 letters
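The same breakdown can be kept next to the pattern itself with Python's re.VERBOSE flag, which ignores whitespace and allows comments inside the regex. This is a sketch of one way to write it; the name EMAIL_PATTERN is an assumption for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r"""
    [a-zA-Z0-9._%+\-]+   # local part: letters, digits, . _ % + -
    @                    # the @ separator
    [a-zA-Z0-9.\-]+      # domain: letters, digits, dot, hyphen
    \.                   # dot before the TLD
    [a-zA-Z]{2,}         # TLD: at least 2 letters
""", re.VERBOSE)

print(EMAIL_PATTERN.findall("Write to first.last@company.co.uk today"))
# ['first.last@company.co.uk']
```

The verbose form matches exactly the same strings as the one-line pattern; it is only easier to review and maintain.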
This pattern handles:
- user@example.com ✓
- firstname.lastname@company.co.uk ✓
- user+filter@gmail.com ✓
- no-reply@notifications.example.com ✓
It does not handle:
- Quoted local parts with spaces: "John Smith"@example.com
- IP address domains: user@[192.168.1.1]
- Unicode email addresses
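For text scraping, this is usually the right tradeoff: these forms are rare in real-world text, and a pattern loose enough to catch them would also produce many more false positives.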
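To see what the tradeoff looks like in practice, the short check below runs the pattern against one handled address and the three unhandled forms. It is an illustrative sketch; the sample addresses are made up.

```python
import re

PATTERN = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

samples = [
    'user+filter@gmail.com',     # handled
    '"John Smith"@example.com',  # quoted local part
    'user@[192.168.1.1]',        # IP address domain
    'müller@example.com',        # non-ASCII local part
]
for sample in samples:
    print(repr(sample), '->', PATTERN.findall(sample))
# 'user+filter@gmail.com' -> ['user+filter@gmail.com']
# '"John Smith"@example.com' -> []
# 'user@[192.168.1.1]' -> []
# 'müller@example.com' -> ['ller@example.com']
```

Note the last case: a Unicode address is not simply skipped. The pattern starts matching after the first non-ASCII character and returns a truncated address, so text that may contain internationalized addresses needs extra care.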
Implementing email extraction in code
Python
import re

def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    return list(set(re.findall(pattern, text)))

text = """
Contact us at support@example.com or sales@company.io.
For billing: billing@example.com (same as support).
"""
emails = extract_emails(text)
# Contains 'support@example.com', 'sales@company.io', 'billing@example.com'
# (a set is unordered, so the list order may vary between runs)
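The set() wrapper removes duplicates, and list() converts the result back to a list. Because sets are unordered, the emails may come back in a different order than they appear in the text.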
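If the order of appearance matters, or if Support@Example.com and support@example.com should count as one address, a dict can do the deduplication instead of a set. The sketch below is one possible variant; the function name extract_emails_ordered is hypothetical, and treating addresses as case-insensitive is an assumption that holds for most mail providers but is not guaranteed by the spec.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

def extract_emails_ordered(text):
    """Return unique emails in first-seen order, compared case-insensitively."""
    seen = {}
    for email in EMAIL_RE.findall(text):
        # setdefault keeps the first spelling seen for each lowercased key
        seen.setdefault(email.lower(), email)
    return list(seen.values())

text = "Mail Support@Example.com, then sales@company.io, then support@example.com."
print(extract_emails_ordered(text))
# ['Support@Example.com', 'sales@company.io']
```

Dicts preserve insertion order in Python 3.7+, which is what makes the result deterministic.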
JavaScript (browser or Node.js)
function extractEmails(text) {
  const pattern = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
  const matches = text.match(pattern) || [];
  return [...new Set(matches)]; // deduplicate
}

const text = `Contact support@example.com or sales@company.io`;
extractEmails(text); // ['support@example.com', 'sales@company.io']
With context (include surrounding text)
import re

def extract_emails_with_context(text, context_chars=50):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    results = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        context = text[start:end].replace('\n', ' ').strip()
        results.append({
            'email': match.group(),
            'context': f'...{context}...'
        })
    return results
Context extraction is useful when you’re processing a large document and need to know where each email appeared.
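When the surrounding text is less important than the position, a line number can serve the same purpose. The following is a minimal sketch under that assumption; extract_emails_with_lines is a hypothetical name, and line numbers are 1-based.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

def extract_emails_with_lines(text):
    """Return (line_number, email) pairs in order of appearance."""
    results = []
    for match in EMAIL_RE.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_no = text.count('\n', 0, match.start()) + 1
        results.append((line_no, match.group()))
    return results

doc = "Header\nContact a@example.com\n\nBilling: b@example.org"
print(extract_emails_with_lines(doc))
# [(2, 'a@example.com'), (4, 'b@example.org')]
```

Counting newlines for every match is simple but rescans the text each time; for very large documents, iterating line by line may be a better fit.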
Extracting URLs from text
URL extraction is similar but requires a different pattern:
https?://[^\s<>"{}|\\^`\[\]]+
This matches HTTP and HTTPS URLs, stopping at whitespace and common delimiters.
URL extraction in Python
import re

def extract_urls(text):
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    urls = re.findall(pattern, text)
    # Clean trailing punctuation that's likely not part of the URL:
    cleaned = [re.sub(r'[.,;:!?)]+$', '', url) for url in urls]
    return list(set(cleaned))

text = """
Visit https://example.com/page or https://docs.example.com/api.
Also check https://github.com/user/repo.
"""
urls = extract_urls(text)
# Contains 'https://example.com/page', 'https://docs.example.com/api',
# 'https://github.com/user/repo' (set order may vary between runs)
The trailing punctuation cleanup handles cases like: “Visit the site (https://example.com).” where the period and parenthesis get included in the match.
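One side effect of stripping every trailing parenthesis is that URLs which legitimately end in one, such as some Wikipedia links, get cut short. The sketch below shows one possible refinement that only strips a closing parenthesis when it has no matching opening parenthesis inside the URL. The helper names clean_url and extract_urls_balanced are hypothetical.

```python
import re

URL_RE = re.compile(r'https?://[^\s<>"{}|\\^`\[\]]+')

def clean_url(url):
    """Strip trailing punctuation, but keep a ')' that closes a '(' in the URL."""
    while url and url[-1] in '.,;:!?)':
        if url[-1] == ')' and url.count('(') >= url.count(')'):
            break  # parentheses are balanced, so treat ')' as part of the URL
        url = url[:-1]
    return url

def extract_urls_balanced(text):
    return [clean_url(url) for url in URL_RE.findall(text)]

text = ("See (https://example.com/a). Also "
        "https://en.wikipedia.org/wiki/Python_(programming_language).")
print(extract_urls_balanced(text))
# ['https://example.com/a', 'https://en.wikipedia.org/wiki/Python_(programming_language)']
```

This is a heuristic, not a parser: it assumes parentheses inside URLs come in matched pairs, which is common but not guaranteed.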
Common edge cases
Obfuscated email addresses
Some websites display emails as:
- user [at] example [dot] com
- user AT example DOT com
- user@example[.]com
- HTML entity encoded: user&#64;example&#46;com
A standard regex won’t match these. To extract obfuscated emails, you need additional pattern matching:
import re

def normalize_email(text):
    # Replace common obfuscation patterns:
    text = re.sub(r'\s*\[at\]\s*', '@', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*\(at\)\s*', '@', text, flags=re.IGNORECASE)
    # Require surrounding whitespace so words like "DATA" are left alone:
    text = re.sub(r'\s+AT\s+', '@', text)
    text = re.sub(r'\s*\[dot\]\s*', '.', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*\(dot\)\s*', '.', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+DOT\s+', '.', text)
    text = re.sub(r'\[\.\]', '.', text)
    return text
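Rewriting the whole text can also alter passages that have nothing to do with email. An alternative is to match the obfuscated forms directly and normalize only what matched. The following is a rough sketch of that idea, not a complete solution: the names AT, DOT, WORD, OBFUSCATED_RE, and extract_obfuscated are all hypothetical, and the pattern covers only the obfuscation styles listed above.

```python
import re

# Separators in plain or obfuscated form. The bare words AT and DOT must be
# uppercase and surrounded by whitespace so ordinary words are not affected.
AT = r'(?:@|\s*[\[(](?i:at)[\])]\s*|\s+AT\s+)'
DOT = r'(?:\.|\s*[\[(](?:(?i:dot)|\.)[\])]\s*|\s+DOT\s+)'
WORD = r'[a-zA-Z0-9_%+\-]+'

OBFUSCATED_RE = re.compile(
    rf'{WORD}(?:{DOT}{WORD})*{AT}{WORD}(?:{DOT}{WORD})*{DOT}[a-zA-Z]{{2,}}'
)

def extract_obfuscated(text):
    """Find plain and obfuscated addresses; return them in normalized form."""
    emails = []
    for match in OBFUSCATED_RE.finditer(text):
        email = re.sub(AT, '@', match.group())
        email = re.sub(DOT, '.', email)
        emails.append(email)
    return emails

text = ("Reach user [at] example [dot] com, admin AT example DOT org "
        "or bob@example[.]net. LOOK AT THIS DATA.")
print(extract_obfuscated(text))
# ['user@example.com', 'admin@example.org', 'bob@example.net']
```

Sentences written entirely in capitals that happen to contain both AT and DOT can still produce false positives, so results from a pattern like this are worth reviewing by hand.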
HTML entities
Email addresses embedded in HTML may use entity encoding:
- &#64; = @
- &#46; = .
Decode HTML entities before running email extraction:
import html
cleaned_text = html.unescape(html_content)
emails = extract_emails(cleaned_text)
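A quick before-and-after check shows why decoding matters. This is an illustrative sketch with a made-up input string; it uses decimal (&#64;, &#46;) and hexadecimal (&#x40;) character references, both of which html.unescape handles.

```python
import html
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

raw = 'Contact: user&#64;example&#46;com or sales&#x40;company.io'

print(EMAIL_RE.findall(raw))
# [] because the raw text contains no literal @ character

decoded = html.unescape(raw)
print(decoded)
# Contact: user@example.com or sales@company.io
print(EMAIL_RE.findall(decoded))
# ['user@example.com', 'sales@company.io']
```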
Email addresses in HTML attributes
An email in href="mailto:user@example.com" will be caught by the standard regex if you extract from the full HTML source. If you extract from rendered text only, mailto: links won’t appear in the text content.
For HTML sources, extract from both the rendered text and the href attributes:
import re
from bs4 import BeautifulSoup

def extract_emails_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    emails = set()
    # From text content (uses extract_emails() defined earlier):
    text = soup.get_text()
    emails.update(extract_emails(text))
    # From mailto: links:
    for tag in soup.find_all('a', href=True):
        href = tag['href']
        if href.lower().startswith('mailto:'):
            email = href[7:].split('?')[0]  # Remove mailto: and query params
            # fullmatch rejects values with trailing junk after the address
            if re.fullmatch(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}', email):
                emails.add(email)
    return list(emails)
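Where BeautifulSoup is not available, the mailto: part can be handled with the standard library alone. The sketch below is one possible approach using html.parser; the class name MailtoCollector and the function extract_mailto_emails are hypothetical. It also covers two cases the single-address check misses: several comma-separated recipients in one link, and percent-encoded addresses such as c%40example.net.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if not href.lower().startswith('mailto:'):
            return
        # Drop the scheme and any ?subject=... query, then percent-decode.
        recipients = unquote(href[len('mailto:'):].split('?', 1)[0])
        for candidate in recipients.split(','):
            candidate = candidate.strip()
            if EMAIL_RE.fullmatch(candidate):
                self.emails.add(candidate)

def extract_mailto_emails(html_content):
    parser = MailtoCollector()
    parser.feed(html_content)
    return sorted(parser.emails)

page = ('<p><a href="mailto:a@example.com,b@example.org?subject=Hi">Mail</a> '
        '<a href="MAILTO:c%40example.net">C</a> <a href="/about">About</a></p>')
print(extract_mailto_emails(page))
# ['a@example.com', 'b@example.org', 'c@example.net']
```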
Domain filtering and deduplication
In practice, you often want to filter or group extracted emails:
from collections import defaultdict

def group_by_domain(emails):
    domains = defaultdict(list)
    for email in emails:
        domain = email.split('@')[1].lower()
        domains[domain].append(email.lower())
    return dict(domains)

emails = ['alice@example.com', 'bob@example.com', 'carol@company.io']
grouped = group_by_domain(emails)
# {'example.com': ['alice@example.com', 'bob@example.com'], 'company.io': ['carol@company.io']}
Filtering to exclude no-reply and system addresses:
import re

def filter_real_emails(emails):
    exclude_patterns = [
        r'^no.?reply@',   # matches noreply@, no-reply@, no_reply@
        r'^postmaster@',
        r'^webmaster@',
        r'^admin@',
        r'^info@',        # remove this line to keep info@ addresses
    ]
    combined_pattern = '|'.join(exclude_patterns)
    return [e for e in emails if not re.match(combined_pattern, e, re.IGNORECASE)]
Legal considerations
When email extraction is appropriate:
- Your own documents, emails, or databases
- Publicly shared contact directories you have permission to process
- Your own sent emails or CRM data
When email extraction is not appropriate:
- Scraping websites to build marketing lists without explicit permission
- Extracting emails from data you don’t have authorization to process
- Building contact lists for unsolicited commercial email (spam)
CAN-SPAM (US), GDPR (EU), and CASL (Canada) all impose requirements on commercial email. Extracting emails from a competitor’s website and sending them marketing emails is both legally risky and ethically problematic.
Using the Email & URL Extractor
The Email & URL Extractor processes text client-side — paste your text, choose email extraction, URL extraction, or both, and download the deduplicated list. No data is sent to a server.
Related tools
- Email & URL Extractor — extract emails and URLs from text
- Regex Tester — build and test regex patterns for custom extraction
- Text Diff — compare two extracted lists
Written by Mian Ali Khalid. Part of the Dev Productivity pillar.