
Email Extractor — How to Pull Email Addresses from Text

Mian Ali Khalid · 6 min read
Use the tool
Email & URL Extractor
Extract every email address and URL from a block of text. Regex-based, case-insensitive, deduplicated, sorted output.
Open Email & URL Extractor →

An email extractor finds all email addresses embedded in a block of text. Paste a webpage, document, or data dump and get back a clean list of email addresses. The hard part isn’t the concept — it’s the regex pattern that correctly handles the long tail of valid email formats.

Use the Email & URL Extractor to extract emails, URLs, or both from any text.

What makes an email address valid

The formal specification for email addresses is RFC 5321 (SMTP) and RFC 5322 (Internet Message Format). The full spec permits surprisingly complex email addresses:

Local part:
- A-Z, a-z, 0-9
- . ! # $ % & ' * + - / = ? ^ _ ` { | } ~
- Unicode characters in modern extensions (RFC 6531)

@ symbol

Domain part:
- Subdomains separated by dots
- Each label: letters, digits, hyphens (not at start/end)
- TLD: at least 2 characters, can be very long (.photography, .international)

Examples of valid email addresses:
user@example.com          (standard)
user.name@example.com     (dot in local)
user+tag@example.com      (plus sign — used for tagging)
user@sub.example.com      (subdomain)
"user name"@example.com   (quoted local with space — technically valid)
user@[192.168.1.1]        (IP address domain — valid per spec)

Most email extractors aim for “practical” validation — they catch the 99% of real-world email addresses and ignore the arcane edge cases.

The email extraction regex

A practical regex for extracting email addresses from text:

[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}

Breaking it down:

  • [a-zA-Z0-9._%+\-]+ — local part: alphanumeric, dot, underscore, percent, plus, hyphen
  • @ — the @ symbol
  • [a-zA-Z0-9.\-]+ — domain: alphanumeric, dot, hyphen
  • \. — dot before TLD
  • [a-zA-Z]{2,} — TLD: at least 2 letters

This pattern handles:

  • user@example.com
  • firstname.lastname@company.co.uk
  • user+filter@gmail.com
  • no-reply@notifications.example.com

It does not handle:

  • Quoted local parts with spaces: "John Smith"@example.com
  • IP address domains: user@[192.168.1.1]
  • Unicode email addresses

For text scraping, this is the right tradeoff.
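A quick sanity check of that tradeoff (a sketch using re.fullmatch, not part of the tool itself) shows the practical pattern accepting everyday addresses and rejecting the arcane ones:

```python
import re

# The practical pattern from above, compiled once:
PRACTICAL = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

# fullmatch asks: is the whole string a practical email address?
print(bool(PRACTICAL.fullmatch('user+filter@gmail.com')))    # True
print(bool(PRACTICAL.fullmatch('"user name"@example.com')))  # False (quoted local part)
print(bool(PRACTICAL.fullmatch('user@[192.168.1.1]')))       # False (IP address domain)
```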

Implementing email extraction in code

Python

import re

def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    return list(set(re.findall(pattern, text)))

text = """
Contact us at support@example.com or sales@company.io.
For billing: billing@example.com (same as support).
"""

emails = extract_emails(text)
# e.g. ['support@example.com', 'sales@company.io', 'billing@example.com'] (in arbitrary order)

The set() wrapper removes duplicates and list() converts the result back to a list. Because sets are unordered, the output order varies between runs.
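If first-occurrence order matters, a variant (my addition, not the article's function) can deduplicate with dict.fromkeys, which preserves insertion order:

```python
import re

def extract_emails_ordered(text):
    # Same pattern as above; dict.fromkeys deduplicates while keeping
    # the order in which each email first appeared in the text.
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    return list(dict.fromkeys(re.findall(pattern, text)))

text = "a@example.com, b@example.com, a@example.com"
print(extract_emails_ordered(text))  # ['a@example.com', 'b@example.com']
```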

JavaScript (browser or Node.js)

function extractEmails(text) {
  const pattern = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;
  const matches = text.match(pattern) || [];
  return [...new Set(matches)]; // deduplicate
}

const text = `Contact support@example.com or sales@company.io`;
extractEmails(text); // ['support@example.com', 'sales@company.io']

With context (include surrounding text)

import re

def extract_emails_with_context(text, context_chars=50):
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    results = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        context = text[start:end].replace('\n', ' ').strip()
        results.append({
            'email': match.group(),
            'context': f'...{context}...'
        })
    return results

Context extraction is useful when you’re processing a large document and need to know where each email appeared.
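For example (the function is repeated here so the snippet runs standalone):

```python
import re

def extract_emails_with_context(text, context_chars=50):
    # Repeated from above so this snippet runs on its own.
    pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    results = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        context = text[start:end].replace('\n', ' ').strip()
        results.append({'email': match.group(), 'context': f'...{context}...'})
    return results

doc = "For refunds, write to billing@example.com within 30 days."
for hit in extract_emails_with_context(doc, context_chars=20):
    print(hit['email'])  # billing@example.com
```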

Extracting URLs from text

URL extraction is similar but requires a different pattern:

https?://[^\s<>"{}|\\^`\[\]]+

This matches HTTP and HTTPS URLs, stopping at whitespace and common delimiters.

URL extraction in Python

import re

def extract_urls(text):
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    urls = re.findall(pattern, text)
    # Clean trailing punctuation that's likely not part of the URL:
    cleaned = [re.sub(r'[.,;:!?)]+$', '', url) for url in urls]
    return list(set(cleaned))

text = """
Visit https://example.com/page or https://docs.example.com/api.
Also check https://github.com/user/repo.
"""

urls = extract_urls(text)
# e.g. ['https://example.com/page', 'https://docs.example.com/api', 'https://github.com/user/repo'] (in arbitrary order)

The trailing punctuation cleanup handles cases like: “Visit the site (https://example.com).” where the period and parenthesis get included in the match.
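To confirm, here is a condensed run (a sketch using the same pattern and cleanup) against exactly that sentence:

```python
import re

def extract_urls_clean(text):
    # Same URL pattern and trailing-punctuation cleanup as above, condensed.
    urls = re.findall(r'https?://[^\s<>"{}|\\^`\[\]]+', text)
    return [re.sub(r'[.,;:!?)]+$', '', u) for u in urls]

print(extract_urls_clean('Visit the site (https://example.com).'))
# ['https://example.com']
```

The opening parenthesis never enters the match because the pattern anchors on `https?://`; only the trailing `).` needs stripping.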

Common edge cases

Obfuscated email addresses

Some websites display emails as:

  • user [at] example [dot] com
  • user AT example DOT com
  • user@example[.]com
  • HTML entity encoded: user&#64;example&#46;com

A standard regex won’t match these. To extract obfuscated emails, you need additional pattern matching:

import re

def normalize_email(text):
    # Replace common obfuscation patterns:
    text = re.sub(r'\s*\[at\]\s*', '@', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*\(at\)\s*', '@', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+AT\s+', '@', text)   # require spaces so "AT" inside a word is untouched
    text = re.sub(r'\s*\[dot\]\s*', '.', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*\(dot\)\s*', '.', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+DOT\s+', '.', text)  # handles "user AT example DOT com"
    text = re.sub(r'\[\.\]', '.', text)
    return text
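A standalone demo of the idea (condensed into a loop; bare AT/DOT require surrounding spaces so ordinary words are not rewritten):

```python
import re

def normalize_email(text):
    # Condensed obfuscation cleanup: bracketed forms are case-insensitive,
    # bare AT/DOT must be space-separated uppercase tokens.
    text = re.sub(r'\s*[\[\(]at[\]\)]\s*', '@', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*[\[\(]dot[\]\)]\s*', '.', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+AT\s+', '@', text)
    text = re.sub(r'\s+DOT\s+', '.', text)
    text = re.sub(r'\[\.\]', '.', text)
    return text

print(normalize_email('user [at] example [dot] com'))  # user@example.com
print(normalize_email('user AT example DOT com'))      # user@example.com
print(normalize_email('user@example[.]com'))           # user@example.com
```

Run normalize_email over the text first, then apply the standard extraction regex.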

HTML entities

Email addresses embedded in HTML may use entity encoding:

  • &#64; = @
  • &#46; = .

Strip HTML entities before running email extraction:

import html

cleaned_text = html.unescape(html_content)
emails = extract_emails(cleaned_text)

Email addresses in HTML attributes

An email in href="mailto:user@example.com" will be caught by the standard regex if you extract from the full HTML source. If you extract from rendered text only, mailto: links won’t appear in the text content.

For HTML sources, extract from both the rendered text and the href attributes:

import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_emails_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    emails = set()
    
    # From text content:
    text = soup.get_text()
    emails.update(extract_emails(text))
    
    # From mailto: links:
    for tag in soup.find_all('a', href=True):
        href = tag['href']
        if href.startswith('mailto:'):
            email = href[7:].split('?')[0]  # Remove mailto: and query params
            if re.match(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}', email):
                emails.add(email)
    
    return list(emails)

Domain filtering and deduplication

In practice, you often want to filter or group extracted emails:

from collections import defaultdict

def group_by_domain(emails):
    domains = defaultdict(list)
    for email in emails:
        domain = email.split('@')[1].lower()
        domains[domain].append(email.lower())
    return dict(domains)

emails = ['alice@example.com', 'bob@example.com', 'carol@company.io']
grouped = group_by_domain(emails)
# {'example.com': ['alice@example.com', 'bob@example.com'], 'company.io': ['carol@company.io']}

Filtering to exclude no-reply and system addresses:

def filter_real_emails(emails):
    exclude_patterns = [
        r'^no.?reply@',   # already matches noreply@, no-reply@, no.reply@
        r'^postmaster@',
        r'^webmaster@',
        r'^admin@',
        r'^info@',        # keep or drop depending on use case
    ]
    
    combined_pattern = '|'.join(exclude_patterns)
    return [e for e in emails if not re.match(combined_pattern, e, re.IGNORECASE)]
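The same filter condensed into one compiled pattern (an equivalent sketch, not a different technique):

```python
import re

def filter_real_emails(emails):
    # One alternation covering the exclusion list above.
    exclude = re.compile(r'^(no.?reply|postmaster|webmaster|admin|info)@', re.IGNORECASE)
    return [e for e in emails if not exclude.match(e)]

emails = ['alice@example.com', 'no-reply@example.com', 'admin@example.com']
print(filter_real_emails(emails))  # ['alice@example.com']
```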

Ethics and legality of email extraction

When email extraction is appropriate:

  • Your own documents, emails, or databases
  • Publicly shared contact directories you have permission to process
  • Your own sent emails or CRM data

When email extraction is not appropriate:

  • Scraping websites to build marketing lists without explicit permission
  • Extracting emails from data you don’t have authorization to process
  • Building contact lists for unsolicited commercial email (spam)

CAN-SPAM (US), GDPR (EU), and CASL (Canada) all impose requirements on commercial email. Extracting emails from a competitor’s website and sending them marketing emails is both legally risky and ethically problematic.

Using the Email & URL Extractor

The Email & URL Extractor processes text client-side — paste your text, choose email extraction, URL extraction, or both, and download the deduplicated list. No data is sent to a server.



Written by Mian Ali Khalid. Part of the Dev Productivity pillar.