Email & URL Extractor

Pull every email address or URL from a chunk of text. Three modes (emails only, URLs only, both), with dedupe, alphabetical sort, and lowercase-emails options. Useful for cleaning up exported logs, mailing lists, scraped HTML, or chat transcripts.

How to use the email extractor

  1. Paste your text: drop in any block of text, such as a web page source, a CSV export, a log file, a chat transcript, or raw HTML. The tool accepts any plain-text input; there is no hard size limit, but performance degrades above roughly 10 MB.
  2. Pick your mode: choose "Emails only" to pull email addresses, "URLs only" for links and web addresses, or "Both" to extract everything in one pass.
  3. Copy the results: click Copy to put the deduplicated, sorted list on your clipboard. One result per line, ready to paste into a spreadsheet, import into an email client, or pipe into another tool.

What the email extractor finds

The email pattern targets the format local-part@domain.tld. It catches:

What it intentionally skips: RFC 5322 quoted local parts like "user name"@example.com are technically valid but essentially never appear in real-world data. Supporting them would add significant complexity for near-zero practical benefit.
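
As a rough illustration of the local-part@domain.tld approach, here is a minimal sketch. The pattern is an assumption modeled on the command-line example in the FAQ, not the tool's exact regex, and extractEmails is a hypothetical helper name.

```javascript
// Assumed pattern: an ASCII local part, "@", a dotted domain, and a
// TLD of two or more letters. The tool's real pattern may differ.
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Hypothetical helper: returns every match in order of appearance,
// duplicates included.
function extractEmails(text) {
  return text.match(EMAIL_PATTERN) ?? [];
}

const sample =
  'Contact jane.doe+news@mail.example.org or "user name"@example.com.';
console.log(extractEmails(sample));
// → [ 'jane.doe+news@mail.example.org' ]
```

The quoted local part is skipped because the double quote is not in the allowed character set, so nothing valid sits directly before that "@".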

What the URL extractor finds

The URL extractor matches two categories:

For text extracted from HTML, the best approach is to paste the raw HTML source rather than rendered text — href values are unambiguous and the extractor will find them cleanly.
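
To see why raw HTML is easier to work with than rendered text, consider this minimal sketch. It assumes only explicit http:// and https:// URLs matter, which may be narrower than what the tool matches. URL_PATTERN and extractUrls are hypothetical names.

```javascript
// Assumption: only URLs with an explicit http:// or https:// scheme.
// The match stops at whitespace, quotes, and angle brackets.
const URL_PATTERN = /https?:\/\/[^\s"'<>]+/g;

// Hypothetical helper: finds URLs, then trims trailing punctuation
// that usually belongs to the surrounding sentence.
function extractUrls(text) {
  const matches = text.match(URL_PATTERN) ?? [];
  return matches.map((url) => url.replace(/[.,;:!?)]+$/, ''));
}

const html =
  'See <a href="https://example.com/docs?page=2">docs</a> and http://example.org/path.';
console.log(extractUrls(html));
// → [ 'https://example.com/docs?page=2', 'http://example.org/path' ]
```

The href value ends cleanly at its closing quote. The URL in running text needs the punctuation trim, which is a heuristic: it would also strip a closing parenthesis that genuinely belongs to a URL.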

Common use cases

Deduplication and sorting

Real-world text almost always contains the same email address or URL multiple times. A contact page might list the same support email in the header, footer, and body. A log file might reference the same API endpoint thousands of times.

This tool deduplicates automatically: one email that appears in 15 places becomes one line in the output. Emails are lowercased before deduplication (User@Example.com and user@example.com are treated as the same address), then sorted alphabetically. URLs preserve their original case but are deduplicated on exact match.

The result is a minimal, clean list — useful as the starting point for an import, a validation run, or a manual review.
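
The behavior described above can be sketched in a few lines. The function names are hypothetical, and the default sort() comparator is an assumption; the tool may order results differently.

```javascript
// Emails: lowercase first so that case variants collapse into one entry.
function dedupeEmails(emails) {
  const unique = new Set(emails.map((email) => email.toLowerCase()));
  return [...unique].sort();
}

// URLs: keep original case and deduplicate on exact match only.
function dedupeUrls(urls) {
  return [...new Set(urls)].sort();
}

console.log(
  dedupeEmails(['User@Example.com', 'user@example.com', 'bob@example.com'])
);
// → [ 'bob@example.com', 'user@example.com' ]

console.log(
  dedupeUrls([
    'https://Example.com/A',
    'https://example.com/a',
    'https://Example.com/A',
  ])
);
// → [ 'https://Example.com/A', 'https://example.com/a' ]
```

The default sort() compares UTF-16 code units, so uppercase letters sort before lowercase ones in the URL list.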

Email validation vs email extraction

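These are two different problems that developers frequently confuse: extraction finds strings in text that look like email addresses, while validation checks whether an address can actually receive mail.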

An extracted address like fake123@disposable.ninja passes the extraction regex but may bounce on send. An address like user@company.com looks valid and may also bounce if the mailbox doesn't exist. If your use case requires confidence that addresses are live, extraction is only the first step — you'll need a validation layer.
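
One possible shape for a first validation layer is sketched below, under stated assumptions. checkAddress is a hypothetical function, and domainAcceptsMail is a hypothetical async callback you would supply yourself (for example, one backed by a DNS MX lookup). Neither is part of this tool.

```javascript
// Same assumed ASCII shape as an extraction pattern, anchored to the whole string.
const EMAIL_SHAPE = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

// `domainAcceptsMail` is caller-supplied: async (domain) => boolean.
async function checkAddress(address, domainAcceptsMail) {
  if (!EMAIL_SHAPE.test(address)) {
    return { address, status: 'malformed' };
  }
  const domain = address.slice(address.lastIndexOf('@') + 1).toLowerCase();
  const ok = await domainAcceptsMail(domain);
  // 'domain-ok' only means the domain looks able to receive mail;
  // the individual mailbox may still not exist.
  return { address, status: ok ? 'domain-ok' : 'no-mail-domain' };
}

// Usage with a stand-in lookup that knows a single domain.
const fakeLookup = async (domain) => domain === 'company.com';
checkAddress('user@company.com', fakeLookup).then((result) =>
  console.log(result.status)
);
```

In Node.js the callback could wrap dns.promises.resolveMx. Confirming that a specific mailbox exists takes more than a DNS check.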

Working with extracted emails — legal context

Email addresses are personal data under GDPR in the EU and similar regulations globally. A few important points for developers:

Technical notes

The tool uses JavaScript regex patterns applied entirely in your browser. No text is sent to any server. The email pattern is a pragmatic approximation of RFC 5322 — full RFC 5322 compliance would require a parser, not a regex, and would match many formats that don't exist in real data. The URL pattern prioritizes precision (low false positives) over recall (catching every possible URL format). Both patterns have been tested against large real-world datasets of logs, HTML exports, and mailing list backups.

FAQ

Is my text sent to a server?

No. Extraction runs entirely in your browser via JavaScript. Your text never leaves your machine. You can disconnect from the internet after the page loads and the tool still works.

Does it support international email addresses (IDN)?

Not with the current pattern. Internationalized domain names (IDNs) such as user@münchen.de, and addresses with Unicode in the local part, are not matched. The pattern targets ASCII addresses, which covers the vast majority of real-world usage. Non-ASCII addresses are valid under the RFC 6530 framework but uncommon in practice: an internationalized domain must be encoded as punycode for DNS, and a Unicode local part needs SMTPUTF8 support that many mail servers still lack.

What is the maximum text size?

There is no hard enforced limit, but performance degrades above ~10 MB of input text. Processing time grows linearly with input size, so very large inputs (multi-hundred-MB log files) will freeze the UI for several seconds or more. For files that large, use a command-line tool: grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt | sort -u extracts and deduplicates emails without loading the file into a browser, and is typically much faster.
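
If you would rather script this in JavaScript than use grep, one option is to scan line by line and keep only the unique addresses in memory. This is a minimal sketch with an assumed pattern and a hypothetical collectEmails helper. It relies on the fact that no match spans a line break.

```javascript
// Assumed pattern, equivalent to the grep expression above.
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Accepts any synchronous iterable of strings (an array of lines, a generator).
function collectEmails(lines) {
  const seen = new Set();
  for (const line of lines) {
    for (const match of line.matchAll(EMAIL_PATTERN)) {
      seen.add(match[0].toLowerCase());
    }
  }
  return [...seen].sort();
}

const logLines = [
  '2026-01-02 login ok for Ops@Example.com',
  '2026-01-02 bounce: ops@example.com, billing@example.com',
  '2026-01-03 no address on this line',
];
console.log(collectEmails(logLines));
// → [ 'billing@example.com', 'ops@example.com' ]
```

Unlike sort -u, this sketch lowercases addresses before deduplicating, matching the behavior described for the browser tool.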

Written by Mian Ali Khalid. Last updated 2026-05-13.