CSV Quoting and Escaping Rules (the Real Ones, Not the Folklore)
CSV looks trivial until your spreadsheet has a comma in a name field. Here are the actual RFC 4180 rules, the dialects everyone forgets, and how to stop breaking imports.
A user submits a support ticket: “My import broke.” You open their CSV file. There’s a row that starts with "Smith, John" and another that ends with a quote inside a name. Three columns disappeared. Two extra columns appeared from nowhere. The export tool said it was UTF-8 but the BOM at the start says otherwise. Welcome to CSV.
CSV is the format everyone underestimates. It looks like a few commas separating values. It’s actually a format that’s been re-invented by every spreadsheet, every database export, and every legacy tool, with subtly incompatible rules for quoting, escaping, line endings, and encoding. Get any of them wrong and the import silently produces garbage that looks correct until someone notices six weeks later.
This post is the real rulebook for CSV: what the RFC says, what the dialects actually do, and how to write code that survives meeting other people’s CSVs.
What does RFC 4180 say about CSV quoting?
RFC 4180 defines CSV quoting in three rules: fields containing commas, double quotes, or line breaks must be wrapped in double quotes; literal double quotes inside a quoted field are escaped by doubling them (""); and line endings between records are CRLF. These three rules cover most cases. The dialects below cover the rest.
The RFC itself is short. The full canonical example:
"Smith, John",30,"He said ""hello"""
"Doe, Jane",25,"Line 1
Line 2"
Three fields per row. The first field has a comma so it’s quoted. The third field on row 1 has internal quotes so they’re doubled. The third field on row 2 contains a literal newline, so the entire field is quoted to keep the newline part of the field rather than ending the record.
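To see those rules applied, here is a minimal sketch that feeds the example above to Python's built-in csv module. The variable names are illustrative, and the input is held in a string rather than read from a file.

```python
import csv
import io

# The RFC-style example from above, with CRLF between records.
raw = (
    '"Smith, John",30,"He said ""hello"""\r\n'
    '"Doe, Jane",25,"Line 1\r\nLine 2"\r\n'
)

# csv.reader tracks quote state, so the line break inside the last
# field of record 2 stays part of that field.
rows = list(csv.reader(io.StringIO(raw)))

for row in rows:
    print(row)
```

Two records come back, each with three fields. The doubled quotes collapse to single quotes, and the multi-line field keeps its internal line break.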
The five rules everyone gets wrong
1. Quote the whole field, not just the special character
Smith", "John,30
Wrong. The field needs to be quoted from the first character to the last:
"Smith, John",30
Half-quoting is the most common bug in hand-rolled CSV writers. The parser splits at the comma, so Smith" and "John land in separate fields with stray quote characters attached, and a strict parser rejects the row outright.
2. Escape internal quotes by doubling, not backslashing
"He said \"hello\""
Wrong (this is C-style escaping). The CSV way:
"He said ""hello"""
Doubled quotes inside a quoted field collapse to a single quote at parse time. There is no backslash-escape in canonical CSV. Some dialects accept it; the standard does not.
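Rules 1 and 2 are what a library writer applies for you. A small sketch with Python's csv.writer, using its default settings and an in-memory buffer (the sample values are made up):

```python
import csv
import io

buf = io.StringIO(newline="")
writer = csv.writer(buf)  # default "excel" dialect: quote only when needed
writer.writerow(["Smith, John", 30, 'He said "hello"'])

line = buf.getvalue()
print(line)  # "Smith, John",30,"He said ""hello"""

# Reading the line back collapses the doubled quotes again.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['Smith, John', '30', 'He said "hello"']
```

The number 30 is written unquoted because it contains nothing special, and it comes back as the string '30' because CSV carries no types.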
3. CRLF between records, not LF
The RFC specifies \r\n between records. In practice most parsers accept \n too because Unix tools generated CSV without CR for decades. But if you’re writing a spec-compliant CSV that needs to round-trip through Microsoft Excel or older enterprise tools, use CRLF.
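In Python the record terminator is a writer setting. The sketch below uses a hypothetical helper, to_csv, to show both choices. When writing to a real file, the csv documentation recommends opening it with newline="" so the platform does not translate the terminator a second time.

```python
import csv
import io

def to_csv(rows, lineterminator="\r\n"):
    """Serialize rows to CSV text with an explicit record terminator."""
    buf = io.StringIO(newline="")  # newline="" disables newline translation
    csv.writer(buf, lineterminator=lineterminator).writerows(rows)
    return buf.getvalue()

rows = [["a", "b"], ["c", "d"]]
print(repr(to_csv(rows)))        # 'a,b\r\nc,d\r\n'
print(repr(to_csv(rows, "\n")))  # 'a,b\nc,d\n'
```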
4. Quoted fields can contain literal newlines
"Multi
line
field",next
This is one row, two fields. The first field’s value is literally Multi\nline\nfield. The newlines inside a quoted field are not record separators. Parsers that split on \n first and parse fields after will mangle this — and they’re surprisingly common.
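The difference is easy to reproduce. This sketch compares a split-on-newline approach with Python's csv.reader on the same three-line input:

```python
import csv
import io

raw = '"Multi\nline\nfield",next\n'

# Naive approach: split into lines first, then split each line on commas.
naive = [line.split(",") for line in raw.splitlines()]
print(naive)   # [['"Multi'], ['line'], ['field"', 'next']]

# Quote-aware approach: one record, two fields.
proper = list(csv.reader(io.StringIO(raw)))
print(proper)  # [['Multi\nline\nfield', 'next']]
```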
5. The first row is sometimes a header, sometimes not
The RFC says headers are optional. Some tools require them, some prohibit them, some assume them. There’s no in-band signal that says “this row is a header.” The agreement is out-of-band: the API doc says “expect a header row,” or the importer toggles a checkbox. Always document.
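One way to keep that agreement visible in code is to make it a required argument instead of a guess. A minimal sketch, where read_table and its has_header parameter are hypothetical names:

```python
import csv
import io

def read_table(text, has_header):
    """Return (header, rows); the caller must state whether a header exists."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows

header, data = read_table("name,age\r\nAda,36\r\n", has_header=True)
print(header, data)  # ['name', 'age'] [['Ada', '36']]
```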
The dialects nobody tells you about
In the wild, you’ll see all of these:
TSV (tab-separated values). Delimiter is \t. Quoting is rarely needed because tabs in fields are rare. Common in scientific data, NCBI dumps, log exports.
Semicolon-delimited “CSV”. Standard in much of Europe and in CSVs exported from Excel running under European locales. Why? Because the locale uses , as the decimal separator, so , can’t be the field separator without ambiguity. Excel’s behavior here is locale-dependent — the same file opens differently on a Spanish vs an American machine. This is a real source of cross-team breakage.
Pipe-delimited. Old-school finance and telecom exports. Delimiter is |. Same rules as comma-CSV otherwise.
Tab + quote + JSON-style escape. Some Hadoop/Hive pipelines use tab as delimiter and backslash-escape inside fields. Not RFC 4180. If you’re handling Hive output, expect this.
Microsoft “Save as CSV” with BOM. Excel writes a UTF-8 BOM (\xEF\xBB\xBF) at the start of the file. Some parsers see the BOM as part of the first field name (\ufeffname instead of name). Strip the BOM on read.
RFC 4180 strict vs lenient parsers. Strict parsers reject \n-only line endings and unquoted fields with embedded quotes. Lenient parsers accept everything and try to do the right thing. Test your parser against malformed input and pick the behavior you want.
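Two of the points above can be sketched with Python's csv module: choosing the delimiter explicitly, and opting in to stricter parsing. The sample data and the parses_strictly helper are made up for illustration. Note that Python's strict=True only rejects malformed quoting; it still accepts \n-only line endings.

```python
import csv
import io

# Hypothetical export from a European-locale spreadsheet: ';' separates
# fields because ',' is the decimal separator.
raw = 'name;price\r\n"Müller; Söhne";12,50\r\n'

rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
print(rows)  # [['name', 'price'], ['Müller; Söhne', '12,50']]

# Read with the default comma delimiter, the header is a single field.
misread = list(csv.reader(io.StringIO(raw)))
print(misread[0])  # ['name;price']

def parses_strictly(line):
    """True if the line parses with strict=True, False if csv.Error is raised."""
    try:
        list(csv.reader(io.StringIO(line), strict=True))
        return True
    except csv.Error:
        return False

print(parses_strictly('a,"b",c\r\n'))  # True
print(parses_strictly('a,"b"c\r\n'))   # False: text after a closing quote
```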
Encoding: where most CSV bugs actually live
Half the CSV bugs in production aren’t about quoting. They’re about character encoding.
The setup: A file is exported from a Windows tool in windows-1252 (Latin-1 with extras). It’s transferred to a Linux server that assumes UTF-8. A é becomes é. The data isn’t broken — it’s mis-decoded.
The fix: Always declare encoding. Always require UTF-8 for new pipelines. When you receive CSV, look at the bytes: a UTF-8 é is 0xC3 0xA9, a Windows-1252 é is 0xE9, a Latin-1 é is 0xE9. If you see a single byte 0xE9 standing where a é should be, you’re not in UTF-8.
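The byte values are quick to check in a Python session. The sketch below encodes é three ways and then mis-decodes it in both directions:

```python
text = "é"

utf8 = text.encode("utf-8")
cp1252 = text.encode("cp1252")
latin1 = text.encode("latin-1")
print(utf8, cp1252, latin1)  # b'\xc3\xa9' b'\xe9' b'\xe9'

# UTF-8 bytes read as windows-1252 produce the classic mojibake.
mojibake = utf8.decode("cp1252")
print(mojibake)  # Ã©

# A windows-1252 byte read as UTF-8 is simply invalid.
try:
    cp1252.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```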
The BOM: A UTF-8 BOM (\xEF\xBB\xBF) at the start of a file is a hint to Windows tools that the file is UTF-8. It is not part of the data. Strip it on read.
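In Python, the utf-8-sig codec strips a leading BOM if one is present and otherwise behaves like utf-8. A minimal sketch with made-up data:

```python
import csv
import io

data = b'\xef\xbb\xbfname,email\r\nAda,ada@example.com\r\n'

# Plain utf-8 keeps the BOM as an invisible character on the first header.
with_bom = next(csv.reader(io.StringIO(data.decode("utf-8"))))
print(repr(with_bom[0]))  # '\ufeffname'

# utf-8-sig removes it.
without_bom = next(csv.reader(io.StringIO(data.decode("utf-8-sig"))))
print(repr(without_bom[0]))  # 'name'
```

For files on disk, the same codec can be passed as open(path, encoding="utf-8-sig", newline="").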
When ingesting unknown CSV, run chardet (Python) or enca (CLI) to guess the encoding. Then explicitly decode with that encoding. Never let your parser guess silently.
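When a detector is not available, one stdlib-only stand-in is to try UTF-8 first and fall back to a single declared legacy encoding, reporting which one was used. This is a sketch, not a replacement for chardet: decode_csv_bytes is a hypothetical helper, and the cp1252 fallback is an assumption about where the files come from.

```python
def decode_csv_bytes(data, fallback="cp1252"):
    """Decode bytes and return (text, encoding_used) so the choice is visible."""
    try:
        # utf-8-sig also strips a leading BOM if one is present.
        return data.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        # Assumed legacy encoding; this raises too if the bytes do not fit it.
        return data.decode(fallback), fallback

print(decode_csv_bytes("café".encode("utf-8")))  # ('café', 'utf-8-sig')
print(decode_csv_bytes(b"caf\xe9"))              # ('café', 'cp1252')
```

Returning the encoding alongside the text lets the caller log it, which keeps the guess from being silent.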
Header handling: the column-name trap
CSV headers feel safe until you hit:
- Two columns with the same name (name,name): most parsers silently overwrite
- Headers with leading/trailing whitespace (" email"): your code looks for email and finds nothing
- Headers with case differences (Email vs email): case sensitivity is parser-dependent
- Headers with newlines or commas inside them: yes, this is legal if quoted
Always normalize headers on read: trim whitespace, lowercase, replace whitespace with underscores. Reject duplicates explicitly rather than letting one column silently overwrite the other. Those few lines of code save weeks.
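A minimal sketch of that normalization, assuming the header row has already been parsed into a list of strings. normalize_headers is a hypothetical helper name.

```python
import re

def normalize_headers(headers):
    """Trim, lowercase, and underscore header names; reject duplicates."""
    seen = set()
    result = []
    for raw in headers:
        # Collapse any run of whitespace (including newlines) to one underscore.
        name = re.sub(r"\s+", "_", raw.strip().lower())
        if name in seen:
            raise ValueError(f"duplicate header after normalization: {name!r}")
        seen.add(name)
        result.append(name)
    return result

print(normalize_headers([" Email", "First Name"]))  # ['email', 'first_name']
```

Duplicates are checked after normalization, so name and "Name " count as a collision.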
Programmatic CSV: use the standard library, never roll your own
Every mainstream language has a CSV module that handles the rules above:
- Python: csv (built-in). csv.DictReader and csv.DictWriter for header-aware reading and writing. Set dialect='excel' for default, dialect='excel-tab' for TSV, or define a custom one for pipe/semicolon.
- JavaScript/Node: csv-parse and csv-stringify (the csv package). Web: Papa Parse.
- Go: encoding/csv (built-in). Configurable Comma and LazyQuotes.
- Rust: csv crate (Serde-compatible bindings).
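For the Python entry, a short sketch of the header-aware classes with a custom dialect: it reads a pipe-delimited input and rewrites it as comma-delimited CSV. PipeDialect and the sample data are illustrative, not part of the csv module.

```python
import csv
import io

class PipeDialect(csv.excel):
    """Same rules as the default excel dialect, but '|' as the delimiter."""
    delimiter = "|"

raw = 'id|company\r\n1|Acme, Inc.\r\n'

records = list(csv.DictReader(io.StringIO(raw), dialect=PipeDialect))
print(records)  # [{'id': '1', 'company': 'Acme, Inc.'}]

out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=["id", "company"])
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

The company name needed no quoting in the pipe-delimited input, but the writer quotes it in the comma-delimited output because it now contains the delimiter.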
The single most common source of CSV bugs is teams writing their own parser with text.split(','). It works on the test file. It fails on real data the first time someone has a comma in their company name. The CSV module is in your standard library for a reason — use it.
When CSV is the wrong format
CSV is great for:
- Spreadsheet round-trips
- Bulk data exchange between databases
- Simple list/table exports for non-technical users
CSV is bad for:
- Nested data (no native nesting; you flatten and lose structure)
- Sparse data (missing fields force you to choose: empty string, skipped column, or sentinel)
- Data with mixed types (everything is text until you parse it)
- Anything you’d want to validate against a schema
If you find yourself encoding nested data into CSV with delimiter conventions like ; for sub-items, you’ve outgrown CSV. Move to JSON or JSONL and use structural comparison tools. Even a one-document-per-line JSONL stream is more maintainable than over-stretched CSV.
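As a sketch of that move, the snippet below turns a CSV column that packs sub-items with ; into a real list and emits one JSON document per line. The column names, the sample data, and the to_jsonl helper are hypothetical.

```python
import csv
import io
import json

raw = 'order_id,items\r\n1001,apple;pear\r\n1002,plum\r\n'

def to_jsonl(text, list_column, sep=";"):
    """Convert CSV text to JSONL, splitting one packed column into a list."""
    lines = []
    for record in csv.DictReader(io.StringIO(text)):
        packed = record[list_column]
        record[list_column] = packed.split(sep) if packed else []
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

print(to_jsonl(raw, "items"))
```

Each output line parses on its own with json.loads, and the nesting is explicit in the data instead of living in a delimiter convention.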
For ad-hoc conversion, the CSV to JSON tool handles the common dialects (comma, semicolon, tab, pipe) and gets the quoting rules right.
Common-cases summary
| Field content | CSV representation |
|---|---|
| Plain text | John |
| Contains comma | "Smith, John" |
| Contains double quote | "He said ""hi""" |
| Contains newline | "Line 1\nLine 2" (whole field quoted) |
| Empty | (nothing, or "" to be explicit) |
| Just whitespace | " " (or it gets stripped) |
A working principle
CSV looks simple. It is not simple. Every “easy” CSV problem turns into a quoting bug, an encoding bug, or a dialect bug, and these bugs have a way of staying hidden in production for weeks. The defense is the same in every case: use the standard library’s CSV module, declare encoding explicitly, normalize headers on read, and test against ugly real-world files, not the clean ones you generate yourself.
Treat CSV like JSON or any other format: structured data with rules. The rules are written down. Read them.
Further reading
- YAML vs JSON: Which to Use When — when CSV doesn’t fit
- XML Still Matters in 2026 — the format that did stay strict
- Comparing JSON Structurally — when you upgrade from CSV
- RFC 4180 — CSV specification
- Frictionless Data — CSV dialect description
- Python csv module docs
Written by Mian Ali Khalid. Part of the Data & Format pillar.