X Xerobit

Diff Algorithm Explained — How Text Comparison Tools Work

Text diff tools use the LCS (Longest Common Subsequence) or Myers diff algorithm to find minimal edits between two texts. Here's how diff algorithms work and how to use them in...

Mian Ali Khalid · · 6 min read
Use the tool
Text Diff
Compare two text blocks line-by-line or word-by-word. Unified and split view. Shows added, removed, and changed segments with full color coding.
Open Text Diff →

Text diff algorithms find the minimal set of changes to transform one text into another. They power git diff, code review tools, and merge conflict resolution. Understanding how they work helps you interpret diff output and choose the right tool for your use case.

Use the text compare tool to compare two texts and see every difference highlighted.

The LCS foundation

The core of most diff algorithms is the Longest Common Subsequence (LCS) problem: find the longest sequence of elements that appear in both texts in the same order.

Text A: "ABCBDAB"
Text B: "BDCAB"

LCS: "BCAB" (length 4)

Everything in Text A that’s not in the LCS was deleted. Everything in Text B not in the LCS was inserted. The LCS itself is unchanged.

For text diffing, the “elements” are usually lines (for line-by-line diff) or characters (for character-level diff).

Myers diff algorithm

The Myers diff algorithm (1986) is the foundation of diff, git diff, and most modern diff tools. It finds the shortest edit script — the minimum number of insertions and deletions to transform text A into text B.

It uses a greedy approach with a graph:

  • Each cell (x, y) represents having matched x characters from A and y characters from B
  • Moving right = deleting a character from A
  • Moving down = inserting a character from B
  • Moving diagonally = matching a character (costs nothing)

The algorithm finds the shortest path through this grid from (0,0) to (|A|, |B|).

Why Myers is used everywhere: It’s O(ND) time where N = total characters and D = edit distance. For texts with few changes (common in code commits), D is small and the algorithm is very fast.

Unified diff format

The standard output format for text diffs:

--- file1.txt   2024-03-01 10:00:00
+++ file2.txt   2024-03-04 14:20:00
@@ -1,7 +1,7 @@
 function greet(name) {
-  console.log("Hello, " + name);
+  console.log(`Hello, ${name}!`);
   return true;
 }
 
-// TODO: add goodbye
+// TODO: add goodbye function

Format elements:

  • --- — original file (deleted content)
  • +++ — new file (added content)
  • @@ — hunk header: -1,7 means starting at line 1, showing 7 lines from original; +1,7 same for new
  • (space) — unchanged context line
  • - (minus) — line removed from original
  • + (plus) — line added in new version

Context lines: Unified diff shows 3 context lines by default around each change. This helps locate the change without reading the whole file.

Diff in code

JavaScript (diff library)

import Diff from 'diff';

const oldText = `function greet(name) {
  console.log("Hello, " + name);
  return true;
}`;

const newText = `function greet(name) {
  console.log(\`Hello, \${name}!\`);
  return true;
}`;

// Word-by-word diff:
const wordDiff = Diff.diffWords(oldText, newText);
wordDiff.forEach((part) => {
  const color = part.added ? 'green' : part.removed ? 'red' : 'white';
  console.log(`[${color}] ${part.value}`);
});

// Line-by-line diff:
const lineDiff = Diff.diffLines(oldText, newText);
lineDiff.forEach((part) => {
  const prefix = part.added ? '+' : part.removed ? '-' : ' ';
  part.value.split('\n').filter(Boolean).forEach(line => {
    console.log(`${prefix} ${line}`);
  });
});

// Create unified patch:
const patch = Diff.createPatch('file.js', oldText, newText);
console.log(patch);

// Apply a patch:
const patched = Diff.applyPatch(oldText, patch);

Python (difflib)

import difflib

old_lines = """function greet(name) {
  console.log("Hello, " + name);
  return true;
}""".splitlines(keepends=True)

new_lines = """function greet(name) {
  console.log(`Hello, ${name}!`);
  return true;
}""".splitlines(keepends=True)

# Unified diff:
diff = difflib.unified_diff(
    old_lines, 
    new_lines,
    fromfile='old.js',
    tofile='new.js',
    n=3  # Context lines
)
print(''.join(diff))

# HTML diff (side-by-side):
differ = difflib.HtmlDiff()
html = differ.make_file(old_lines, new_lines, 
                         fromdesc='Original', todesc='Modified')
with open('diff.html', 'w') as f:
    f.write(html)

# Sequence matcher (similarity ratio):
matcher = difflib.SequenceMatcher(None, 'Hello World', 'Hello Python')
print(matcher.ratio())  # 0.636... (similarity 0-1)

# Find similar strings:
matches = difflib.get_close_matches('pytohn', ['python', 'Cython', 'jython'], n=3)
print(matches)  # ['python', 'jython', 'Cython']

Three-way merge

Git merge uses three-way diff: the base version, your changes, and their changes.

Base:     Line A, Line B, Line C
Ours:     Line A, Line B_modified, Line C
Theirs:   Line A, Line B, Line C_modified

Result:   Line A, Line B_modified, Line C_modified  (auto-merge)

A conflict occurs when both sides modify the same area:

Base:     Line A, Line B, Line C
Ours:     Line A, Line B_ours, Line C
Theirs:   Line A, Line B_theirs, Line C

Conflict:
<<<<<<< HEAD
Line B_ours
=======
Line B_theirs
>>>>>>> feature/branch

The conflict markers show both versions; you choose which to keep.

Character-level vs line-level diff

Line-level diff: Compares whole lines. Fast, standard for code, natural for humans to read. A change anywhere on a line marks the whole line changed.

Word-level diff: Compares individual words. Shows which words changed within a line. Useful for prose/documentation.

Character-level diff: Most granular. Shows exact character positions that changed. Useful for cryptic single-character changes (typos, encoding differences).

When to use each:

  • Source code review: line-level
  • Documentation/markdown: word-level
  • Configuration files with long single-line values: character-level

Semantic diff

Standard diff is syntactic — it doesn’t understand code structure. Tools like difftastic use language-aware parsing to produce semantically meaningful diffs:

// Code change:
const x = {a: 1, b: 2, c: 3};
// to:
const x = {b: 2, a: 1, c: 3};

// Syntactic diff: 1 line changed
// Semantic diff: no structural change (just key reorder)

Language-aware diff tools reduce noise in code reviews by ignoring reformatting when the logic is unchanged.


Related posts

Related tool

Text Diff

Compare two text blocks line-by-line or word-by-word. Unified and split view. Shows added, removed, and changed segments with full color coding.

Written by Mian Ali Khalid. Part of the Dev Productivity pillar.