X Xerobit

XML Parsing in JavaScript and Python — DOM, SAX, and XPath

Parse XML in JavaScript using DOMParser and the browser DOM, and in Python using ElementTree, lxml, and BeautifulSoup. Includes XPath queries, handling namespaces, and...

Mian Ali Khalid · · 5 min read
Use the tool
XML Formatter
Format, validate, and beautify XML documents.
Open XML Formatter →

XML parsing requires understanding tree traversal, namespace prefixes, and XPath. Modern JavaScript has DOMParser built-in; Python has xml.etree.ElementTree in the standard library.

Use the XML Formatter to format and validate XML before parsing.

JavaScript: DOMParser

const xml = `
<library>
  <book id="1">
    <title>Clean Code</title>
    <author>Robert Martin</author>
    <year>2008</year>
  </book>
  <book id="2">
    <title>The Pragmatic Programmer</title>
    <author>David Thomas</author>
    <year>1999</year>
  </book>
</library>`;

const parser = new DOMParser();
const doc = parser.parseFromString(xml, 'application/xml');

// Check for parse errors:
const error = doc.querySelector('parsererror');
if (error) throw new Error(error.textContent);

// Extract all books:
const books = [...doc.querySelectorAll('book')].map(book => ({
  id: book.getAttribute('id'),
  title: book.querySelector('title').textContent,
  author: book.querySelector('author').textContent,
  year: parseInt(book.querySelector('year').textContent),
}));

console.log(books);
// [{ id: '1', title: 'Clean Code', author: 'Robert Martin', year: 2008 }, ...]

JavaScript: Fetch and parse XML

async function fetchXml(url) {
  const res = await fetch(url);
  const text = await res.text();
  const parser = new DOMParser();
  const doc = parser.parseFromString(text, 'application/xml');
  
  const error = doc.querySelector('parsererror');
  if (error) throw new Error(`XML parse error: ${error.textContent}`);
  
  return doc;
}

// Parse RSS feed:
const doc = await fetchXml('https://example.com/feed.rss');
const items = [...doc.querySelectorAll('item')].map(item => ({
  title: item.querySelector('title')?.textContent,
  link: item.querySelector('link')?.textContent,
  pubDate: item.querySelector('pubDate')?.textContent,
  description: item.querySelector('description')?.textContent,
}));

Python: xml.etree.ElementTree

import xml.etree.ElementTree as ET

xml_string = """
<library>
  <book id="1">
    <title>Clean Code</title>
    <author>Robert Martin</author>
  </book>
  <book id="2">
    <title>The Pragmatic Programmer</title>
    <author>David Thomas</author>
  </book>
</library>"""

root = ET.fromstring(xml_string)

# Iterate over children:
for book in root.findall('book'):
    book_id = book.get('id')
    title = book.findtext('title')
    author = book.findtext('author')
    print(f"ID: {book_id}, Title: {title}, Author: {author}")

# Parse from file:
tree = ET.parse('books.xml')
root = tree.getroot()

# XPath-like expressions (limited):
titles = [el.text for el in root.findall('.//title')]

# Serialize back to XML:
tree.write('output.xml', encoding='unicode', xml_declaration=True)

Python: lxml (full XPath support)

from lxml import etree

tree = etree.parse('books.xml')

# Full XPath:
titles = tree.xpath('//book/title/text()')
# ['Clean Code', 'The Pragmatic Programmer']

# Complex XPath:
books_after_2000 = tree.xpath('//book[year > 2000]')

# XPath with attribute:
book_1 = tree.xpath('//book[@id="1"]')[0]

Handling namespaces

Namespaces are the most frustrating part of XML parsing:

<!-- RSS 2.0 with Dublin Core namespace: -->
<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Article Title</title>
      <dc:creator>John Doe</dc:creator>
    </item>
  </channel>
</rss>
import xml.etree.ElementTree as ET

NS = {
    'dc': 'http://purl.org/dc/elements/1.1/',
}

root = ET.fromstring(xml_string)
for item in root.findall('.//item'):
    title = item.findtext('title')
    creator = item.findtext('dc:creator', namespaces=NS)
    print(f"{title} by {creator}")
// JavaScript: namespace-aware queries need getElementsByTagNameNS:
const dcNS = 'http://purl.org/dc/elements/1.1/';
const creators = [...doc.getElementsByTagNameNS(dcNS, 'creator')];
creators.forEach(el => console.log(el.textContent));

XML to JSON conversion

import xml.etree.ElementTree as ET

def xml_to_dict(element):
    """Recursively convert XML element to dict."""
    result = {}
    
    # Attributes:
    if element.attrib:
        result['@attributes'] = element.attrib
    
    # Text content:
    text = element.text and element.text.strip()
    if text:
        result['#text'] = text
    
    # Child elements:
    for child in element:
        child_data = xml_to_dict(child)
        if child.tag in result:
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(child_data)
        else:
            result[child.tag] = child_data
    
    return result

root = ET.fromstring(xml_string)
data = {root.tag: xml_to_dict(root)}

Related posts

Related tool

XML Formatter

Format, validate, and beautify XML documents.

Written by Mian Ali Khalid. Part of the Dev Productivity pillar.