Parsing XML with Python and XPath: A Practical Pipeline Guide

Written by Bits Lovers

Last year I spent two days debugging a build pipeline because our CI system was reading the wrong version from a Maven POM. The XPath query looked correct, but it was pulling the parent version instead of the project’s direct version. Same element name, completely different values.

This is a practical guide to parsing XML with Python and XPath for automation pipelines. Not the theoretical overview — the stuff that breaks in real systems.

Why XPath for XML Parsing?

XPath is a query language for navigating XML documents. You describe the path to the data you want, and XPath finds it. It’s cleaner than parsing by hand and more flexible than assuming fixed positions.

The standard library’s xml.etree supports only a limited subset of XPath, but lxml supports the full XPath 1.0 specification, and lxml is what you should be using anyway. It’s fast (libxml2 in C underneath), Pythonic, and handles namespaces properly. On large documents it is typically several times faster than ElementTree.

# Install lxml
pip install lxml

# Or in a Docker container's requirements.txt
echo "lxml>=5.0.0" >> requirements.txt
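Before going further, here’s a minimal end-to-end query so you can see the shape of the API. The two-element document is made up purely for illustration:

```python
from lxml import etree

# A tiny document with no namespaces, so plain paths work
doc = etree.fromstring("<project><version>2.1.0</version></project>")

# XPath: select the text of the <version> child of <project>
version = doc.xpath("/project/version/text()")[0]
print(version)  # 2.1.0
```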

The XPath Expression Syntax

XPath expressions look like file paths, but with different syntax:

# Navigate to a direct child
./parent/child

# Navigate to any descendant
.//child

# Match an element in any namespace ({*} is an ElementPath wildcard)
./{*}version

# Select by attribute
.//element[@name='value']

# Select the first matching element
(./element)[1]

# Count elements (an XPath 1.0 function; works with .xpath(), not .find())
count(//plugin)

# Check existence (also XPath 1.0)
boolean(//plugin[artifactId='maven-surefire-plugin'])

The {*} syntax is a namespace wildcard from the ElementPath mini-language used by find() and findall(); lxml supports it, and the standard library added it in Python 3.8. It is not XPath, so it won’t work inside xpath() calls. If your XML uses namespaces (and many do), you either need to declare them or use {*} as a wildcard.
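A quick demonstration of the split: find() accepts the wildcard, while xpath() needs either a namespace map or local-name(). The document here is invented for illustration:

```python
from lxml import etree

doc = etree.fromstring(
    '<project xmlns="http://maven.apache.org/POM/4.0.0">'
    '<version>2.1.0</version></project>'
)

# ElementPath wildcard: works in find()/findall()
print(doc.find("{*}version").text)  # 2.1.0

# XPath has no {*} syntax; use local-name() instead
print(doc.xpath("//*[local-name()='version']/text()")[0])  # 2.1.0
```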

A Real XML Example: Parsing Maven POMs

Here’s a POM I use for testing. It has a parent reference, so the <version> tag appears in multiple places:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <!-- Parent version is 1.2.0 -->
    <parent>
        <groupId>com.bitslovers</groupId>
        <artifactId>blog-base</artifactId>
        <version>1.2.0</version>
    </parent>

    <!-- This is the version I actually want: 2.1.0 -->
    <groupId>com.bitslovers</groupId>
    <artifactId>data-pipeline</artifactId>
    <version>2.1.0</version>
    <packaging>jar</packaging>

    <name>Data Pipeline Service</name>
    <description>ETL pipeline for analytics</description>

    <properties>
        <java.version>17</java.version>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.4.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.15.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.0.0</version>
            </plugin>
        </plugins>
    </build>
</project>

Writing the Python Script

Here’s a complete script that extracts the data you typically need from a POM:

#!/usr/bin/env python3
"""
Parse Maven POM and extract build configuration.
Used in CI/CD pipelines to inject version numbers into deployments.
"""
from lxml import etree as ET
import sys
import os

def parse_pom(pom_path):
    """Parse a Maven POM file and extract key metadata."""
    parser = ET.XMLParser(remove_comments=True)  # drop comments so child iteration sees only elements
    tree = ET.parse(pom_path, parser=parser)

    # Get project-level elements (first occurrence under project root)
    # These are direct children of <project>
    project_version = tree.find("./{*}version")
    group_id = tree.find("./{*}groupId")
    artifact_id = tree.find("./{*}artifactId")
    packaging = tree.find("./{*}packaging")

    # Get parent version separately
    parent_version = tree.find("./{*}parent/{*}version")
    parent_group_id = tree.find("./{*}parent/{*}groupId")
    parent_artifact_id = tree.find("./{*}parent/{*}artifactId")

    # Get properties (useful for build configuration)
    properties = {}
    for prop in tree.findall("./{*}properties/*"):  # '*' matches any child element
        # Get the tag name as text (without namespace)
        prop_name = prop.tag.split('}')[1] if '}' in prop.tag else prop.tag
        properties[prop_name] = prop.text

    # Get a specific dependency by artifactId
    def find_dependency(artifact_id_query):
        """Find a dependency by its artifactId."""
        for dep in tree.findall(".//{*}dependency"):
            aid = dep.find("{*}artifactId")
            if aid is not None and aid.text == artifact_id_query:
                return {
                    'groupId': dep.find("{*}groupId").text if dep.find("{*}groupId") is not None else None,
                    'artifactId': aid.text,
                    'version': dep.find("{*}version").text if dep.find("{*}version") is not None else None,
                }
        return None

    spark_dep = find_dependency("spark-core_2.12")

    return {
        'project_version': project_version.text if project_version is not None else None,
        'group_id': group_id.text if group_id is not None else None,
        'artifact_id': artifact_id.text if artifact_id is not None else None,
        'packaging': packaging.text if packaging is not None else 'jar',
        'parent_version': parent_version.text if parent_version is not None else None,
        'parent_group_id': parent_group_id.text if parent_group_id is not None else None,
        'parent_artifact_id': parent_artifact_id.text if parent_artifact_id is not None else None,
        'properties': properties,
        'spark_version': spark_dep['version'] if spark_dep else None,
    }


def main():
    if len(sys.argv) < 2:
        pom_file = "pom.xml"
    else:
        pom_file = sys.argv[1]

    if not os.path.exists(pom_file):
        print(f"Error: {pom_file} not found", file=sys.stderr)
        sys.exit(1)

    data = parse_pom(pom_file)

    print("# Generated from pom.xml")
    print(f"POM_VERSION={data['project_version']}")
    print(f"POM_GROUPID={data['group_id']}")
    print(f"POM_ARTIFACTID={data['artifact_id']}")
    print(f"POM_PACKAGING={data['packaging']}")
    print(f"POM_PARENT_VERSION={data['parent_version']}")

    if data['spark_version']:
        print(f"SPARK_VERSION={data['spark_version']}")

    print(f"# Properties: {data['properties']}")


if __name__ == "__main__":
    main()

Running this against the example POM:

python parse_pom.py /path/to/pom.xml
# Output:
# Generated from pom.xml
# POM_VERSION=2.1.0
# POM_GROUPID=com.bitslovers
# POM_ARTIFACTID=data-pipeline
# POM_PACKAGING=jar
# POM_PARENT_VERSION=1.2.0
# SPARK_VERSION=3.4.0
# Properties: {'java.version': '17', 'maven.compiler.source': '17', 'maven.compiler.target': '17'}

Using This in a CI/CD Pipeline

This is where XML parsing becomes valuable. You can extract version numbers and pass them to deployment scripts:

# .gitlab-ci.yml example
stages:
  - build
  - deploy

build:
  stage: build
  script:
    - pip install "lxml>=5.0.0"
    - python parse_pom.py pom.xml > build.env
    - export $(grep -v '^#' build.env | xargs)
    - mvn package -DskipTests
    - docker build -t registry.example.com/${POM_ARTIFACTID}:${POM_VERSION} .
  artifacts:
    paths:
      - build.env

deploy:
  stage: deploy
  script:
    - export $(grep -v '^#' build.env | xargs)
    - kubectl set image deployment/app app=registry.example.com/${POM_ARTIFACTID}:${POM_VERSION}

The POM version becomes the Docker tag becomes the Kubernetes deployment version. One source of truth.

XPath Gotchas That Will Break Your Pipeline

1. The Namespace Problem

This XML has a default namespace: xmlns="http://maven.apache.org/POM/4.0.0". If you use find("./version") without the {*} syntax, it returns None. The namespace prefix changes the element name to {http://maven.apache.org/POM/4.0.0}version.

# WRONG - won't find anything
version = tree.find("./version")

# RIGHT - handles any namespace
version = tree.find("./{*}version")

2. Multiple Matches and Position

If you use findall() and then index into the result, be precise about where the path starts and how deep it searches:

# This gets the FIRST dependency
first_dep = tree.findall(".//{*}dependency")[0]

# This gets the FIRST plugin under <build>/<plugins>
first_plugin = tree.find(".//{*}build/{*}plugins/{*}plugin")

# But this gets ALL plugins anywhere in the document
all_plugins = tree.findall(".//{*}plugin")

The difference between .// (any depth below the starting point) and ./path/to/ (an exact location) matters.
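The difference is easy to see on a toy document (element names invented for the example):

```python
from lxml import etree

doc = etree.fromstring(
    "<root><a><item>nested</item></a><item>top</item></root>"
)

# ./item matches direct children of the context node only
print([e.text for e in doc.findall("./item")])   # ['top']

# .//item matches at any depth below the context node
print([e.text for e in doc.findall(".//item")])  # ['nested', 'top']
```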

3. Missing Elements

When an element doesn’t exist, find() returns None, not an empty list. Accessing .text on None raises an AttributeError.

# WRONG - crashes if version is missing
version_text = tree.find("./{*}version").text

# RIGHT - safe handling
version_elem = tree.find("./{*}version")
version_text = version_elem.text if version_elem is not None else "unknown"
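In pipeline scripts I wrap that check in a small helper so every lookup has a default. The name find_text is my own, not an lxml API:

```python
from lxml import etree

def find_text(tree, path, default=None):
    """Return the text of the first match, or a default if the element is absent."""
    elem = tree.find(path)
    return elem.text if elem is not None else default

doc = etree.fromstring("<project><version>2.1.0</version></project>")
print(find_text(doc, "./version"))             # 2.1.0
print(find_text(doc, "./groupId", "unknown"))  # unknown
```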

4. Text Content in Mixed Content

If an element contains both text and child elements, .text only gets the first text node:

<description>This is text.
  <emphasis>emphasized</emphasis>
  And more text.
</description>

description.text gives you "This is text.\n  " but not the text after the child element. To collect all text content, use tree.xpath("string(//description)") or join the element’s itertext().
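itertext() walks every text node in document order, which is usually what you want for mixed content:

```python
from lxml import etree

doc = etree.fromstring(
    "<description>This is text. <emphasis>emphasized</emphasis>"
    " And more text.</description>"
)

print(doc.text)                 # 'This is text. ' (stops at the first child)
print("".join(doc.itertext()))  # 'This is text. emphasized And more text.'
```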

Pre-compiling XPath Expressions for Reuse

If you’re parsing many similar XML files in a pipeline, pre-compile your XPath expressions. This is a performance win for repetitive operations:

from glob import glob
from lxml import etree

# Pre-compile once
title_xpath = etree.XPath("//book/title")
currency_xpath = etree.XPath("//book[price > 25]/price/@currency")

# Reuse across many documents — faster than re-parsing the expression each time
for xml_file in glob("catalogs/*.xml"):
    tree = etree.parse(xml_file)
    titles = title_xpath(tree)
    currencies = currency_xpath(tree)
    print(xml_file, [t.text for t in titles], list(currencies))

Security: The “Billion Laughs” Attack

Here’s a gotcha that matters any time input isn’t fully under your control: the Python standard library’s xml.etree.ElementTree is vulnerable to the “billion laughs” attack, in which nested entity definitions expand exponentially. A malicious XML file of a few hundred bytes can expand to gigabytes in memory and crash your parser or your system.

# DANGEROUS: stdlib ElementTree can be exploited
import xml.etree.ElementTree as ET
tree = ET.parse("untrusted.xml")  # Don't do this with untrusted input

# SAFE: Use defusedxml for untrusted sources
from defusedxml import ElementTree as defusedET

def safe_parse(xml_bytes):
    try:
        return defusedET.fromstring(xml_bytes)
    except Exception as e:
        print(f"XML parsing blocked for security: {e}")
        return None

If you’re parsing XML from user uploads, web requests, or third-party APIs, use defusedxml. It blocks entity expansion, external entity references, and other attack vectors. lxml also has protections, but defusedxml is the one designed for hostile input.
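defusedxml refuses documents that define entities, which is exactly the mechanism the attack relies on. The payload below is a harmless single-entity version of the pattern:

```python
from defusedxml import ElementTree as defusedET
from defusedxml.common import EntitiesForbidden

payload = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE lolz [<!ENTITY lol "lol">]>'
    '<root>&lol;</root>'
)

try:
    defusedET.fromstring(payload)
except EntitiesForbidden:
    # Entity definitions in the DTD are rejected before any expansion happens
    print("blocked: entity definitions are forbidden")
```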

XSD Validation: Catch Data Quality Issues Early

Before you start parsing, validate your XML against an XSD schema. This catches data quality issues at the parse stage rather than deep in your business logic:

import xmlschema

# Load your schema
schema = xmlschema.XMLSchema10("path/to/schema.xsd")

# Validate incoming XML
with open("data.xml", "rb") as f:
    xml_bytes = f.read()

if schema.is_valid(xml_bytes):
    # decode() returns plain Python data (dicts and lists), not elements
    data = schema.decode(xml_bytes)
    print(f"Valid XML, decoded: {data}")
else:
    print("Invalid XML:")
    for error in schema.iter_errors(xml_bytes):
        print(f"  {error.reason}")

The xmlschema library (pip install xmlschema) is the standard choice for XSD validation in Python. It’s particularly useful for ETL pipelines where upstream systems may send schema-invalid XML and you want to catch the errors before they propagate.

Namespaces: The Complete Pattern

When your XML has multiple namespaces, declare them explicitly rather than using the {*} wildcard:

from lxml import etree

ns_xml = (
    b'<?xml version="1.0"?>'
    b'<root xmlns:atom="http://www.w3.org/2005/Atom"'
    b'      xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<atom:entry><atom:title>Example</atom:title>'
    b'<dc:creator>John Doe</dc:creator></atom:entry></root>'
)

ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/elements/1.1/"
}

tree = etree.fromstring(ns_xml)

# Explicit namespace handling — more readable than wildcards
titles = tree.xpath("//atom:title/text()", namespaces=ns)
creators = tree.xpath("//dc:creator/text()", namespaces=ns)

ElementTree vs lxml: Which to Use When

Scenario                               Recommended
-------------------------------------  -----------
Large documents, performance-critical  lxml
Untrusted XML input                    defusedxml
Simple tasks, no external deps         ElementTree
Full XPath 1.0 support needed          lxml
XSD validation                         xmlschema
ETL pipelines, bulk parsing            lxml

The standard library’s ElementTree supports only a small subset of XPath (the ElementPath mini-language): no functions like count() or string(), no attribute value results, and only basic predicates. lxml implements the full XPath 1.0 specification. For anything beyond basic navigation, use lxml.
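For example, XPath 1.0 functions work in lxml’s xpath() but have no equivalent in ElementTree’s path language (toy document invented for the demo):

```python
from lxml import etree

doc = etree.fromstring("<plugins><plugin/><plugin/><plugin/></plugins>")

# Numeric and boolean XPath functions return Python values directly
print(doc.xpath("count(//plugin)"))    # 3.0
print(doc.xpath("boolean(//plugin)"))  # True
```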

When XML Parsing Is the Wrong Tool

Sometimes DOM parsing with XPath is the wrong fit. If you see:

  • Very large files (GBs) — use streaming parsers (SAX, iterparse)
  • Malformed XML — use a permissive parser or switch to regex (yes, really, in specific cases)
  • Config files that happen to use XML syntax — consider whether a simpler format (YAML, TOML) would be better for the project

For CI/CD pipeline work with Maven POMs, Gradle builds, and similar development artifacts, XPath + lxml is the right tool. The data is structured, the files are small, and the queries are predictable.

Memory and Streaming: Large File Handling

For XML files larger than 100MB, the DOM approach (loading the entire document into memory) breaks down. Use iterparse for streaming:

from lxml import etree

def stream_large_xml(filepath, tag_filter="{*}dependency"):
    """Process large XML files without loading everything into memory."""
    # The tag argument accepts the {*} wildcard, so namespaced elements match
    context = etree.iterparse(filepath, events=("end",), tag=tag_filter)
    for event, elem in context:
        # Process one element at a time
        artifact_id = elem.find("{*}artifactId")
        version = elem.find("{*}version")
        yield {
            'artifact': artifact_id.text if artifact_id is not None else None,
            'version': version.text if version is not None else None
        }
        # Clear the element, then drop already-processed siblings so the
        # partial tree behind the parser doesn't keep growing
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This keeps memory usage flat regardless of file size. For files over 500MB, streaming is not optional — it’s the only viable approach.
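A self-contained toy run of the same pattern; there are no namespaces here, so the tag filter is just the plain element name:

```python
import io
from lxml import etree

xml = (b"<deps>"
       b"<dependency><artifactId>lib-a</artifactId><version>1.0</version></dependency>"
       b"<dependency><artifactId>lib-b</artifactId><version>2.0</version></dependency>"
       b"</deps>")

# iterparse fires an "end" event once each <dependency> is fully read
for _, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="dependency"):
    print(elem.find("artifactId").text, elem.find("version").text)
    elem.clear()  # free the processed subtree
```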

For more on pipeline automation, the posts on GitLab CI variable configuration and Terraform state management cover other aspects of infrastructure-as-code workflows. For XML-heavy build systems, the GitLab CI cache post covers caching strategies that pair well with data pipeline scripts.
