Parsing XML with Python and XPath: A Practical Pipeline Guide

Written by Bits Lovers

Last year I spent two days debugging a build pipeline because our CI system was reading the wrong version from a Maven POM. The XPath query looked correct, but it was pulling the parent version instead of the project’s direct version. Same element name, completely different values.

This is a practical guide to parsing XML with Python and XPath for automation pipelines. Not the theoretical overview — the stuff that breaks in real systems.

Why XPath for XML Parsing?

XPath is a query language for navigating XML documents. You describe the path to the data you want, and XPath finds it. It’s cleaner than parsing by hand and more flexible than assuming fixed positions.

The standard library’s xml.etree supports only a limited subset of XPath, but lxml supports the full XPath 1.0 specification, and lxml is what you should be using anyway. It’s fast (libxml2 in C underneath), Pythonic, and handles namespaces properly. On large documents it is typically several times faster than ElementTree.

# Install lxml
pip install lxml

# Or in a Docker container's requirements.txt
echo "lxml>=5.0.0" >> requirements.txt
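Before going further, here’s a minimal end-to-end query so you can see the shape of the API. The two-element document is made up purely for illustration:

```python
from lxml import etree

# A tiny document with no namespaces, so plain paths work
doc = etree.fromstring("<project><version>2.1.0</version></project>")

# XPath: select the text of the <version> child of <project>
version = doc.xpath("/project/version/text()")[0]
print(version)  # 2.1.0
```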

The XPath Expression Syntax

XPath expressions look like file paths, but with different syntax:

# Navigate to a direct child
./parent/child

# Navigate to any descendant
.//child

# Match an element in any namespace ({*} is an ElementPath wildcard)
./{*}version

# Select by attribute
.//element[@name='value']

# Select the first matching element
(./element)[1]

# Count elements (an XPath 1.0 function; works with .xpath(), not .find())
count(//plugin)

# Check existence (also XPath 1.0)
boolean(//plugin[artifactId='maven-surefire-plugin'])

The {*} syntax is a namespace wildcard from the ElementPath mini-language used by find() and findall(); lxml supports it, and the standard library added it in Python 3.8. It is not XPath, so it won’t work inside xpath() calls. If your XML uses namespaces (and many do), you either need to declare them or use {*} as a wildcard.
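A quick demonstration of the split: find() accepts the wildcard, while xpath() needs either a namespace map or local-name(). The document here is invented for illustration:

```python
from lxml import etree

doc = etree.fromstring(
    '<project xmlns="http://maven.apache.org/POM/4.0.0">'
    '<version>2.1.0</version></project>'
)

# ElementPath wildcard: works in find()/findall()
print(doc.find("{*}version").text)  # 2.1.0

# XPath has no {*} syntax; use local-name() instead
print(doc.xpath("//*[local-name()='version']/text()")[0])  # 2.1.0
```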

A Real XML Example: Parsing Maven POMs

Here’s a POM I use for testing. It has a parent reference, so the <version> tag appears in multiple places:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <!-- Parent version is 1.2.0 -->
    <parent>
        <groupId>com.bitslovers</groupId>
        <artifactId>blog-base</artifactId>
        <version>1.2.0</version>
    </parent>

    <!-- This is the version I actually want: 2.1.0 -->
    <groupId>com.bitslovers</groupId>
    <artifactId>data-pipeline</artifactId>
    <version>2.1.0</version>
    <packaging>jar</packaging>

    <name>Data Pipeline Service</name>
    <description>ETL pipeline for analytics</description>

    <properties>
        <java.version>17</java.version>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.4.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.15.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.0.0</version>
            </plugin>
        </plugins>
    </build>
</project>

Writing the Python Script

Here’s a complete script that extracts the data you typically need from a POM:

#!/usr/bin/env python3
"""
Parse Maven POM and extract build configuration.
Used in CI/CD pipelines to inject version numbers into deployments.
"""
from lxml import etree as ET
import sys
import os

def parse_pom(pom_path):
    """Parse a Maven POM file and extract key metadata."""
    parser = ET.XMLParser(remove_comments=True)  # drop comments so child iteration sees only elements
    tree = ET.parse(pom_path, parser=parser)

    # Get project-level elements (first occurrence under project root)
    # These are direct children of <project>
    project_version = tree.find("./{*}version")
    group_id = tree.find("./{*}groupId")
    artifact_id = tree.find("./{*}artifactId")
    packaging = tree.find("./{*}packaging")

    # Get parent version separately
    parent_version = tree.find("./{*}parent/{*}version")
    parent_group_id = tree.find("./{*}parent/{*}groupId")
    parent_artifact_id = tree.find("./{*}parent/{*}artifactId")

    # Get properties (useful for build configuration)
    properties = {}
    for prop in tree.findall("./{*}properties/*"):  # '*' matches any child element
        # Get the tag name as text (without namespace)
        prop_name = prop.tag.split('}')[1] if '}' in prop.tag else prop.tag
        properties[prop_name] = prop.text

    # Get a specific dependency by artifactId
    def find_dependency(artifact_id_query):
        """Find a dependency by its artifactId."""
        for dep in tree.findall(".//{*}dependency"):
            aid = dep.find("{*}artifactId")
            if aid is not None and aid.text == artifact_id_query:
                return {
                    'groupId': dep.find("{*}groupId").text if dep.find("{*}groupId") is not None else None,
                    'artifactId': aid.text,
                    'version': dep.find("{*}version").text if dep.find("{*}version") is not None else None,
                }
        return None

    spark_dep = find_dependency("spark-core_2.12")

    return {
        'project_version': project_version.text if project_version is not None else None,
        'group_id': group_id.text if group_id is not None else None,
        'artifact_id': artifact_id.text if artifact_id is not None else None,
        'packaging': packaging.text if packaging is not None else 'jar',
        'parent_version': parent_version.text if parent_version is not None else None,
        'parent_group_id': parent_group_id.text if parent_group_id is not None else None,
        'parent_artifact_id': parent_artifact_id.text if parent_artifact_id is not None else None,
        'properties': properties,
        'spark_version': spark_dep['version'] if spark_dep else None,
    }


def main():
    if len(sys.argv) < 2:
        pom_file = "pom.xml"
    else:
        pom_file = sys.argv[1]

    if not os.path.exists(pom_file):
        print(f"Error: {pom_file} not found", file=sys.stderr)
        sys.exit(1)

    data = parse_pom(pom_file)

    print("# Generated from pom.xml")
    print(f"POM_VERSION={data['project_version']}")
    print(f"POM_GROUPID={data['group_id']}")
    print(f"POM_ARTIFACTID={data['artifact_id']}")
    print(f"POM_PACKAGING={data['packaging']}")
    print(f"POM_PARENT_VERSION={data['parent_version']}")

    if data['spark_version']:
        print(f"SPARK_VERSION={data['spark_version']}")

    print(f"# Properties: {data['properties']}")


if __name__ == "__main__":
    main()

Running this against the example POM:

python parse_pom.py /path/to/pom.xml
# Output:
# Generated from pom.xml
# POM_VERSION=2.1.0
# POM_GROUPID=com.bitslovers
# POM_ARTIFACTID=data-pipeline
# POM_PACKAGING=jar
# POM_PARENT_VERSION=1.2.0
# SPARK_VERSION=3.4.0
# Properties: {'java.version': '17', 'maven.compiler.source': '17', 'maven.compiler.target': '17'}

Using This in a CI/CD Pipeline

This is where XML parsing becomes valuable. You can extract version numbers and pass them to deployment scripts:

# .gitlab-ci.yml example
stages:
  - build
  - deploy

build:
  stage: build
  script:
    - pip install "lxml>=5.0.0"
    - python parse_pom.py pom.xml > build.env
    - export $(grep -v '^#' build.env | xargs)
    - mvn package -DskipTests
    - docker build -t registry.example.com/${POM_ARTIFACTID}:${POM_VERSION} .
  artifacts:
    paths:
      - build.env

deploy:
  stage: deploy
  script:
    - export $(grep -v '^#' build.env | xargs)
    - kubectl set image deployment/app app=registry.example.com/${POM_ARTIFACTID}:${POM_VERSION}

The POM version becomes the Docker tag becomes the Kubernetes deployment version. One source of truth.

XPath Gotchas That Will Break Your Pipeline

1. The Namespace Problem

This XML has a default namespace: xmlns="http://maven.apache.org/POM/4.0.0". If you use find("./version") without the {*} syntax, it returns None. The namespace prefix changes the element name to {http://maven.apache.org/POM/4.0.0}version.

# WRONG - won't find anything
version = tree.find("./version")

# RIGHT - handles any namespace
version = tree.find("./{*}version")

2. Multiple Matches and Position

If you use findall() and then index into the result, be precise about where the path starts and how deep it searches:

# This gets the FIRST dependency
first_dep = tree.findall(".//{*}dependency")[0]

# This gets the FIRST plugin under <build>/<plugins>
first_plugin = tree.find(".//{*}build/{*}plugins/{*}plugin")

# But this gets ALL plugins anywhere in the document
all_plugins = tree.findall(".//{*}plugin")

The difference between .// (any depth below the starting point) and ./path/to/ (an exact location) matters.
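The difference is easy to see on a toy document (element names invented for the example):

```python
from lxml import etree

doc = etree.fromstring(
    "<root><a><item>nested</item></a><item>top</item></root>"
)

# ./item matches direct children of the context node only
print([e.text for e in doc.findall("./item")])   # ['top']

# .//item matches at any depth below the context node
print([e.text for e in doc.findall(".//item")])  # ['nested', 'top']
```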

3. Missing Elements

When an element doesn’t exist, find() returns None, not an empty list. Accessing .text on None raises an AttributeError.

# WRONG - crashes if version is missing
version_text = tree.find("./{*}version").text

# RIGHT - safe handling
version_elem = tree.find("./{*}version")
version_text = version_elem.text if version_elem is not None else "unknown"
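In pipeline scripts I wrap that check in a small helper so every lookup has a default. The name find_text is my own, not an lxml API:

```python
from lxml import etree

def find_text(tree, path, default=None):
    """Return the text of the first match, or a default if the element is absent."""
    elem = tree.find(path)
    return elem.text if elem is not None else default

doc = etree.fromstring("<project><version>2.1.0</version></project>")
print(find_text(doc, "./version"))             # 2.1.0
print(find_text(doc, "./groupId", "unknown"))  # unknown
```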

4. Text Content in Mixed Content

If an element contains both text and child elements, .text only gets the first text node:

<description>This is text.
  <emphasis>emphasized</emphasis>
  And more text.
</description>

description.text gives you "This is text.\n  " but not the text after the child element. To collect all text content, use tree.xpath("string(//description)") or join the element’s itertext().
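itertext() walks every text node in document order, which is usually what you want for mixed content:

```python
from lxml import etree

doc = etree.fromstring(
    "<description>This is text. <emphasis>emphasized</emphasis>"
    " And more text.</description>"
)

print(doc.text)                 # 'This is text. ' (stops at the first child)
print("".join(doc.itertext()))  # 'This is text. emphasized And more text.'
```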

Pre-compiling XPath Expressions for Reuse

If you’re parsing many similar XML files in a pipeline, pre-compile your XPath expressions. This is a performance win for repetitive operations:

from glob import glob
from lxml import etree

# Pre-compile once
title_xpath = etree.XPath("//book/title")
currency_xpath = etree.XPath("//book[price > 25]/price/@currency")

# Reuse across many documents — faster than re-parsing the expression each time
for xml_file in glob("catalogs/*.xml"):
    tree = etree.parse(xml_file)
    titles = title_xpath(tree)
    currencies = currency_xpath(tree)
    print(xml_file, [t.text for t in titles], list(currencies))

Security: The “Billion Laughs” Attack

Here’s a gotcha that matters any time input isn’t fully under your control: the Python standard library’s xml.etree.ElementTree is vulnerable to the “billion laughs” attack, in which nested entity definitions expand exponentially. A malicious XML file of a few hundred bytes can expand to gigabytes in memory and crash your parser or your system.

# DANGEROUS: stdlib ElementTree can be exploited
import xml.etree.ElementTree as ET
tree = ET.parse("untrusted.xml")  # Don't do this with untrusted input

# SAFE: Use defusedxml for untrusted sources
from defusedxml import ElementTree as defusedET

def safe_parse(xml_bytes):
    try:
        return defusedET.fromstring(xml_bytes)
    except Exception as e:
        print(f"XML parsing blocked for security: {e}")
        return None

If you’re parsing XML from user uploads, web requests, or third-party APIs, use defusedxml. It blocks entity expansion, external entity references, and other attack vectors. lxml also has protections, but defusedxml is the one designed for hostile input.
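defusedxml refuses documents that define entities, which is exactly the mechanism the attack relies on. The payload below is a harmless single-entity version of the pattern:

```python
from defusedxml import ElementTree as defusedET
from defusedxml.common import EntitiesForbidden

payload = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE lolz [<!ENTITY lol "lol">]>'
    '<root>&lol;</root>'
)

try:
    defusedET.fromstring(payload)
except EntitiesForbidden:
    # Entity definitions in the DTD are rejected before any expansion happens
    print("blocked: entity definitions are forbidden")
```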

XSD Validation: Catch Data Quality Issues Early

Before you start parsing, validate your XML against an XSD schema. This catches data quality issues at the parse stage rather than deep in your business logic:

import xmlschema

# Load your schema
schema = xmlschema.XMLSchema10("path/to/schema.xsd")

# Validate incoming XML
with open("data.xml", "rb") as f:
    xml_bytes = f.read()

if schema.is_valid(xml_bytes):
    # decode() returns plain Python data (dicts and lists), not elements
    data = schema.decode(xml_bytes)
    print(f"Valid XML, decoded: {data}")
else:
    print("Invalid XML:")
    for error in schema.iter_errors(xml_bytes):
        print(f"  {error.reason}")

The xmlschema library (pip install xmlschema) is the standard choice for XSD validation in Python. It’s particularly useful for ETL pipelines where upstream systems may send schema-invalid XML and you want to catch the errors before they propagate.

Namespaces: The Complete Pattern

When your XML has multiple namespaces, declare them explicitly rather than using the {*} wildcard:

from lxml import etree

ns_xml = (
    b'<?xml version="1.0"?>'
    b'<root xmlns:atom="http://www.w3.org/2005/Atom"'
    b'      xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<atom:entry><atom:title>Example</atom:title>'
    b'<dc:creator>John Doe</dc:creator></atom:entry></root>'
)

ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/elements/1.1/"
}

tree = etree.fromstring(ns_xml)

# Explicit namespace handling — more readable than wildcards
titles = tree.xpath("//atom:title/text()", namespaces=ns)
creators = tree.xpath("//dc:creator/text()", namespaces=ns)

ElementTree vs lxml: Which to Use When

Scenario                               Recommended
-------------------------------------  -----------
Large documents, performance-critical  lxml
Untrusted XML input                    defusedxml
Simple tasks, no external deps         ElementTree
Full XPath 1.0 support needed          lxml
XSD validation                         xmlschema
ETL pipelines, bulk parsing            lxml

The standard library’s ElementTree supports only a small subset of XPath (the ElementPath mini-language): no functions like count() or string(), no attribute value results, and only basic predicates. lxml implements the full XPath 1.0 specification. For anything beyond basic navigation, use lxml.
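For example, XPath 1.0 functions work in lxml’s xpath() but have no equivalent in ElementTree’s path language (toy document invented for the demo):

```python
from lxml import etree

doc = etree.fromstring("<plugins><plugin/><plugin/><plugin/></plugins>")

# Numeric and boolean XPath functions return Python values directly
print(doc.xpath("count(//plugin)"))    # 3.0
print(doc.xpath("boolean(//plugin)"))  # True
```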

When XML Parsing Is the Wrong Tool

Sometimes DOM parsing with XPath is the wrong fit. If you see:

  • Very large files (GBs) — use streaming parsers (SAX, iterparse)
  • Malformed XML — use a permissive parser or switch to regex (yes, really, in specific cases)
  • Config files that happen to use XML syntax — consider whether a simpler format (YAML, TOML) would be better for the project

For CI/CD pipeline work with Maven POMs, Gradle builds, and similar development artifacts, XPath + lxml is the right tool. The data is structured, the files are small, and the queries are predictable.

Memory and Streaming: Large File Handling

For XML files larger than 100MB, the DOM approach (loading the entire document into memory) breaks down. Use iterparse for streaming:

from lxml import etree

def stream_large_xml(filepath, tag_filter="{*}dependency"):
    """Process large XML files without loading everything into memory."""
    # The tag argument accepts the {*} wildcard, so namespaced elements match
    context = etree.iterparse(filepath, events=("end",), tag=tag_filter)
    for event, elem in context:
        # Process one element at a time
        artifact_id = elem.find("{*}artifactId")
        version = elem.find("{*}version")
        yield {
            'artifact': artifact_id.text if artifact_id is not None else None,
            'version': version.text if version is not None else None
        }
        # Clear the element, then drop already-processed siblings so the
        # partial tree behind the parser doesn't keep growing
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This keeps memory usage flat regardless of file size. For files over 500MB, streaming is not optional — it’s the only viable approach.
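A self-contained toy run of the same pattern; there are no namespaces here, so the tag filter is just the plain element name:

```python
import io
from lxml import etree

xml = (b"<deps>"
       b"<dependency><artifactId>lib-a</artifactId><version>1.0</version></dependency>"
       b"<dependency><artifactId>lib-b</artifactId><version>2.0</version></dependency>"
       b"</deps>")

# iterparse fires an "end" event once each <dependency> is fully read
for _, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="dependency"):
    print(elem.find("artifactId").text, elem.find("version").text)
    elem.clear()  # free the processed subtree
```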

For more on pipeline automation, the posts on GitLab CI variable configuration and Terraform state management cover other aspects of infrastructure-as-code workflows. For XML-heavy build systems, the GitLab CI cache post covers caching strategies that pair well with data pipeline scripts.
