How to retrieve data from XML file using Xpath and Python

I have received many requests from readers who need a way to read information from an XML file and use it in their pipelines. So, let’s learn how to read an XML file with Python and extract whatever you need. The approach covers many different scenarios. For our example, we will use a well-known XML schema: the Maven POM, which most Java applications use for dependency management. Regardless of the schema of your XML file, the same technique should work.

Why use Python with XML?

Regular Bits Lovers readers know that we have published many articles about DevOps good practices and GitLab. I am writing this one because I built a solution for a project with several steps that extract information from an XML file and use it as input for another automation process.

Python is a powerful language for processing data, and it’s one of the most popular languages for Big Data and Machine Learning because it provides a lot of libraries that make our life easier.

Also, using Python within a GitLab pipeline is easy and flexible because it’s just a matter of creating a Docker image and packaging everything you need inside it.

Extracting Data from XML using Python

Sooner or later, almost everyone in the computing world bumps into XML, and most developers and engineers are familiar with it. One way to look up information within an XML file is to navigate its hierarchical structure, and for this, we can use XPath.

What is XPath?

XPath uses path expressions to select nodes or node-sets in an XML file. These path expressions look like the paths you see when browsing files in a traditional computer file system.

You can use XPath expressions from many programming languages, such as Java, JavaScript, PHP, C, C++, and Python, and XPath is also used inside other XML technologies such as XML Schema.

The ElementTree XML API

For Python, one library that supports this is lxml.

The lxml XML toolkit is a Pythonic binding to the C libraries libxml2 and libxslt. It stands out because it combines the speed and XML features of these libraries with the simplicity of a native Python API that is largely compatible with the well-known ElementTree API.
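To make the idea of path expressions more concrete, here is a minimal sketch. The sample document and its element names are made up for illustration and are not part of the Maven example. Because the calls used here belong to the ElementTree-style API, the sketch falls back to the standard library module when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # the ElementTree-style calls below also exist in the stdlib
    import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration
SAMPLE = b"""<library>
  <shelf name="python">
    <book><title>First Book</title></book>
    <book><title>Second Book</title></book>
  </shelf>
  <shelf name="java">
    <book><title>Third Book</title></book>
  </shelf>
</library>"""

root = ET.fromstring(SAMPLE)

# One step per nesting level, like folders in a file path
first_title = root.findtext("./shelf/book/title")

# A predicate in square brackets filters on an attribute value
java_titles = [t.text for t in root.findall("./shelf[@name='java']/book/title")]

# ".//" searches at any depth below the current element
all_titles = [t.text for t in root.findall(".//title")]

print(first_title)
print(java_titles)
print(all_titles)
```

The find, findall, and findtext methods accept a limited XPath subset. With lxml, the xpath() method on elements is also available for full XPath 1.0 expressions.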

XML Example

Let’s analyze this file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
	
	<parent>
		<groupId>com.bitslovers</groupId>
		<artifactId>blog-base</artifactId>
		<version>1.2.0</version>
	</parent>
	
	<version>1.0.0</version>
	<groupId>com.bitslovers</groupId>
	<artifactId>the-best-blog</artifactId>
	

	<name>Bits Lovers API</name>
	<packaging>war</packaging>
	<description>How to work with XML and Python</description>
	<build>

		<resources>
			<resource>
				<directory>src/main/resources</directory>
				<filtering>true</filtering>
			</resource>
		</resources>

		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.1</version>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-surefire-plugin</artifactId>
				<version>2.19.1</version>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-war-plugin</artifactId>
				<version>3.0.0</version>
			</plugin>

			
		</plugins>
	</build>

	<dependencies>
		

		<dependency>
			<groupId>com.fasterxml.jackson.jaxrs</groupId>
			<artifactId>jackson-jaxrs-json-provider</artifactId>
			<version>2.5.4</version>
		</dependency>

		<dependency>
			<groupId>io.swagger</groupId>
			<artifactId>swagger-jaxrs</artifactId>
			<version>1.5.15</version>
		</dependency>

		<dependency>
			<groupId>io.swagger</groupId>
			<artifactId>swagger-jersey2-jaxrs</artifactId>
			<version>1.5.15</version>
		</dependency>
		
		<dependency>
			<groupId>io.swagger</groupId>
			<artifactId>swagger-annotations</artifactId>
			<version>1.5.15</version>
		</dependency>
		
	</dependencies>

</project>

We need to retrieve the application version (1.0.0) from this file. But, as you may have noticed, there are several <version></version> elements. So, how do we get the one we need?

The application version is the <version> element that is a direct child of the root <project> element, no matter where it sits among its siblings. The other <version> elements are nested deeper, inside <parent>, <plugin>, or <dependency>. You can use the following code:

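#!/usr/bin/env python3
# Usage: pass the path to the POM file as the first argument
import os
import sys

from lxml import etree as ET

# Resolve the POM path against the current working directory
pom_xml = os.path.join(os.getcwd(), sys.argv[1])

# remove_comments=False keeps XML comments in the parsed tree
parser = ET.XMLParser(remove_comments=False)
xml = ET.parse(pom_xml, parser=parser)

# "{*}" matches any namespace, so the POM namespace does not have to be spelled out.
# "./{*}version" selects the <version> that is a direct child of <project>.
version = xml.find("./{*}version")
groupId = xml.find("./{*}groupId")
packaging = xml.find("./{*}packaging")
artifactId = xml.find("./{*}artifactId")

print("POM_VERSION=" + version.text)
print("POM_GROUPID=" + groupId.text)
print("POM_PACKAGING=" + packaging.text)
print("POM_ARTIFACTID=" + artifactId.text)

# The same lookups one level down, inside <parent> (not printed here)
version_parent = xml.find("./{*}parent/{*}version")
groupId_parent = xml.find("./{*}parent/{*}groupId")
artifactId_parent = xml.find("./{*}parent/{*}artifactId")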

The script is pretty straightforward; it also retrieves other information, such as groupId, packaging, and artifactId. The expression below does all the work for us:

version = xml.find("./{*}version")

But look at the expression above. How can we retrieve the version of the parent? Let’s do this exercise to understand how it works:

version_parent = xml.find("./{*}parent/{*}version")

As you may have noticed, the XPath changed: we provide a different path, one that points to where the version of our parent project lives. Simple, right?
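To see how the path decides which <version> you get, here is a small sketch that runs the lookups against a trimmed-down copy of the POM held in a string. Two things in it are assumptions added for illustration: the top-level <groupId> is left out on purpose, and pom_value is a hypothetical helper, not part of lxml. It falls back to the standard library when lxml is not installed; the "{*}" wildcard needs Python 3.8 or newer there.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; "{*}" needs Python 3.8+
    import xml.etree.ElementTree as ET

# Trimmed-down copy of the POM; the top-level <groupId> is omitted on purpose
POM = b"""<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent>
    <groupId>com.bitslovers</groupId>
    <artifactId>blog-base</artifactId>
    <version>1.2.0</version>
  </parent>
  <version>1.0.0</version>
  <artifactId>the-best-blog</artifactId>
  <dependencies>
    <dependency>
      <groupId>io.swagger</groupId>
      <artifactId>swagger-jaxrs</artifactId>
      <version>1.5.15</version>
    </dependency>
  </dependencies>
</project>"""

root = ET.fromstring(POM)

# Direct child of <project>: the application version
app_version = root.findtext("./{*}version")

# One level deeper, inside <parent>
parent_version = root.findtext("./{*}parent/{*}version")

# ".//" searches every depth, so all versions come back in document order
all_versions = [v.text for v in root.findall(".//{*}version")]

# Alternative to the wildcard: map a prefix to the POM namespace explicitly
MAVEN_NS = {"m": "http://maven.apache.org/POM/4.0.0"}
same_version = root.findtext("./m:version", namespaces=MAVEN_NS)


def pom_value(project, tag):
    """Hypothetical helper: read a top-level value, else fall back to <parent>."""
    value = project.findtext("./{*}" + tag)
    if value is None:
        value = project.findtext("./{*}parent/{*}" + tag)
    return value


print(app_version, parent_version, all_versions)
print(pom_value(root, "groupId"))
```

The fallback is worth considering because Maven lets a project inherit groupId and version from its parent, in which case the top-level element is missing and find returns None.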

Conclusion

This same approach saved me thousands of hours and helped my team automate the build processes of some old applications. And because it’s Python, no matter where we need to run this process, we only need to guarantee that Python and the lxml package are available there.
