Hackers Realm

Scraping XML Data using BeautifulSoup | Web Scraping | Python

In the digital age, vast amounts of information are stored and exchanged in various structured formats, including XML (Extensible Markup Language). XML is a versatile markup language used for data representation and exchange between different systems. When you need to extract and manipulate data from XML documents, whether for web scraping, data analysis, or automation, BeautifulSoup is a powerful Python library that comes to the rescue.

Scraping xml data using Beautiful soup

In this tutorial, we'll delve into the fundamentals of scraping XML data using BeautifulSoup. We'll explore how to parse XML documents, navigate their hierarchical structure, extract specific elements or attributes, and transform the data to meet your needs. Whether you're looking to gather data from websites, APIs, or any other source that provides data in XML format, BeautifulSoup is an invaluable tool in your data scraping toolkit. So, let's dive in and discover how to harness the power of BeautifulSoup for effective XML data extraction.



You can watch the video-based tutorial with a step-by-step explanation down below.


Import Modules

from bs4 import BeautifulSoup
import requests
import re
  • BeautifulSoup - used for web scraping and parsing HTML and XML documents.

  • requests - used for making HTTP requests to web servers.

  • re - Python's regular expression module. It is imported here but not used in the steps below.


Get Data from URL


First define the URL of the page whose content we want to fetch.

url = "https://www.w3schools.com/xml/note.xml"
  • It creates a variable called url and stores the web address as its value.

Next send an HTTP GET request to the page.

# get the data
xml = requests.get(url)
  • get(url): This is a method provided by the requests library. It sends an HTTP GET request to the specified url and returns the server's response.
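The call above assumes the request succeeds. As a minimal sketch, the request could be wrapped in a small helper that sets a timeout and raises on an error status. The name fetch_xml and the 10-second default are assumptions made for illustration, not part of the original tutorial.

```python
import requests


def fetch_xml(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes.

    fetch_xml is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx status.
    response.raise_for_status()
    return response.content


# Usage (performs a real network request):
# data = fetch_xml("https://www.w3schools.com/xml/note.xml")
```

Without a timeout, requests can wait indefinitely on an unresponsive server, so passing one is usually a sensible default for scraping scripts.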


Next display the content of the page.

xml.content

b'<?xml version="1.0" encoding="UTF-8"?>\n<note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don\'t forget me this weekend!</body>\n</note>'

  • xml.content holds the raw body of the HTTP response as bytes, which is why the output above starts with b'. The requests library also exposes the same body as a decoded string through xml.text.
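To make the bytes versus string distinction concrete, here is a small offline sketch. The variable raw is a hand-copied stand-in for xml.content, so no network request is needed; it is an assumption for illustration only.

```python
# Stand-in for xml.content: the same bytes shown in the output above.
raw = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<note>\n <to>Tove</to>\n <from>Jani</from>\n"
    b" <heading>Reminder</heading>\n"
    b" <body>Don't forget me this weekend!</body>\n</note>"
)

print(type(raw).__name__)  # bytes

# Decoding turns the bytes into a str; UTF-8 matches the XML declaration.
decoded = raw.decode("utf-8")
print(decoded.splitlines()[1])  # <note>
```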


Next parse the data.

# parse the data
soup = BeautifulSoup(xml.content, 'xml')
print(soup)

<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

  • The BeautifulSoup constructor creates a BeautifulSoup object from the response bytes. Passing 'xml' as the second argument selects the XML parser, which relies on the lxml package, so lxml must be installed. Once the object is created, you can use it to navigate the XML and extract information from it.
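The same parsing step can be tried without a network connection. The sketch below assumes lxml is installed and uses a hand-copied sample of the note document; sample, root, and child_names are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hand-copied stand-in for the downloaded document.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

soup = BeautifulSoup(sample, "xml")

# find_all(True, recursive=False) returns only the direct child elements.
root = soup.find("note")
child_names = [child.name for child in root.find_all(True, recursive=False)]
print(child_names)  # ['to', 'from', 'heading', 'body']
```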


Next let us find the specific tag.

xml_tag = soup.find('heading')
xml_tag

<heading>Reminder</heading>

  • soup.find('heading'): This call searches the BeautifulSoup object soup for the first XML element with the tag name 'heading' and returns a Tag object representing it. The result is assigned to the variable xml_tag.

  • xml_tag: Writing the variable name on its own line displays its value in a notebook, here the first 'heading' element found in the XML data.
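Because find returns only the first match, it helps to see how it compares with find_all when a tag occurs more than once. The document below is a hypothetical example with two note elements, not the data from the URL above.

```python
from bs4 import BeautifulSoup

# Hypothetical document containing two <note> elements.
sample = """<notes>
  <note><heading>Reminder</heading></note>
  <note><heading>Meeting</heading></note>
</notes>"""

soup = BeautifulSoup(sample, "xml")

first = soup.find("heading")             # first match only
all_headings = soup.find_all("heading")  # every match, in document order
missing = soup.find("footer")            # no such element in the sample

print(first.text)                      # Reminder
print([h.text for h in all_headings])  # ['Reminder', 'Meeting']
print(missing)                         # None
```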


Next let us just display the text.

xml_tag.text

'Reminder'

  • The xml_tag.text attribute, when applied to a BeautifulSoup Tag object, retrieves the text content of that specific XML element.
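Real-world XML often carries stray whitespace inside elements. As a small sketch, get_text(strip=True) can be used in place of .text to trim it; the padded sample below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical element whose text is padded with spaces and a newline.
soup = BeautifulSoup("<note><heading>  Reminder\n</heading></note>", "xml")
heading = soup.find("heading")

# .text returns the content as stored, including surrounding whitespace.
print(repr(heading.text))

# get_text(strip=True) trims leading and trailing whitespace.
print(repr(heading.get_text(strip=True)))  # 'Reminder'
```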

Next let us find another tag.

xml_tag = soup.find('note')
xml_tag

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

  • soup.find('note') returns the first XML element with the tag name 'note' in the data represented by the soup object. Here that is the root element of the document.

  • The result, xml_tag, is a Tag object that contains the whole 'note' element, including all of its child elements.


Next let us just display the text.

print(xml_tag.text)

Tove
Jani
Reminder
Don't forget me this weekend!

  • When applied to a parent element such as 'note', xml_tag.text concatenates the text of every descendant element, including the whitespace between them. This is why each child's value appears on its own line.
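The concatenated text loses track of which value came from which tag. One possible way to keep that association is to build a dictionary from the direct children of the element. This is a sketch on a hand-copied sample; the name fields is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hand-copied stand-in for the downloaded document.
sample = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

soup = BeautifulSoup(sample, "xml")
note = soup.find("note")

# Map each direct child's tag name to its stripped text.
fields = {
    child.name: child.get_text(strip=True)
    for child in note.find_all(True, recursive=False)
}
print(fields["to"])       # Tove
print(fields["heading"])  # Reminder
```

A dictionary like this assumes each child tag name appears only once under the parent; repeated names would overwrite earlier values.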


Final Thoughts

  • BeautifulSoup provides a user-friendly and Pythonic way to parse and navigate XML data. Its intuitive syntax makes it accessible to both beginners and experienced developers.

  • BeautifulSoup supports several parsers. For well-formed XML, pass 'xml' as the second argument to the constructor; this parser relies on the lxml package, so lxml must be installed.

  • You can traverse the hierarchical structure of XML documents using BeautifulSoup's methods like find, find_all, find_parent, find_next_sibling, and more. These methods make it easy to locate specific elements and attributes within the XML.

  • When working with real-world data, it's important to handle missing elements and unexpected structures. find returns None when no element matches, so check the result before reading .text; find_all returns an empty result list in the same situation.
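The navigation methods and the missing-element check mentioned in the points above can be sketched together on the same note document. The helper text_or_default and its default parameter are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup

# Hand-copied stand-in for the downloaded document.
sample = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

soup = BeautifulSoup(sample, "xml")
heading = soup.find("heading")

# Move up to the parent and sideways to neighbouring elements.
print(heading.find_parent().name)                # note
print(heading.find_next_sibling().name)          # body
print(heading.find_previous_sibling("to").text)  # Tove


def text_or_default(parent, tag_name, default=""):
    """Return the stripped text of the first matching tag, or a default."""
    element = parent.find(tag_name)
    if element is None:
        return default
    return element.get_text(strip=True)


print(text_or_default(soup, "heading"))                   # Reminder
print(text_or_default(soup, "signature", default="n/a"))  # n/a
```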

In summary, BeautifulSoup is a valuable tool for scraping and manipulating XML data in Python. It simplifies the process of working with structured data, making it accessible for a wide range of data extraction and analysis tasks. By mastering BeautifulSoup's capabilities, you can efficiently gather and process data from various XML sources to meet your specific requirements.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
