Scraping data using regular expressions is a common technique when you need to extract specific patterns or information from unstructured text. Regular expressions (regex or regexp) are powerful tools for pattern matching and text manipulation. You can use them in various programming languages like Python, JavaScript, and many others.
In this guide, we will explore how to use regular expressions to scrape data effectively. We'll discuss the key steps involved in the process, from defining the regex patterns to processing the extracted data.
You can watch the video-based tutorial with a step-by-step explanation down below.
Import Modules
from bs4 import BeautifulSoup
import requests
import re
BeautifulSoup - used for web scraping and parsing HTML and XML documents.
requests - used for making HTTP requests to web servers.
re - used for working with regular expressions.
Get the Data using URL
First, define the URL of the page whose content we want to fetch.
url = "https://www.imdb.com/chart/top/"
It creates a variable called url and stores the web address as its value.
Next we will define the headers.
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
This dictionary is typically used to store HTTP headers for making HTTP requests using libraries like requests in Python.
User-Agent: This is an HTTP header field that provides information about the user agent (i.e., the client making the request). In this case, the user agent string is set to mimic the behavior of a web browser running on an iPad with specific details:
Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148: This user agent string describes the device (iPad), its operating system (CPU OS 12_2 like Mac OS X), and the browser engine (AppleWebKit), along with a build number (Mobile/15E148). The '(KHTML, like Gecko)' part is a compatibility token indicating Gecko-like behavior, not that the Gecko engine is actually in use.
The purpose of setting a user agent in an HTTP request header like this is to provide information to the web server about the client making the request. Websites and web services often use this information to tailor their responses based on the type of client, such as displaying mobile-optimized content for mobile devices.
Next, send an HTTP GET request to the page.
# get page data
page = requests.get(url, headers=HEADERS)
page
<Response [200]>
This code snippet uses the requests.get method from the requests library to make an HTTP GET request to the specified url. It also includes the HEADERS dictionary in the request headers to specify the user agent.
page = requests.get(url, headers=HEADERS): This line sends an HTTP GET request to the URL specified by the url variable while including the user agent specified in the HEADERS dictionary. The response from the server is stored in the page variable.
page: This variable now contains the response object returned by the HTTP GET request. You can use this variable to access various properties of the response, such as the content, status code, headers, and more.
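To make these attributes concrete without another network call, here is a small sketch that builds a Response object by hand; the hand-assembled object and its contents are stand-ins for what requests.get actually returns:

```python
import requests

# Hand-built Response object -- a stand-in for the one returned by
# requests.get, used here only to demonstrate the common attributes.
resp = requests.models.Response()
resp.status_code = 200
resp.encoding = "utf-8"
resp._content = b"<html><title>Demo</title></html>"

print(resp.status_code)  # numeric HTTP status code
print(resp.ok)           # True for status codes below 400
print(resp.text)         # body decoded to a string
print(resp.content)      # raw body as bytes
```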
Next we will parse the data.
# parse the data
soup = BeautifulSoup(page.content, 'html.parser')
BeautifulSoup(page.content, 'html.parser'): Here, you pass two arguments to the BeautifulSoup constructor:
-> page.content: This is the raw content of the web page retrieved earlier with requests.get(url, headers=HEADERS). It contains the HTML of the page.
-> 'html.parser': This specifies the parser BeautifulSoup should use to parse the HTML content. 'html.parser' is a built-in parser provided by Python's standard library for parsing HTML documents.
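To see the parser in action on something small, here is a sketch that applies the same BeautifulSoup(..., 'html.parser') call to a made-up HTML snippet standing in for the downloaded page:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for the downloaded page
html = "<html><head><title>Demo</title></head><body><div class='note'>Hello</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)        # text of the <title> tag
print(soup.find("div").text)  # text of the first <div> tag
```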
Regex to find particular class
Next we will use the Beautiful Soup object (soup) to search for a <div> tag whose text content matches a regular expression pattern.
tag = soup.find('div', string=re.compile(r'by+'))
tag
<div class="ipc-title__description">IMDb Top 250 as rated by regular IMDb voters</div>
soup.find('div', string=re.compile(r'by+')): This line uses Beautiful Soup's find method to search for a <div> tag within the parsed HTML document (soup) that contains text matching the regular expression pattern re.compile(r'by+').
-> 'div': Specifies that you want to search for <div> tags.
-> string=re.compile(r'by+'): Specifies the search criteria. In this case, it's searching for a <div> tag whose text content matches the regular expression pattern r'by+'. This pattern matches the letter 'b' followed by one or more 'y' characters; since Beautiful Soup applies the pattern as a search, any tag whose text contains such a match (like the 'by' in 'rated by') qualifies.
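The pattern's behavior can be checked directly with the re module on a few sample strings (the strings here are invented for illustration):

```python
import re

pattern = re.compile(r'by+')  # 'b' followed by one or more 'y' characters

print(bool(pattern.search("rated by voters")))  # True: contains 'by'
print(bool(pattern.search("standbyyy")))        # True: 'b' + several 'y's
print(bool(pattern.search("bb yy")))            # False: no 'b' directly before a 'y'
```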
Next, to retrieve the text content of that HTML element, access the .text attribute of the tag object.
tag.text
'IMDb Top 250 as rated by regular IMDb voters'
tag.text returns the text content of the tag element as a plain string, without the surrounding HTML markup.
Using Regular Expression
Next, use the re.findall function with a regular expression pattern to search for and extract the text enclosed within <title> HTML tags in the page.text string.
re.findall(r'<title>(.*?)</title>', page.text)
['IMDb Top 250 Movies', 'IMDb, an Amazon company']
This code uses the re.findall function to search for all non-overlapping matches of the regular expression pattern r'<title>(.*?)</title>' within the page.text string.
r'<title>(.*?)</title>': This regular expression pattern is designed to match text enclosed within <title> HTML tags. Here's a breakdown of its components:
-> <title>: Matches the opening <title> tag.
-> (.*?): This is a non-greedy capture group that captures the text between the <title> and </title> tags. The .*? part matches any character except a newline (.) zero or more times (*) in a non-greedy manner (i.e., it matches as few characters as possible before the closing tag).
-> </title>: Matches the closing </title> tag.
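The non-greedy behavior matters when a page contains more than one such element. A quick sketch on a made-up HTML string contrasts the non-greedy and greedy versions:

```python
import re

# Made-up HTML with two <title> elements to show the difference
html = "<title>First</title><p>x</p><title>Second</title>"

lazy = re.findall(r'<title>(.*?)</title>', html)
print(lazy)    # two separate matches: 'First' and 'Second'

greedy = re.findall(r'<title>(.*)</title>', html)
print(greedy)  # one match spanning from the first <title> to the last </title>
```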
Next let us see how we can get the entire tag value.
re.findall(r'(<title>(.*?)</title>)', page.text)
re.findall(r'(<title>(.*?)</title>)', page.text): This code uses the re.findall function to find all non-overlapping matches of the regular expression pattern r'(<title>(.*?)</title>)' within the page.text string.
r'(<title>(.*?)</title>)': This regular expression pattern captures the entire <title> element, including both the opening and closing <title> tags, as well as the text content within the tags. Here's a breakdown of its components:
-> <title>: Matches the opening <title> tag.
-> (.*?): This is a non-greedy capturing group that matches the text (any characters except newlines) between the <title> and </title> tags.
-> </title>: Matches the closing </title> tag.
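Note that when a pattern contains more than one capturing group, re.findall returns a list of tuples, with one element per group. A small sketch on an invented string shows the shape of the result:

```python
import re

html = "<title>IMDb Top 250 Movies</title>"
matches = re.findall(r'(<title>(.*?)</title>)', html)
print(matches)
# [('<title>IMDb Top 250 Movies</title>', 'IMDb Top 250 Movies')]
```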
Final Thoughts
Regular expressions offer a high degree of flexibility for pattern matching. You can extract specific data patterns, such as email addresses, phone numbers, URLs, and more, from text data with precision.
Regular expressions can be very fast for simple patterns, but they can become inefficient and slow for complex patterns or large datasets. It's essential to optimize your regular expressions when working with extensive text data.
When using regular expressions, it's crucial to implement error handling to handle cases where patterns do not match or where unexpected input is encountered.
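One common way to handle a non-match is to check the return value of re.search before using it; below is a minimal sketch with an invented helper name:

```python
import re

def extract_title(html):
    """Return the <title> text, or None when the pattern does not match."""
    match = re.search(r'<title>(.*?)</title>', html)
    return match.group(1) if match else None

print(extract_title("<title>Demo</title>"))   # 'Demo'
print(extract_title("<p>no title here</p>"))  # None
```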
Beautiful Soup provides a Pythonic way to parse and navigate structured documents, making it easier to access specific elements and attributes within the document's tree-like structure. When coupled with regular expressions, it becomes even more versatile, enabling you to fine-tune your data extraction efforts by targeting precise patterns within the text content.
In summary, while regular expressions are a valuable tool for scraping data from unstructured text, they should be used judiciously and in conjunction with other tools and techniques when dealing with more complex or structured data. Understanding their strengths and limitations is essential for successful data scraping projects.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm