
Scraping Multimedia Files using Beautiful Soup | Web Scraping | Python

In the digital age, multimedia content plays a pivotal role in our online experiences, from images and videos to audio files. Whether you're a researcher, data analyst, or simply an enthusiast looking to gather and analyze multimedia data from the web, web scraping can be a powerful tool. Beautiful Soup, a popular Python library, offers a robust solution for scraping multimedia files from websites with ease and precision.

Scraping Multimedia Files using Beautiful Soup

Multimedia scraping, a form of web scraping, involves the automated extraction of media files, such as images, videos, and audio clips, from web pages. This process allows you to harness the vast wealth of multimedia content available on the internet for purposes such as content curation, data analysis, machine learning, or creating personalized collections.



You can watch the video-based tutorial with a step-by-step explanation below.


Import Modules

from bs4 import BeautifulSoup
import requests
  • BeautifulSoup - used for web scraping and parsing HTML and XML documents.

  • requests - used for making HTTP requests to web servers.


Get Page Content from URL


First define the URL of the page whose content we want to retrieve.

url = 'https://www.thehindu.com/news/national/coronavirus-live-updates-may-29-2021/article34672944.ece?homepage=true'
  • It creates a variable called url and stores the web address as its value.


Next send an HTTP GET request to the page.

page = requests.get(url)
page

<Response [200]>

  • page = requests.get(url): This line sends an HTTP GET request to the URL specified by the variable url. The requests.get() function is used to make a GET request to the specified URL, and the response from the server is stored in the variable page. This response typically includes the HTML content of the web page, along with various metadata.

  • page: Evaluating page on its own line (for example in a Jupyter notebook) displays the response object, <Response [200]>, which means the request succeeded with HTTP status code 200. A simple way to verify this in code is sketched below.
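
Before parsing, it can help to confirm that the request actually succeeded; a minimal check, not part of the original tutorial:

# raise an exception if the server returned an error status (e.g. 404 or 500)
page.raise_for_status()
print(page.status_code)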


Next we will parse the data.

# parse the data
soup = BeautifulSoup(page.content, 'html.parser')
  • BeautifulSoup(page.content, 'html.parser'): Here, you pass two arguments to the BeautifulSoup constructor:

-> page.content: This is the content of the web page that you retrieved using requests.get(url) earlier. It contains the HTML content of the page.

-> 'html.parser': This specifies the parser BeautifulSoup should use to parse the HTML content; 'html.parser' is the HTML parser built into Python's standard library. A quick sanity check on the parsed result is sketched below.
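
As a quick sanity check (not part of the original tutorial), you can print the parsed page's title to confirm the document was read correctly:

# print the text inside the page's <title> tag
print(soup.title.get_text())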


Next we will find the image source link.

# find the image src link
img_tag = soup.find('source')
img_tag
  • We use soup.find('source') to locate the first <source> tag in the HTML content.

  • The <source> tag is used inside HTML5 <picture>, <video>, and <audio> elements to specify alternative media sources; on this page it carries the article's lead image through its srcset attribute. An alternative lookup using ordinary <img> tags is sketched below.
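
For pages that serve pictures through ordinary <img> tags, the lookup would look like this (a sketch, not part of the original tutorial):

# alternative: list the src attribute of every <img> tag on the page
for img in soup.find_all('img'):
    print(img.get('src'))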


Next we will access the srcset attribute of the img_tag variable.

img_tag['srcset']
img_url = img_tag['srcset']
  • The srcset attribute is commonly used in HTML to specify multiple sources for an image, each with different resolutions or sizes.

  • This attribute is often used for responsive web design, where the browser can choose the most appropriate image source based on the user's device and screen size. Handling a srcset that lists several candidates is sketched below.
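
If the srcset attribute lists several comma-separated candidates (each a URL followed by a size descriptor), one simple approach is to keep only the URL of the first candidate; a sketch:

# keep only the URL portion of the first srcset candidate
img_url = img_tag['srcset'].split(',')[0].split()[0]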


Download the Image from URL


Next we will download the image.

image = requests.get(img_url)
  • requests.get(img_url): This line uses the requests library to send an HTTP GET request to the URL specified by img_url. The img_url variable should contain the URL of the image you want to download.

  • The response from the HTTP GET request is stored in the image variable. This response will include not only the image data but also additional information such as the HTTP status code, headers, and more.


Next we will store the image in a file.

# store the image in file
with open('image.jpg', 'wb') as file:
    for chunk in image.iter_content(chunk_size=1024):
        file.write(chunk)
  • with open('image.jpg', 'wb') as file: This line opens a file named "image.jpg" in binary write mode ('wb'). The with statement ensures that the file is properly closed after the code block is executed, even if an error occurs.

  • for chunk in image.iter_content(chunk_size=1024): This line iterates over the content of the image response object in chunks of 1024 bytes (1 kilobyte) each, so the image is written to disk in smaller pieces. To also stream the download itself for very large files, the request needs stream=True (a streaming variant is sketched after this list).

  • file.write(chunk): Within the loop, each chunk of data is written to the file using the file.write() method. This accumulates the image data in the file as it's received from the response.
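
For very large files, you can ask requests to stream the response instead of loading it all into memory first; a variant of the same download, reusing img_url:

# stream the response body rather than loading it fully into memory
image = requests.get(img_url, stream=True)
with open('image.jpg', 'wb') as file:
    for chunk in image.iter_content(chunk_size=1024):
        if chunk:  # skip keep-alive chunks
            file.write(chunk)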


Download PPT from URL


Next we will download the ppt.

  • This fetches the PowerPoint file from its URL with requests.get() and stores the response in the variable ppt; the request itself is sketched below.
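
A minimal sketch of that request, assuming a direct link to a .pptx file (the URL below is only a placeholder, not the one used in the video):

# placeholder URL - replace with a direct link to a .pptx file
ppt_url = 'https://example.com/sample.pptx'
ppt = requests.get(ppt_url)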


Next we will store the ppt file.

with open('sample.pptx', 'wb') as file:
    for chunk in ppt.iter_content(chunk_size=1024):
        file.write(chunk)
  • with open('sample.pptx', 'wb') as file: This line uses a with statement to open a local file named "sample.pptx" in binary write mode ('wb'). The with statement ensures that the file is properly closed after the code block is executed. The file variable represents the open file.

  • for chunk in ppt.iter_content(chunk_size=1024): This line starts a loop that iterates over the content of the ppt response object obtained from an HTTP request. The iter_content method is used to stream the content in chunks. In this case, each chunk has a size of 1024 bytes (1 kilobyte).

  • file.write(chunk): Inside the loop, each chunk of data obtained from the HTTP response is written to the local file using the write method of the file object. This accumulates the downloaded data in the file.


Download Video from URL


Next we will download the video.

  • This fetches the video file from its URL with requests.get() and stores the response in the variable video; the request itself is sketched below.
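
Again, a minimal sketch of the request with a placeholder URL:

# placeholder URL - replace with a direct link to an .mp4 file
video_url = 'https://example.com/BigRabbit.mp4'
video = requests.get(video_url)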


Next we will store the video file.

with open('BigRabbit.mp4', 'wb') as file:
    for chunk in video.iter_content(chunk_size=1024):
        file.write(chunk)
  • with open('BigRabbit.mp4', 'wb') as file: This line uses a with statement to open a local file named "BigRabbit.mp4" in binary write mode ('wb'). The with statement ensures that the file is properly closed after the code block is executed. The file variable represents the open file.

  • for chunk in video.iter_content(chunk_size=1024): This line starts a loop that iterates over the content of the video response object obtained from an HTTP request. The iter_content method is used to stream the content in chunks. In this case, each chunk has a size of 1024 bytes (1 kilobyte).

  • file.write(chunk): Inside the loop, each chunk of data obtained from the HTTP response is written to the local file using the write method of the file object. This accumulates the downloaded data in the file.

  • As the loop iterates through the chunks of data, it writes the video to the local file in 1 KB pieces. Note that this only avoids holding the whole file in memory if the request was made with stream=True; without it, requests downloads the full response body before iter_content runs. A reusable helper that wraps this streaming pattern is sketched below.
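
Since the same download-and-write pattern repeats for images, PPTs, and videos, it can be wrapped in a small helper. The function below is our own sketch, not part of the original tutorial:

# reusable sketch: download any file from a URL to a local path in chunks
def download_file(url, path, chunk_size=1024):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early on HTTP errors
    with open(path, 'wb') as file:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                file.write(chunk)

download_file(img_url, 'image.jpg')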


Final Thoughts

  • Ensure that your scraping activities are ethical and legal. Scraping should be done responsibly and for legitimate purposes, such as research, analysis, or personal use. Avoid scraping private or sensitive data without proper authorization.

  • Beautiful Soup is a versatile library for parsing HTML and XML documents, but it may not handle complex multimedia content extraction on its own. You may need to combine it with other libraries or techniques to handle different types of media files, such as images, videos, or audio.

  • Some websites load multimedia content dynamically using JavaScript. In such cases, you may need to use tools like Selenium to automate interaction with web pages and scrape content after it has been dynamically loaded.

  • Consider how you will store and organize the multimedia files you scrape. Depending on the volume of data, you may need a structured storage system or database; a minimal folder-layout sketch follows.
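
For example, a simple layout might keep each media type in its own folder; the folder names below are only illustrative:

import os

# create one folder per media type before saving files into them
for folder in ('images', 'videos', 'documents'):
    os.makedirs(folder, exist_ok=True)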

In summary, scraping multimedia files using Beautiful Soup can be a valuable tool for collecting and analyzing web content, but it should be done with care, respect for ethical and legal considerations, and a robust approach to handling different types of media files and dynamic web content.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
