Uncover the secrets of efficiently scraping Amazon product data using Selenium in Python. This comprehensive tutorial guides you through extracting product data, empowering you to gather valuable information for various applications. Whether you're a beginner or seeking to refine your web scraping skills, this guide provides step-by-step instructions. Elevate your Python proficiency and learn to navigate dynamic e-commerce websites. Join us on this educational journey to master product scraping with Selenium in Python!
In this tutorial, we will explore the process of scraping product data from Amazon using Selenium. We will discuss the essential components, steps, and best practices involved in building a web scraper that can extract product information such as titles, prices, descriptions, and customer reviews.
You can watch the video-based tutorial with a step-by-step explanation down below.
Import Modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
webdriver - the Selenium component used to automate interactions with web browsers.
selenium.webdriver.common.by - provides a set of locator strategies to find and interact with web elements on a web page.
sleep - used to introduce a delay or pause in the execution of a program or script.
Set path for Webdriver
We will set the path of our ChromeDriver executable file.
path = 'C:\\Chromedriver.exe'
Give the path to your own file here. My ChromeDriver file is on the C drive.
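As an aside, if you would rather not download and manage the driver file yourself, the third-party webdriver-manager package can fetch a matching driver automatically. This is an optional sketch and assumes you have installed the package with pip install webdriver-manager:
# optional: let webdriver-manager download a ChromeDriver matching your Chrome version
from webdriver_manager.chrome import ChromeDriverManager

path = ChromeDriverManager().install()  # returns the path to the downloaded driver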
Next, create the browser.
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service(path))  # Selenium 4 syntax; Selenium 3 used executable_path=path
We are using the Chrome browser in this tutorial. You can use a different browser as per your requirement.
This is a fundamental step when using Selenium for web automation, testing, or web scraping with Google Chrome.
Chrome() is a constructor for creating a WebDriver instance that controls the Google Chrome browser. When you call webdriver.Chrome(), it initializes a new Chrome browser window or tab that can be controlled programmatically using Selenium.
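If you need more control over how Chrome launches, you can pass a ChromeOptions object to the constructor. The sketch below is optional and shows two commonly used arguments; the flag values assume a recent version of Chrome:
# optional sketch: customize Chrome before launching it
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless=new')           # run without a visible browser window
options.add_argument('--window-size=1920,1080')  # fix the viewport size

browser = webdriver.Chrome(service=Service(path), options=options)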
Next, we will open the URL in the browser that we have created.
# load the webpage
browser.get('https://www.amazon.in')
We use browser.get() to navigate to the specified URL.
After opening the URL, you can proceed to interact with the web page, extract data, or perform other tasks using Selenium.
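As a quick optional check that the page actually loaded, you can print a couple of properties the driver exposes:
# optional sanity check after navigation
print(browser.title)        # the page title reported by the browser
print(browser.current_url)  # the final URL after any redirects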
Next we will maximize the browser window.
browser.maximize_window()
This method is used to maximize the browser window so that it takes up the entire screen, making it useful when you want to interact with a web page in its full view.
By maximizing the browser window, you ensure that the web page is displayed in its largest possible view, which can be particularly useful for testing or scraping websites that have responsive designs or elements that may behave differently depending on the size of the browser window.
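If you need a specific viewport rather than a maximized one, Selenium also provides explicit sizing methods, as in this optional sketch:
# alternatives to maximize_window() when a fixed viewport is required
browser.set_window_size(1366, 768)   # set an exact width and height in pixels
browser.set_window_position(0, 0)    # move the window to the top-left corner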
Next we will find the required elements.
# get the input elements
input_search = browser.find_element(By.ID, 'twotabsearchtextbox')
search_button = browser.find_element(By.XPATH, "(//input[@type='submit'])[1]")
input_search is being used to find an input element with the ID 'twotabsearchtextbox'. This is commonly used to locate the search input box on many websites, including Amazon.
search_button is being used to find an input element with the XPath "(//input[@type='submit'])[1]". This appears to be used to locate a submit button, possibly the search button on the Amazon website.
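A note on robustness: find_element fails immediately if the element has not rendered yet. A common, more resilient alternative is an explicit wait, sketched below with the same locators used above:
# sketch: wait for the elements instead of assuming they are already present
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)  # wait up to 10 seconds before giving up
input_search = wait.until(EC.presence_of_element_located((By.ID, 'twotabsearchtextbox')))
search_button = wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@type='submit'])[1]")))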
Next, let us send the input to the web page.
# send the input to the webpage
input_search.send_keys("Smartphones under 10000")
sleep(1)
search_button.click()
input_search.send_keys("Smartphones under 10000"): This line uses Selenium to interact with the web page. It takes the input element stored in the variable input_search and simulates typing the text "Smartphones under 10000" into it. Essentially, it's like a virtual user typing this search query into a search bar on a web page.
sleep(1): This line of code is introducing a pause in the execution of your script for 1 second. It uses the sleep function from the time module. This pause allows the web page a moment to process the input text you entered in the previous step before proceeding to the next action. It's a common practice to include small delays like this in web automation scripts to ensure that the web page has time to respond to your actions.
search_button.click(): This line simulates a click on the button located earlier. After typing the search query in the input field, you typically need to trigger the search action by clicking a button, such as a "Search" or "Submit" button.
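Alternatively, you can skip locating the button entirely and submit the query by pressing Enter inside the search box, as in this optional sketch:
# alternative sketch: press Enter in the search box instead of clicking the button
from selenium.webdriver.common.keys import Keys

input_search.send_keys("Smartphones under 10000")
input_search.send_keys(Keys.RETURN)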
Scrape the Products from Amazon
Next we will scrape the product details from Amazon.
products = []
for i in range(10):
    print('Scraping page', i+1)
    product = browser.find_elements(By.XPATH, "//span[@class='a-size-medium a-color-base a-text-normal']")
    for p in product:
        products.append(p.text)
    next_button = browser.find_element(By.XPATH, "//a[text()='Next']")
    next_button.click()
    sleep(2)
Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
products = []: Initializes an empty list called 'products' to store the scraped product names.
The script then enters a loop that will iterate 10 times, representing 10 pages of search results.
Inside the loop, it prints a message to indicate which page is being scraped.
product = browser.find_elements(By.XPATH, "//span[@class='a-size-medium a-color-base a-text-normal']"): This line finds all elements on the current page that match the specified XPath expression. These elements are assumed to be product names.
It then enters another loop to iterate through the found product elements (p) and appends their text (product names) to the 'products' list.
next_button.click(): It clicks the 'Next' button to navigate to the next page of search results.
sleep(2): Introduces a 2-second delay to allow the next page to load before scraping. This delay helps ensure that the script doesn't try to scrape the page before it has fully loaded.
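One caveat: if Amazon returns fewer than 10 pages of results, find_element will raise NoSuchElementException when the 'Next' link disappears. A defensive variant of the same loop, sketched under that assumption:
# defensive sketch: stop cleanly when there is no 'Next' link to click
from selenium.common.exceptions import NoSuchElementException

products = []
for i in range(10):
    print('Scraping page', i + 1)
    for p in browser.find_elements(By.XPATH, "//span[@class='a-size-medium a-color-base a-text-normal']"):
        products.append(p.text)
    try:
        browser.find_element(By.XPATH, "//a[text()='Next']").click()
    except NoSuchElementException:
        print('No more pages to scrape.')
        break
    sleep(2)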
Next, let us check the length of the products list.
len(products)
186
If you want to determine the number of products that have been scraped and stored in the products list, you can use the len() function in Python.
The length of the list is 186.
Next retrieve the first 5 elements from the list.
products[:5]
['Lava Blaze 2 (6GB RAM, 128GB Storage) - Glass Blue | 18W Fast Charging | 6.5 inch 90Hz Punch Hole Display | Side Fingerprint Sensor | Upto 11GB Expandable RAM | 5000 mAh Battery',
'Lava Yuva 2 Pro (Glass Lavender, 4GB RAM, 64GB Storage)| 2.3 Ghz Octa Core Helio G37| 13 MP AI Triple Camera |Fingerprint Sensor| 5000 mAh Battery| Upto 7GB Expandable RAM',
'realme narzo N53 (Feather Black, 4GB+64GB) 33W Segment Fastest Charging | Slimmest Phone in Segment | 90 Hz Smooth Display',
'realme narzo 50i Prime (Dark Blue 4GB RAM+64GB Storage) Octa-core Processor | 5000 mAh Battery',
'Redmi A2 (Aqua Blue, 2GB RAM, 32GB Storage) | Powerful Octa Core G36 Processor | Upto 7GB RAM | Large 16.5 cm HD+ Display with Massive 5000mAh Battery | 2 Years Warranty [Limited time Offer]']
To retrieve the first 5 products from the products list, you can use list slicing in Python.
This code snippet will extract and print the first 5 products from the products list.
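If you want to keep the results, a natural next step is writing the list to a file. This is an optional sketch using Python's standard csv module; the filename products.csv is just an example:
# optional sketch: save the scraped names to a CSV file
import csv

with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['product_name'])         # header row
    writer.writerows([p] for p in products)   # one product name per row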
Next we will close the browser that we created at the start of the tutorial.
browser.quit()
browser.quit() is called to close the browser window or tab and release any resources associated with it.
It's good practice to include browser.quit() at the end of your Selenium script to clean up after your automation tasks, ensuring that the browser doesn't remain open unnecessarily. This helps in preventing memory leaks and ensuring that your script doesn't interfere with subsequent browser sessions or other processes.
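A common pattern, sketched below, is to wrap the scraping steps in try/finally so the browser is closed even if an exception occurs midway:
# sketch: guarantee cleanup even when a step fails
browser = webdriver.Chrome(service=Service(path))
try:
    browser.get('https://www.amazon.in')
    # ... search and scraping steps from above go here ...
finally:
    browser.quit()  # always runs, so no stray Chrome windows are left behind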
Final Thoughts
Amazon has strict terms of service that prohibit automated scraping of its website. While many people do scrape Amazon, doing so without permission can lead to legal consequences or IP blocking. Always check Amazon's robots.txt file and terms of service for scraping restrictions and consider reaching out to Amazon for permission if necessary.
Ensure that your code uses reliable XPath or CSS selectors to locate elements on Amazon's website. Amazon's HTML structure may change over time, so make your selectors as robust as possible to handle these changes gracefully.
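For example, anchoring on a data attribute tends to survive styling changes better than a long class chain. The CSS selector below is purely illustrative; verify it against the live page before relying on it:
# hypothetical sketch: a selector anchored on a data attribute rather than CSS classes
titles = browser.find_elements(By.CSS_SELECTOR, "div[data-component-type='s-search-result'] h2 span")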
Implement error handling and monitoring mechanisms to detect issues with your scraping script and ensure it continues running smoothly.
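A minimal sketch of such a mechanism: a small retry helper that logs each failed attempt before giving up (the helper name and parameters are illustrative, not part of Selenium):
# sketch: retry a fragile lookup a few times and log each failure
import logging
from time import sleep

logging.basicConfig(level=logging.INFO)

def find_with_retry(driver, by, value, attempts=3, delay=2):
    for attempt in range(1, attempts + 1):
        try:
            return driver.find_element(by, value)
        except Exception as exc:
            logging.warning('Attempt %d failed for %s: %s', attempt, value, exc)
            sleep(delay)
    raise RuntimeError(f'Element not found after {attempts} attempts: {value}')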
This tutorial covered how to use Selenium for web scraping on Amazon, empowering you to gather valuable insights, track competitors, and make informed decisions in the dynamic world of e-commerce. However, it is crucial to emphasize the importance of responsible and ethical web scraping practices, as misuse can have legal consequences and damage your reputation.
In summary, web scraping from Amazon using Selenium can be a valuable tool, but it requires careful planning, ethical considerations, and adherence to Amazon's terms and policies. Always prioritize responsible and respectful scraping practices to avoid potential legal or technical issues.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm