
How to Deploy a Trained ML Model in Cloud

As machine learning continues to drive innovation across industries, the ability to efficiently deploy trained models into production has become a critical skill. Once a model is developed and trained to solve a particular problem, the next step is to make it accessible for real-world applications. Cloud platforms provide the perfect infrastructure for deploying machine learning models, offering scalability, flexibility, and ease of management.

Deploy ML model in Google Cloud

In this article, we'll explore the steps required to deploy a trained machine learning model in the cloud, covering popular platforms like AWS, Google Cloud, and Azure. Whether you're looking to serve real-time predictions via APIs or need to scale your model to handle large amounts of data, this guide will walk you through the entire process—from setting up your cloud environment to monitoring your model's performance.



You can watch the video-based tutorial with a step-by-step explanation down below.



Please refer to this article to create a Flask web app before proceeding further.

Project Setup


First, we will see the project folder setup.

  • The project structure, as illustrated above, consists of several key components imported from the Iris Flask web app tutorial. These include the templates, the deploy.py file, the training notebook, the Iris dataset, and the trained model.

  • Deploying Flask applications for production environments requires careful consideration of scalability and performance.

  • To deploy this application in a public cloud environment, modifications are required to the existing deploy.py file. By default, Flask's built-in development server is single-threaded, which is sufficient for development but cannot handle multiple concurrent requests in a production setting.

  • In such scenarios, using a production-grade WSGI server such as Waitress becomes essential to ensure reliable performance. For reference, a minimal sketch of deploy.py is shown below.
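
For reference, a minimal version of deploy.py might look like the following sketch. The route names, the index.html template, and the model.pkl filename are illustrative assumptions based on a typical Iris Flask app, so your file from the earlier tutorial may differ.

# deploy.py (minimal sketch; route names and filenames are assumptions)
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained Iris model saved during training (filename assumed)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def home():
    # index.html is assumed to be in the templates/ folder
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Read the four Iris measurements submitted from the form
    features = [
        float(request.form["sepal_length"]),
        float(request.form["sepal_width"]),
        float(request.form["petal_length"]),
        float(request.form["petal_width"]),
    ]
    prediction = model.predict([features])[0]
    return render_template("index.html", prediction=str(prediction))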


Waitress Setup for Production


First, we will see how to run a web application using the Waitress WSGI server. Waitress is a production-ready server for WSGI applications, commonly used to serve Python web frameworks like Flask or Django. Save the following launcher script as main.py; this filename is referenced again during deployment.

from waitress import serve
import multiprocessing

from deploy import app  # the Flask app from the Iris tutorial

if __name__ == "__main__":
    # Use all CPU cores but one, and always keep at least one thread
    num_cpus = multiprocessing.cpu_count()
    threads_per_worker = max(1, num_cpus - 1)
    print("Threads:", threads_per_worker)
    print('ServerStarted')
    serve(
        app,
        host='0.0.0.0',   # listen on all interfaces, not just localhost
        port=8080,        # Cloud Run's default container port
        threads=threads_per_worker,
    )
Server App
Logs generated
  • Import Statements

    • from waitress import serve: This imports the serve function from the Waitress library. Waitress is a simple, high-performance server for Python web applications.

    • from deploy import app: This imports the app object from the deploy module. Here, app is the WSGI-compatible Flask application built in the Iris tutorial.

    • import multiprocessing: This imports Python's multiprocessing module, which is used here to get the number of CPU cores on the system.

  • Main Block

    • This block ensures that the code inside it is only executed if the script is run directly (not imported as a module in another script). This is a standard Python practice to control when certain parts of the code should be executed.

  • Determine CPU Count

    • num_cpus = multiprocessing.cpu_count(): This line retrieves the total number of available CPU cores on the machine using multiprocessing.cpu_count(). The number of CPUs can be useful for determining how many threads to run for optimal performance when handling web requests.

  • Calculate Number of Threads

    • Next, the script calculates the number of threads for the server. It subtracts 1 from the total number of CPUs to leave at least one core free for other tasks (like system processes).

    • The max(1, num_cpus-1) function ensures that at least 1 thread is always available, even if the system has only one CPU core.

  • Log Output

    • Next, the script prints the calculated number of threads and a message indicating that the server has started.

  • Start the Waitress Server

    • serve(): This function starts the Waitress WSGI server and serves the app object (your web application).

    • app: This is the application you are deploying (the Flask app imported from deploy).

    • host='0.0.0.0': This makes the server listen on all network interfaces, allowing the web app to be accessed externally, not just from localhost.

    • port=8080: The server will listen for HTTP requests on port 8080.

    • threads=threads_per_worker: This sets the number of threads for handling requests, as determined by the earlier calculation.
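
Before deploying, you can sanity-check the server locally. Start it with python main.py, then send a request from a separate terminal or script. The snippet below is a minimal check that assumes the app serves a page at the root URL.

# Quick local smoke test (assumes the app has a "/" route)
import urllib.request

with urllib.request.urlopen("http://localhost:8080/") as response:
    print("Status:", response.status)   # expect 200 if the app is up
    print(response.read(200).decode())  # first few bytes of the page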



Deploy in Google Cloud


Let us see the steps to deploy the model in Google Cloud:

  • Create a Zip File of the Project

    • Compress the folder containing all your code, including main.py, the templates, the trained model, and a requirements.txt listing the project's dependencies, into a zip file (e.g., deploy.zip).
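
Cloud Run builds the container from this source folder, so a requirements.txt should name every package the app imports. A minimal example is sketched below; the exact packages depend on how the model was trained (scikit-learn is an assumption here).

# requirements.txt (illustrative; match it to your actual imports)
flask
waitress
scikit-learn
numpy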

  • Login to Your Google Cloud Account

    Google Cloud console
    Cloud Shell
    • Go to Google Cloud Console.

    • Click on Activate Cloud Shell (a terminal icon in the top right corner).

  • Upload the Zip File

    • Click the "Upload" button in the Cloud Shell terminal to upload your deploy.zip file.

  • Extract the Zip File in Cloud Shell

    • Extract the contents of the zip file using the following commands in the terminal.

unzip deploy.zip
cd iris_deploy/
ls
unzipped files
  • These commands unzip the archive, move into the iris_deploy directory, and list its contents.

  • Deploy the Model Using Google Cloud Run

    • Run the following command to start the deployment using Google Cloud Run:

gcloud run deploy
prompts after executing gcloud run deploy
  • Follow the prompts:

    • Enter Service Name: Provide a name for your service (e.g., iris-prediction-service).

    • Enter the Region: Choose the region where you want to deploy (e.g., us-central1). The same options can also be passed as command-line flags, as shown in the sketch below.
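
If you prefer to skip the interactive prompts, the service name and region can be supplied as flags (the values here are just the examples from above; --allow-unauthenticated makes the URL publicly reachable, which the test step later relies on):

gcloud run deploy iris-prediction-service --source . --region us-central1 --allow-unauthenticated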

  • Automatic Docker Build and Deployment

    • Google Cloud Run will automatically build a Docker container image in the background from the contents of your project folder.

    • It will then deploy the container, which will run your main.py file. (See the note after these steps on making the entrypoint explicit.)

  • Wait for Deployment

    • The entire process typically takes around 10-15 minutes. Once completed, Google Cloud Run will provide a URL where your deployed model can be accessed.
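
A note on the entrypoint: Google's Python buildpack decides how the container starts. If your build does not pick up main.py automatically, adding a one-line Procfile to the project folder before zipping makes the entrypoint explicit. This is a standard buildpack convention rather than a step shown above; the Procfile would contain:

web: python main.py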



Test the Deployment


Next, we will test the deployment.

Deployed model
  • After deployment, Google Cloud Run provides a public URL for your service. It will look something like https://your-service-name-region.a.run.app.

  • Copy this URL, as it will be used to send requests to your deployed model, as in the example below.
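
As a quick end-to-end check, you can send a sample Iris measurement to the service from Python. The URL below is a placeholder, and the /predict route and form field names follow the deploy.py sketch shown earlier, so adjust them to match your actual app.

# Smoke test against the deployed service (URL and names are assumptions)
import urllib.parse
import urllib.request

SERVICE_URL = "https://your-service-name-region.a.run.app"  # placeholder

data = urllib.parse.urlencode({
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}).encode()

with urllib.request.urlopen(SERVICE_URL + "/predict", data=data) as response:
    print("Status:", response.status)      # expect 200 on success
    print(response.read().decode()[:200])  # start of the rendered page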



Final Thoughts

  • Deploying machine learning models in the cloud is a crucial step towards making AI-driven solutions available in real-world applications.

  • Cloud platforms like Google Cloud offer flexibility, scalability, and managed services that simplify this deployment process, enabling you to focus on creating powerful models rather than managing infrastructure.

  • Whether you're using Google Cloud Run for easy deployment with automatic scaling or setting up more customized solutions with Compute Engine, each approach has its unique advantages that can be tailored to your use case.

  • By leveraging cloud services, you ensure that your models are accessible, responsive, and capable of handling varying levels of demand.

Ultimately, deploying ML models in the cloud is not just about making them available but also about ensuring reliability, ease of access, and performance. With the steps covered in this guide, you are now equipped to take your machine learning models from development to production, allowing them to deliver insights and value where it truly matters—at scale, and in the hands of users.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
