The CartPole balance problem is the classic inverted pendulum problem: a cart moves along a horizontal track, and the objective is to keep the pole mounted on it balanced upright, here using Python. Instead of applying control theory, the goal is to solve it through controlled trial and error, also known as reinforcement learning.
In this tutorial, we will use the OpenAI Gym module, together with Stable Baselines3, to simulate, train, and evaluate an agent on the CartPole environment.
You can watch the video-based tutorial with a step-by-step explanation down below.
Project Description
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
# install modules
!pip install gym stable_baselines3
Installs the modules needed for this tutorial: Gym provides the environment and Stable Baselines3 provides the learning algorithm.
Stable Baselines3 is built on PyTorch, so PyTorch must be installed beforehand.
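If PyTorch is not already installed, a minimal install command looks like the following (CPU build; see the official PyTorch installation guide for GPU-specific commands):
!pip install torch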
Import Modules
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
PPO - Proximal Policy Optimization, the algorithm used to train the policy in this environment.
DummyVecEnv - a vectorized environment wrapper that lets Stable Baselines3 drive one or more environment instances through a single interface (it steps them sequentially in the same process).
evaluate_policy - a predefined helper to evaluate the current model over a number of episodes.
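For intuition, here is a minimal sketch (not part of the tutorial itself) of how DummyVecEnv can stack several environment instances behind one vectorized interface:
# illustrative only: four CartPole copies behind a single vectorized interface
vec_env = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(4)])
obs = vec_env.reset()                                    # array of shape (4, 4): one observation per environment
obs, rewards, dones, infos = vec_env.step([0, 1, 0, 1])  # one action per environment
vec_env.close()
In this tutorial we wrap only a single environment, so the list passed to DummyVecEnv contains just one function.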
Environment Testing with Random Actions
env_name = 'CartPole-v0'
env = gym.make(env_name)
Creation of the simulation environment
for episode in range(1, 11):
    score = 0
    state = env.reset()
    done = False

    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward

    print('Episode:', episode, 'Score:', score)
env.close()
Episode: 1 Score: 19.0
Episode: 2 Score: 17.0
Episode: 3 Score: 77.0
Episode: 4 Score: 36.0
Episode: 5 Score: 17.0
Episode: 6 Score: 22.0
Episode: 7 Score: 18.0
Episode: 8 Score: 25.0
Episode: 9 Score: 11.0
Episode: 10 Score: 15.0
A render window displays the CartPole simulation for each episode.
The loop stops after iterating through the 10 episodes.
score - running total of the reward collected during the episode
state = env.reset() - resets the environment to its initial state at the start of each episode
env.render() - renders the environment so you can watch the simulation
env.action_space - a discrete space with two actions, 0 (push left) and 1 (push right); sample() returns one of them at random
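To see what the agent works with, you can inspect the spaces directly (illustrative; the comments describe the standard CartPole-v0 spaces):
print(env.action_space)           # Discrete(2)
print(env.observation_space)      # Box with 4 values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.sample())  # a random action, 0 or 1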
Model Training
env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose=1)
Using cuda device
Creates the environment again and wraps it in a DummyVecEnv for training.
'MlpPolicy' - a multilayer perceptron policy that defines how observations are mapped to actions and value estimates.
verbose=1 - prints training information to the console.
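As a side note, PPO exposes its hyperparameters through the constructor. The sketch below is not part of the tutorial's training run; the values shown are the Stable Baselines3 defaults, which also appear in the training log (learning rate 0.0003, rollouts of 2048 timesteps):
# illustrative only: PPO with its default hyperparameters written out explicitly
tuned_model = PPO(
    'MlpPolicy',
    env,
    learning_rate=3e-4,   # step size for the optimizer
    n_steps=2048,         # timesteps collected per environment before each update
    batch_size=64,        # minibatch size for gradient updates
    gamma=0.99,           # discount factor for future rewards
    verbose=1,
)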
model.learn(total_timesteps=20000)
-----------------------------
| time/ | |
| fps | 352 |
| iterations | 1 |
| time_elapsed | 5 |
| total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/ | |
| fps | 360 |
| iterations | 2 |
| time_elapsed | 11 |
| total_timesteps | 4096 |
| train/ | |
| approx_kl | 0.009253612 |
| clip_fraction | 0.112 |
| clip_range | 0.2 |
| entropy_loss | -0.686 |
| explained_variance | -0.00415 |
| learning_rate | 0.0003 |
| loss | 5.15 |
| n_updates | 10 |
| policy_gradient_loss | -0.0178 |
| value_loss | 50.7 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 369 |
| iterations | 3 |
| time_elapsed | 16 |
| total_timesteps | 6144 |
| train/ | |
| approx_kl | 0.011285827 |
| clip_fraction | 0.071 |
| clip_range | 0.2 |
| entropy_loss | -0.666 |
| explained_variance | 0.059 |
| learning_rate | 0.0003 |
| loss | 13.8 |
| n_updates | 20 |
| policy_gradient_loss | -0.0164 |
| value_loss | 36.5 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 373 |
| iterations | 4 |
| time_elapsed | 21 |
| total_timesteps | 8192 |
| train/ | |
| approx_kl | 0.008387755 |
| clip_fraction | 0.0782 |
| clip_range | 0.2 |
| entropy_loss | -0.631 |
| explained_variance | 0.22 |
| learning_rate | 0.0003 |
| loss | 12.9 |
| n_updates | 30 |
| policy_gradient_loss | -0.0171 |
| value_loss | 52 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 375 |
| iterations | 5 |
| time_elapsed | 27 |
| total_timesteps | 10240 |
| train/ | |
| approx_kl | 0.009724656 |
| clip_fraction | 0.0705 |
| clip_range | 0.2 |
| entropy_loss | -0.605 |
| explained_variance | 0.31 |
| learning_rate | 0.0003 |
| loss | 18.7 |
| n_updates | 40 |
| policy_gradient_loss | -0.0153 |
| value_loss | 58.7 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 377 |
| iterations | 6 |
| time_elapsed | 32 |
| total_timesteps | 12288 |
| train/ | |
| approx_kl | 0.0058702496 |
| clip_fraction | 0.0446 |
| clip_range | 0.2 |
| entropy_loss | -0.591 |
| explained_variance | 0.303 |
| learning_rate | 0.0003 |
| loss | 23.8 |
| n_updates | 50 |
| policy_gradient_loss | -0.00995 |
| value_loss | 71.7 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 378 |
| iterations | 7 |
| time_elapsed | 37 |
| total_timesteps | 14336 |
| train/ | |
| approx_kl | 0.0020627829 |
| clip_fraction | 0.012 |
| clip_range | 0.2 |
| entropy_loss | -0.59 |
| explained_variance | 0.271 |
| learning_rate | 0.0003 |
| loss | 9.21 |
| n_updates | 60 |
| policy_gradient_loss | -0.00347 |
| value_loss | 68.9 |
------------------------------------------
------------------------------------------
| time/ | |
| fps | 379 |
| iterations | 8 |
| time_elapsed | 43 |
| total_timesteps | 16384 |
| train/ | |
| approx_kl | 0.0068374504 |
| clip_fraction | 0.0835 |
| clip_range | 0.2 |
| entropy_loss | -0.571 |
| explained_variance | 0.734 |
| learning_rate | 0.0003 |
| loss | 16.4 |
| n_updates | 70 |
| policy_gradient_loss | -0.00912 |
| value_loss | 44.5 |
------------------------------------------
-----------------------------------------
| time/ | |
| fps | 381 |
| iterations | 9 |
| time_elapsed | 48 |
| total_timesteps | 18432 |
| train/ | |
| approx_kl | 0.005816878 |
| clip_fraction | 0.0488 |
| clip_range | 0.2 |
| entropy_loss | -0.554 |
| explained_variance | 0.449 |
| learning_rate | 0.0003 |
| loss | 28.8 |
| n_updates | 80 |
| policy_gradient_loss | -0.00556 |
| value_loss | 83.1 |
-----------------------------------------
-----------------------------------------
| time/ | |
| fps | 382 |
| iterations | 10 |
| time_elapsed | 53 |
| total_timesteps | 20480 |
| train/ | |
| approx_kl | 0.008002328 |
| clip_fraction | 0.0686 |
| clip_range | 0.2 |
| entropy_loss | -0.569 |
| explained_variance | 0.7 |
| learning_rate | 0.0003 |
| loss | 25.5 |
| n_updates | 90 |
| policy_gradient_loss | -0.00599 |
| value_loss | 64.9 |
-----------------------------------------
# save the model
model.save('ppo model')
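The saved model can later be reloaded without retraining, for example (a minimal sketch using the same file name as above):
# reload the saved model; passing env is optional and only needed for further training or prediction
model = PPO.load('ppo model', env=env)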
Model Testing
evaluate_policy(model, env, n_eval_episodes=10, render=True)
(200.0, 0.0)
evaluate_policy returns the mean reward and its standard deviation over the 10 evaluation episodes. 200.0 is the maximum score in CartPole-v0, so the model balances the pole for the full episode every time.
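For clarity, the same call can be written with the two return values unpacked explicitly (a minimal sketch, without rendering):
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print('Mean reward:', mean_reward, '+/-', std_reward)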
env.close()
Alternatively, we can test the trained model manually by running the episodes ourselves and letting model.predict() choose the actions instead of sampling them at random.
for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False

    while not done:
        env.render()
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward

    print('Episode:', episode, 'Score:', score)
env.close()
Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]
Performance has increased significantly after model training, giving perfect scores.
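As a side note, model.predict() samples from the policy's action distribution by default; Stable Baselines3 also accepts deterministic=True to always pick the highest-probability action, which can make evaluation more stable:
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)  # always choose the most likely action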
Final Thoughts
Processing a large number of episodes can take a lot of time and system resources.
You can try other simulation environments, such as Atari games and other arcade-style games.
You can also create multiple environments and train on them simultaneously, as sketched below.
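Here is a minimal sketch of that multi-environment idea, assuming the make_vec_env helper and SubprocVecEnv class from Stable Baselines3 (not used in the tutorial above):
# illustrative only: train PPO on four CartPole environments running in parallel processes
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == '__main__':  # required for subprocess-based environments on some platforms
    vec_env = make_vec_env('CartPole-v0', n_envs=4, vec_env_cls=SubprocVecEnv)
    model = PPO('MlpPolicy', vec_env, verbose=1)
    model.learn(total_timesteps=20000)
    vec_env.close()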
In this project tutorial, we explored the CartPole balance problem as a reinforcement learning exercise using the OpenAI Gym module and Stable Baselines3, and we obtained very good results after training the model.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm