# SageMaker Reinforcement Learning: From Zero to Hero

Do you remember the days at university studying AI and Reinforcement Learning theories before the last decade’s AI boom? **YES! :-)**

I’m pretty sure that you were also exposed to some toy examples, such as finding your way through a maze or playing the Atari Pong game. Then you told yourself, “Amazing! I can solve real problems now!” So you turned on your laptop, only to realize that some important pieces were missing:

- Are there simulators/environments (such as OpenAI Gym) that mimic a real-world scenario?
- Are there mature RL libraries out there to develop RL algorithms and test them on an environment?
- Do these RL libraries (if they exist) make it easy to apply RL theory in practice?
- Can I easily train my own RL algorithms somewhere other than my laptop?

Machine Learning and Deep Learning have made such an impact these days because similar questions have quick and easy answers. In the case of Deep Learning:

- **Where can I find a realistic dataset?** Datasets are the equivalent of environments. There are plenty of open, realistic datasets on Kaggle and other sources. For example, https://www.kaggle.com/c/quora-insincere-questions-classification
- **Are there mature libraries out there to train DL algorithms?** YES! TensorFlow, Keras, PyTorch, MXNet, Chainer, etc.
- **Do these DL libraries (if they exist) make it easy to apply DL theory in practice?** Of course. You need only a few lines of code for a Convolutional Neural Network in Keras, as depicted in Figure 1.
- **Can I easily train my own DL algorithms somewhere other than my laptop?** AWS SageMaker is here to the rescue! It’s a mature and easy-to-use ML infrastructure!

You heard that in 2013 DeepMind published a famous paper, Playing Atari with Deep Reinforcement Learning, in which they introduced a new algorithm called Deep Q Network (DQN). You even watched a clip that shows DQN in action! Then you asked yourself, “Is there any similarity between DQN and the Q-Learning that I was taught at university?” You read the paper and discovered that there is. More specifically, DQN uses a neural network to approximate the action-value function *Q(s, a)*, the expected discounted future reward of taking action *a* in state *s*.
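
To make the link to university-style Q-Learning concrete, here is a minimal sketch of a single tabular Q-Learning update. It is not taken from the paper: the function name `q_learning_update`, the learning rate `ALPHA`, and the toy states `"s0"` and `"s1"` are assumptions made for illustration. DQN keeps the same target, *r + γ · max Q(s', a')*, but replaces the lookup table with a neural network.

```python
from collections import defaultdict

GAMMA = 0.95  # discount rate
ALPHA = 0.1   # learning rate (assumed value, for illustration only)


def q_learning_update(q_table, state, action, reward, next_state, done, n_actions):
    """Apply one tabular Q-Learning update and return the new Q(s, a)."""
    # A terminal transition has no future reward to bootstrap from.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + GAMMA * best_next
    # Move Q(s, a) a small step towards the target.
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Toy transition: every Q value starts at 0.0.
q_table = defaultdict(float)
new_value = q_learning_update(
    q_table, state="s0", action=1, reward=1.0, next_state="s1", done=False, n_actions=2
)
print(new_value)
```

With an empty table, the target is just the reward, so Q("s0", 1) moves one tenth of the way from 0 to 1.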

Here is an implementation of DQN in Keras:

    import random
    from collections import deque

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam


    class DQNAgent:
        def __init__(self, state_size, action_size):
            self.state_size = state_size
            self.action_size = action_size
            self.memory = deque(maxlen=2000)  # replay buffer
            self.gamma = 0.95  # discount rate
            self.epsilon = 1.0  # exploration rate
            self.epsilon_min = 0.01
            self.epsilon_decay = 0.995
            self.learning_rate = 0.001
            self.model = self._build_model()

        def _build_model(self):
            # Neural Net for Deep-Q learning Model
            model = Sequential()
            model.add(Dense(24, input_dim=self.state_size, activation='relu'))
            model.add(Dense(24, activation='relu'))
            model.add(Dense(self.action_size, activation='linear'))
            # Older Keras releases spell this argument `lr`.
            model.compile(loss='mse',
                          optimizer=Adam(learning_rate=self.learning_rate))
            return model

        def remember(self, state, action, reward, next_state, done):
            # Store one transition in the replay buffer
            self.memory.append((state, action, reward, next_state, done))

        def act(self, state):
            # Epsilon-greedy: explore with probability epsilon
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            act_values = self.model.predict(state)
            return np.argmax(act_values[0])  # returns action

        def replay(self, batch_size):
            minibatch = random.sample(self.memory, batch_size)
            states, targets_f = [], []
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target = (reward + self.gamma *
                              np.amax(self.model.predict(next_state)[0]))
                target_f = self.model.predict(state)
                target_f[0][action] = target
                # Filtering out states and targets for training
                states.append(state[0])
                targets_f.append(target_f[0])
            history = self.model.fit(np.array(states), np.array(targets_f),
                                     epochs=1, verbose=0)
            # Keeping track of loss
            loss = history.history['loss'][0]
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay
            return loss
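
The exploration settings in the agent are easier to judge with a number attached. The sketch below counts how many `replay` calls it takes for `epsilon` to fall from 1.0 to the `epsilon_min` floor when it is multiplied by `epsilon_decay` each time. The helper name `decays_until_floor` is hypothetical; the default values are the ones from the agent's constructor.

```python
def decays_until_floor(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count the decay steps applied before epsilon reaches epsilon_min or below."""
    steps = 0
    while epsilon > epsilon_min:  # same guard as in DQNAgent.replay
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = decays_until_floor()
print(steps)  # 919
```

So with these values the agent keeps exploring noticeably for roughly the first nine hundred training steps, and the last decay leaves `epsilon` slightly below `epsilon_min` rather than exactly on it.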

Now you’re familiar with the theory and implementation behind DQN, and you want to try it out on a different problem, the CartPole game. The goal of CartPole is to balance a pole connected with one joint on top of a moving cart. Instead of pixel information, the state provides 4 values, such as the angle of the pole and the position of the cart. An agent moves the cart by choosing action 0 or 1 at each step, pushing it left or right.
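
As a rough sketch of what that state looks like in code, the snippet below names the four CartPole values and the two actions, and reshapes a flat observation into the `(1, state_size)` batch that the Keras agent above expects (its `replay` method reads `state[0]`). The constant names, the helper `to_agent_state`, and the sample numbers are assumptions for illustration; no simulator is involved.

```python
import numpy as np

# Layout of a CartPole observation in OpenAI Gym (index -> meaning).
STATE_FIELDS = ("cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity")
ACTIONS = {0: "push cart left", 1: "push cart right"}


def to_agent_state(observation):
    """Reshape a flat observation into a (1, state_size) batch for the network."""
    return np.reshape(np.asarray(observation, dtype=np.float32), (1, len(STATE_FIELDS)))


# Made-up observation, only to show the shapes involved.
example_observation = [0.02, -0.01, 0.03, 0.04]
state = to_agent_state(example_observation)
print(state.shape)  # (1, 4)
```

With this layout, the agent would be created with `state_size=4` and `action_size=2`.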

**You just realized that you need to find an environment that simulates the CartPole problem!** **OpenAI Gym** offers you a collection of already implemented environments (such as CartPole) and the ability to create your own custom one.

Then, you need a framework that trains RL algorithms in various environments. Coach is a Python framework that lets you choose an environment (for example, one from OpenAI Gym) and a state-of-the-art RL algorithm implementation in order to compare the performance of various algorithms in a reliable and reproducible way. It uses Deep Learning libraries (such as TensorFlow and MXNet) to define RL algorithms. Additionally, it gives you the ability to implement your own RL algorithm without worrying about low-level details.

Finally, you need an RL cloud infrastructure to run all of the above without worrying about computational resources.

**You’re lucky! There is an example of solving the CartPole problem on SageMaker!**

For the sake of brevity, I will only highlight the most important code snippets. You’ll find the complete example, which uses Coach and MXNet, on GitHub.

SageMaker RL, which is one great feature of AWS SageMaker, puts everything together. **Hence, it bridges theory and practice**!

You need to define the RL algorithm (*DQNAgentParameters()*) and the preset of the algorithm in the Coach graph (Python script *src/preset-cartpole-dqn.py*). Easy, right? You don’t have to implement DQN yourself. It’s already implemented for you.

The Python script *src/train-coach.py* contains the code that kicks off and manages the training. It also overrides some of the parameters.

Then, you use the SageMaker Python SDK to define an *RLEstimator* that handles the training. It takes the following:

- The entry point script
- Any dependency Python files
- The toolkit (i.e. environment)
- The Deep Learning framework (i.e. TensorFlow or MXNet)
- EC2 instance type
- …
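
The list above could map onto the SageMaker Python SDK roughly as follows. This is a sketch under assumptions, not the code from the example notebook: the helper names `estimator_settings` and `build_estimator`, the dependency path, the toolkit version, the instance type, the job name, and the role ARN are placeholders. The argument names `instance_type` and `instance_count` follow version 2 of the SDK; version 1 called them `train_instance_type` and `train_instance_count`.

```python
def estimator_settings(role_arn):
    """Collect the RLEstimator arguments from the list above in one place."""
    return {
        "entry_point": "train-coach.py",          # the entry point script
        "source_dir": "src",                      # folder that contains it
        "dependencies": ["common/sagemaker_rl"],  # assumed path to helper files
        "toolkit_version": "0.11.0",              # assumed Coach version
        "role": role_arn,                         # IAM role used by the job
        "instance_type": "ml.m4.xlarge",          # assumed EC2 instance type
        "instance_count": 1,
        "base_job_name": "rl-cartpole-dqn",       # hypothetical job name
    }


def build_estimator(role_arn):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.rl import RLEstimator, RLFramework, RLToolkit

    return RLEstimator(
        toolkit=RLToolkit.COACH,      # the toolkit
        framework=RLFramework.MXNET,  # the Deep Learning framework
        **estimator_settings(role_arn),
    )


# Placeholder role ARN, only to show the resulting settings.
settings = estimator_settings("arn:aws:iam::123456789012:role/ExampleRole")
print(settings["entry_point"])
```

With valid credentials, calling `build_estimator(role_arn).fit()` would start the training job.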

Figure 4 shows the training reward increasing as more and more episodes are played.

Finally, it’s really cool that you can deploy the trained RL model as a REST Endpoint!

And don’t forget to delete the endpoint and save some money :-)

**Next time, I’ll show you how to create a custom environment for the Atari Pong game!**