Q-learning: A Quick Introduction (with Code)
Q-learning is a classic reinforcement learning (RL) algorithm that has grown rapidly in popularity over the years. In this article, I hope to give you a quick introduction to Q-learning by covering the following points:
- Reinforcement Learning
- The Bellman Equation
- Q-learning and How It Works
- The Q-learning Algorithm
- Limitations of Q-learning
- Implementation
- Final Thoughts
Reinforcement Learning
[Figure: the reinforcement learning flow]
RL is a type of machine learning in which an agent is trained to make a sequence of decisions in an environment in order to maximize a reward. The agent receives feedback in the form of rewards or punishments for its actions, and it uses this feedback to learn which actions are most likely to lead to the greatest reward.
Feel free to check out this great video for a more thorough introduction to reinforcement learning.
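To make this loop concrete, here is a minimal sketch of the agent-environment interaction using the Gym API (the same FrozenLake-v1 environment used later in this post; the agent here just acts randomly, with no learning yet):

import gym

# Create the environment and get the initial state
env = gym.make('FrozenLake-v1', is_slippery=False)
state = env.reset()[0]
done = False
while not done:
    # The agent acts (randomly, for illustration) and the environment responds
    action = env.action_space.sample()
    state, reward, done, _, _ = env.step(action)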
The Bellman Equation
The Bellman equation is a fundamental concept in reinforcement learning (RL) and is at the heart of the Q-learning algorithm. It is used to determine the optimal action to take in a given state.
It states that the optimal action-value function at a given state equals the maximum expected return achievable starting from that state. In other words, the optimal action-value is the sum of the following:

- The immediate reward received for taking a particular action
- The expected return of the best next action, discounted by a factor of gamma (γ). The discount factor represents the importance of future rewards: γ = 0 means only immediate rewards matter, while γ = 1 means all future rewards are equally important.
The Bellman equation can be expressed mathematically as follows:
$$q_*(s, a) = \mathbb{E}\big[\, r + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big]$$

where

- q*(s, a) is the optimal action-value function
- E denotes an expectation, since the reward and the next state are not known with certainty
- r is the reward received for taking action a in state s
- γ is the discount factor
- S_{t+1} is the next state
- a' is the next action
The Bellman equation is used in Q-learning to update the action-value function based on the reward received for the current action and the expected return of the next action. By maximizing the action-value function, the Q-learning algorithm can learn the actions that are most likely to lead to the highest cumulative reward.
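As a concrete sketch, a single Q-learning update derived from this equation looks like the following (the state, action, and reward here are made-up values, purely for illustration):

import numpy as np

alpha, gamma = 0.1, 0.6     # learning rate and discount factor (same values as the code below)
Q = np.zeros((16, 4))       # a Q-table for 16 states and 4 actions

state, action = 0, 2        # hypothetical current state and chosen action
reward, new_state = 0.0, 1  # hypothetical feedback from the environment

# Move Q(s, a) toward the Bellman target: r + γ · max over a' of Q(s', a')
td_target = reward + gamma * np.max(Q[new_state, :])
Q[state, action] += alpha * (td_target - Q[state, action])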
Q-learning and How It Works
Q-learning is a type of reinforcement learning algorithm that is used to find the optimal policy for an agent to follow in a given environment. It does this by using a table called the “Q-table” to store the expected reward for taking a specific action in a specific state.
The Q-table is initially filled with zeros. As the agent interacts with the environment, it updates the values in the Q-table based on the rewards it receives. The agent follows an exploration-exploitation tradeoff, meaning it will try out new actions to see if they lead to a higher reward, while also relying on the values in the Q-table to guide its actions toward the most promising options.
The Exploration-Exploitation Tradeoff
The exploration-exploitation tradeoff refers to the balance between exploring new options and exploiting known good options in order to maximize reward. An agent faces this tradeoff when deciding which actions to take in a given environment. If the agent always chooses the action with the highest expected reward, it may miss out on the potential for even higher rewards from unexplored options. On the other hand, if the agent always explores new options, it may not make the most efficient use of its time and resources. The optimal balance between exploration and exploitation depends on the specific context and the agent’s goals.
To balance exploration and exploitation, Q-learning uses an exploration policy, such as an ε-greedy policy. An ε-greedy policy allows the agent to choose a random action with a small probability (ε) and the action with the highest expected return with a high probability (1 - ε). Accordingly, the agent tries out new actions and explores the environment while also exploiting its current knowledge to maximize reward.
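Here is a minimal sketch of ε-greedy action selection (the helper name and the default ε value are illustrative; the implementation later in this post uses a noise-based variant instead):

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # Explore: with probability ε, pick a random action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # Exploit: otherwise, pick the action with the highest expected return
    return int(np.argmax(Q[state, :]))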
The Q-learning Algorithm
1. The agent selects an action based on either of the following:
   - the current state and the values in the Q-table, or
   - a randomly chosen action.
2. The agent takes the action, and the environment responds with a reward and the next state.
3. The agent updates the Q-value for the taken action based on the reward and the expected future rewards. This is done according to the Bellman equation that we discussed earlier.
4. The process repeats from step 1.
As the agent continues to interact with the environment, the values in the Q-table become more accurate and the agent’s policy moves closer to the optimal one.
Limitations of Q-learning
Q-learning is a powerful algorithm for reinforcement learning (RL), but it has its limitations. Here are some of the key challenges of using Q-learning:

- Data requirements: Q-learning requires a large amount of data to learn an accurate action-value function. This can be a challenge in real-world environments where it is difficult to collect enough data.
- Convergence issues: Q-learning can have difficulty converging to the optimal action-value function, especially in environments with a large state space or a continuous action space. This can lead to suboptimal performance or even divergence of the algorithm.
- Function approximation: In environments with a large state space, it can be impractical to learn an action-value function for every possible state. In such cases, the Q-table must be replaced with a function approximator, such as a neural network, which introduces its own accuracy and stability challenges.
- Online learning: Q-learning is an online learning algorithm, which means that it updates the action-value function as it interacts with the environment. This can be a challenge in environments where the agent may need to wait a long time to receive a reward and update its action-value function.
- Stochastic environments: In stochastic environments, the state transitions and reward distributions are not deterministic. This can make it difficult for Q-learning to learn the optimal action-value function.
Implementation
This implementation of Q-learning uses the OpenAI Gym library to create a simple environment called “FrozenLake-v1”, which consists of a 4x4 grid of blocks. The goal of the agent is to navigate from the start block to the goal block by taking actions to move left, right, up, or down. At each step, the environment provides the agent with an observation, a reward, and a flag indicating whether the episode is done.
[Figure: the Frozen Lake environment from OpenAI Gym]
The Code
import gym  # import the OpenAI Gym library
import numpy as np  # import the NumPy library

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=False)

# Set parameters
alpha = 0.1  # learning rate
gamma = 0.6  # discount factor

# Set the action and state space sizes
n_actions = env.action_space.n
n_states = env.observation_space.n

# Initialize the Q-table with zeros
Q = np.zeros((n_states, n_actions))

# Set the number of episodes
num_episodes = 10000

# Create a list to store rewards
r_list = []

# Loop through episodes
for i in range(num_episodes):
    # Reset the environment and get the initial state
    state = env.reset()[0]
    r_all = 0
    done = False
    # The Q-table learning algorithm
    while not done:
        # Choose an action by greedily (with noise) picking from the Q-table
        action = np.argmax(Q[state, :] + np.random.randn(1, n_actions) / (i + 1))
        # Get the new state and reward from the environment
        new_state, reward, done, _, _ = env.step(action)
        # Update the Q-table with new knowledge
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])
        # Update the total reward
        r_all += reward
        # Set the new state
        state = new_state
    # Append the total reward for this episode to the reward list
    r_list.append(r_all)

print("Score over time: " + str(sum(r_list) / num_episodes))
print("Final Q-Table Values")
print(Q)
Quick Explanation
The Q-learning algorithm is implemented using a loop that runs for a specified number of episodes. In each episode, the environment is reset and the agent chooses actions using the Q-table and a policy equivalent to epsilon-greedy.
Essentially, we add random noise to each action’s value during selection, and we scale that noise down as training progresses. Therefore, over time, the algorithm becomes more and more deterministic, similar to epsilon-greedy.
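A quick illustrative check of this claim: printing the magnitude of the noise term at different episode numbers shows it shrinking toward zero, so the argmax is increasingly decided by the Q-values alone.

import numpy as np

n_actions = 4
for episode in [0, 9, 99, 999]:
    noise = np.random.randn(1, n_actions) / (episode + 1)
    print(episode + 1, np.abs(noise).max())  # the noise shrinks as episodes pass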
Simulating the Policy
You can try running the learned Q-table using the following code. Paste it below the previous code to view the result at the end of training:
# Simulate the learned policy
policy = {}
for i, act in enumerate(np.argmax(Q, axis=1)):
    policy[i] = act
print("Policy:", policy)

path = []
dirs = ['left', 'down', 'right', 'up']
state = env.reset()[0]
r_all = 0
done = False
while not done:
    # Choose the greedy action from the Q-table (no exploration noise here)
    action = np.argmax(Q[state, :])
    path.append(dirs[action])
    # Get the new state and reward from the environment
    new_state, reward, done, _, _ = env.step(action)
    # Update the total reward
    r_all += reward
    # Set the new state
    state = new_state

print("Path:", path)
print("Reward:", reward)
Final Thoughts
That’s it! I hope you found this post on Q-learning helpful. Q-learning is the basis for many powerful reinforcement learning algorithms, so it’s good to have a solid grasp of how it works. We’ll cover more RL algorithms in upcoming posts. In the meantime, feel free to check out some of my other RL-related posts here.