Double Deep Q-Networks (DDQN) - A Quick Intro (with Code)

In the previous post, we discussed how Deep Q-Networks (DQN) have proven to be a powerful tool for solving RL problems. Over the years, several modifications to the original algorithm have further improved its performance. In this article, I’ll discuss one of these modifications, known as Double Deep Q-Networks (DDQN).

Overestimation in Q-learning

One of the challenges in Q-learning is that the Q function is updated based on estimates of future rewards rather than the true rewards. Because the update takes a maximum over these noisy estimates, the resulting Q values tend to be biased upwards, especially early in training when the estimates are inaccurate. In other words, the agent may think it will get a higher reward than it actually would, leading to wrong decisions.
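
To see where this bias comes from, consider a small illustrative example (my own, not taken from any paper): suppose the true value of every action is zero, but our estimates of those values are noisy. Taking the maximum of the noisy estimates is, on average, larger than the true maximum:

import numpy as np

rng = np.random.default_rng(0)
true_values = np.zeros(4)      # every action is truly worth 0
n_trials = 10_000

max_of_estimates = []
for _ in range(n_trials):
    # Unbiased but noisy estimates of each action's value
    estimates = true_values + rng.normal(0.0, 1.0, size=4)
    max_of_estimates.append(estimates.max())

# The average of the maxima comes out around 1.03, even though the true maximum is 0
print(np.mean(max_of_estimates))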

Various techniques have been developed to address the problem of overestimation in Q-learning. One such technique is Double DQN.

Introduction to Double DQN

Value estimates of DDQN vs DQN

Double DQN is a variant of the deep Q-network (DQN) algorithm that addresses the problem of overestimation in Q-learning. It was introduced in 2015 by Hado van Hasselt et al. in their paper “Deep Reinforcement Learning with Double Q-Learning”.

In traditional DQN, the Q function is updated using the Bellman equation, which involves taking the maximum estimated future reward over the actions in the next state. As described above, taking a maximum over estimated values leads to overestimation of the Q values. DDQN addresses this problem by decoupling action selection from action evaluation in the Q-learning update.

Specifically, in DDQN the action with the maximum Q value is selected using the online network (the “local” network in the implementation below), and the Q value of that action is then evaluated using a separate, slowly updated copy of it (the “target network”).
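
The difference is easiest to see in code. Below is a minimal sketch (separate from the full implementation later in this post) comparing the two target computations; it assumes q_online and q_target are two copies of the same PyTorch Q-network, next_states, rewards, and dones are batched tensors, and gamma is the discount factor:

# Standard DQN target: the target network both selects and evaluates
# the next action, so overestimated values feed straight into the target.
q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
dqn_target = rewards + gamma * q_next * (1 - dones)

# Double DQN target: the online network selects the action,
# the target network evaluates it, which reduces the upward bias.
best_actions = q_online(next_states).argmax(dim=1, keepdim=True)
q_next = q_target(next_states).gather(1, best_actions)
ddqn_target = rewards + gamma * q_next * (1 - dones)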

Performance of DDQN

Performance of DDQN vs DQN

Double DQN has been evaluated on a variety of reinforcement learning tasks and has been shown to improve performance on many of them. In the original paper, the authors demonstrated improved performance on the Atari 2600 game suite compared to traditional DQN.

Other studies have also found that DDQN can improve performance on tasks such as playing the card game Hanabi, controlling a simulated robot arm, and navigating simulated 3D environments. In general, DDQN has been found to be particularly effective at reducing the overestimation of Q values and improving the stability of learning.

However, Double DQN is not a panacea and does not always lead to improved performance. In fact, some studies have found that DDQN can perform worse than traditional DQN on certain tasks, such as playing the game of Go. It is therefore important to evaluate DDQN on each specific task to determine whether it is likely to be beneficial. A comprehensive list of results reported by the authors of the DDQN paper is available on page 10 of the paper.

In summary, DDQN has been shown to improve performance on many reinforcement learning tasks, but it is not always the best choice and its effectiveness can vary depending on the specific task.

Implementation of DDQN

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import random


# Define the network architecture
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x


# Define the replay buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.index = 0

    def push(self, state, action, reward, next_state, done):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.index] = (state, action, reward, next_state, done)
        self.index = (self.index + 1) % self.capacity

    def sample(self, batch_size):
        batch = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for i in batch:
            state, action, reward, next_state, done = self.buffer[i]
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            next_states.append(next_state)
            dones.append(done)
        return (
            torch.tensor(np.array(states)).float(),
            torch.tensor(np.array(actions)).long(),
            torch.tensor(np.array(rewards)).unsqueeze(1).float(),
            torch.tensor(np.array(next_states)).float(),
            torch.tensor(np.array(dones)).unsqueeze(1).int()
        )

    def __len__(self):
        return len(self.buffer)


# Define the Double DQN agent
class DDQNAgent:
    def __init__(self, state_size, action_size, seed, learning_rate=1e-3, capacity=1000000,
                 discount_factor=0.99, tau=1e-3, update_every=4, batch_size=64):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = seed
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.tau = tau
        self.update_every = update_every
        self.batch_size = batch_size
        self.steps = 0
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.qnetwork_local = QNetwork(state_size, action_size).to(self.device)
        self.qnetwork_target = QNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=learning_rate)
        self.replay_buffer = ReplayBuffer(capacity)
        # Initialize the target network with the same weights as the local network
        self.update_target_network()

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay buffer
        self.replay_buffer.push(state, action, reward, next_state, done)

        # Learn every update_every steps
        self.steps += 1
        if self.steps % self.update_every == 0:
            if len(self.replay_buffer) > self.batch_size:
                experiences = self.replay_buffer.sample(self.batch_size)
                self.learn(experiences)

    def act(self, state, eps=0.0):
        state = torch.from_numpy(np.asarray(state, dtype=np.float32)).unsqueeze(0).to(self.device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return int(np.argmax(action_values.cpu().numpy()))
        else:
            return random.randrange(self.action_size)

    def learn(self, experiences):
        # Move the sampled batch to the same device as the networks
        states, actions, rewards, next_states, dones = [t.to(self.device) for t in experiences]

        # Double DQN target: the local (online) network selects the best next action...
        next_actions = self.qnetwork_local(next_states).detach().argmax(1).unsqueeze(1)
        # ...and the target network evaluates that action
        Q_targets_next = self.qnetwork_target(next_states).detach().gather(1, next_actions)
        # Compute Q targets for current states
        Q_targets = rewards + self.discount_factor * Q_targets_next * (1 - dones)

        # Get expected Q values from local model
        Q_expected = self.qnetwork_local(states).gather(1, actions.view(-1, 1))

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Soft-update the target network towards the local network
        self.soft_update(self.qnetwork_local, self.qnetwork_target)

    def update_target_network(self):
        # Hard update: copy the local network's weights into the target network
        self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

    def soft_update(self, local_model, target_model):
        # Soft (Polyak) update: theta_target <- tau * theta_local + (1 - tau) * theta_target
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)

Explanation

  • QNetwork: a PyTorch module that defines the architecture of the Q-network. It takes in a state and outputs the action values for all actions.

  • ReplayBuffer: a class that stores experiences in a circular buffer and samples a batch of experiences randomly for learning.

  • DDQNAgent: the main class that implements the Double DQN algorithm. It has the following methods:

    • __init__: initializes the local and target Q-networks, the optimizer, the replay buffer, and some hyperparameters. It also calls update_target_network to initialize the target network with the same weights as the local network.

    • step: stores an experience in the replay buffer and learns from a batch of experiences every update_every steps.

    • act: selects an action using an epsilon-greedy policy based on the action values output by the local Q-network.

    • learn: performs a learning step using a batch of experiences. It selects the best next-state action with the local Q-network, evaluates that action with the target Q-network to form the Q-targets, and minimizes the loss between these Q-targets and the expected Q-values from the local Q-network. Finally, it soft-updates the target Q-network using Polyak averaging.

    • update_target_network: copies the local Q-network’s weights into the target Q-network (a hard update, used here for initialization).

    • soft_update: a helper function that slowly moves the target Q-network towards the local Q-network using Polyak averaging.

To use this Double DQN implementation, you can create an instance of the DDQNAgent class and call its act and step methods in your training loop. You can also customize the hyperparameters and the network architecture by modifying the __init__ method of the DDQNAgent class.
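
For example, a minimal sketch of creating an agent with custom hyperparameters might look like the following (the state/action sizes and hyperparameter values are arbitrary, chosen purely for illustration):

import numpy as np
from ddqn import DDQNAgent

# Hypothetical problem with an 8-dimensional state and 4 discrete actions
agent = DDQNAgent(state_size=8, action_size=4, seed=42,
                  learning_rate=5e-4, batch_size=128, tau=1e-2)

state = np.zeros(8, dtype=np.float32)                # placeholder state
action = agent.act(state, eps=0.1)                   # epsilon-greedy action
next_state, reward, done = np.zeros(8, dtype=np.float32), 0.0, False
agent.step(state, action, reward, next_state, done)  # store experience and (maybe) learn

In practice, state, reward, next_state, and done would of course come from your environment, as in the training loop below.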

Training DDQN

Cart Pole task from OpenAI Gym

Assuming we’ve saved the previous code snippet in a file called ddqn.py, the following snippet can be used to import the agent and train it on the CartPole task:

import gym
import numpy as np
from ddqn import DDQNAgent
import matplotlib.pyplot as plt

# Create the environment
env = gym.make('CartPole-v0')

# Get the state and action sizes
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Set the random seed
seed = 0

# Create the DDQN agent
agent = DDQNAgent(state_size, action_size, seed)

# Set the number of episodes and the maximum number of steps per episode
num_episodes = 1000
max_steps = 1000

# Set the exploration rate
eps = eps_start = 1.0
eps_end = 0.01
eps_decay = 0.995

# Set the rewards and scores lists
rewards = []
scores = []

# Run the training loop
for i_episode in range(num_episodes):
    print(f'Episode: {i_episode}')
    # Initialize the environment and the state
    state = env.reset()[0]
    score = 0
    # Decay the exploration rate
    eps = max(eps_end, eps_decay * eps)
    
    # Run the episode
    for t in range(max_steps):
        # Select an action and take a step in the environment
        action = agent.act(state, eps)
        next_state, reward, done, trunc, _ = env.step(action)
        # Store the experience in the replay buffer and learn from it
        agent.step(state, action, reward, next_state, done)
        # Update the state and the score
        state = next_state
        score += reward
        # Break the loop if the episode is done or truncated
        if done or trunc:
            break
        
    print(f"\tScore: {score}, Epsilon: {eps}")
    # Save the rewards and scores
    rewards.append(score)
    scores.append(np.mean(rewards[-100:]))

# Close the environment
env.close()

plt.ylabel("Score")
plt.xlabel("Episode")
plt.plot(range(len(rewards)), rewards)
plt.plot(range(len(rewards)), scores)
plt.legend(["Episode reward", "Average of last 100 episodes"])
plt.show()

The progress of training may look like the following:

DDQN training

Final Thoughts

That’s it! I hope you found this helpful. If you have any questions, let me know in the comments below. Also, feel free to check out my other posts on Reinforcement Learning here.